LLVM IR is an implementation of the static single assignment form, an intermediate representation where each variable is assigned exactly once.
In SSA complex expressions are broken down into a series of
assignments to variables. Consider
z = x + y;
for example. $0-$2
mean
temporary variables.
$0 = x
$1 = y
$2 = $0 + $1
z = $2
More than one assignment to the same variable is forbidden.
One way to model snippets such as
x = 0; x = 1;
is to only assign the value of
the last assignment to x
.
$0 = 0
$1 = $0
$2 = 1
x = $2
From this representation, it is visible that
$1
, the result of x = 0;
is not
used on the right hands side of any other assignments, which
tells the compiler that $1
and all the other
assignments needed to compute $1
can be
optimized away from the generated code.
There is also a special case when a variable has a different value depending on which execution path is taken.
if condition {
x = 0;
} else {
x = 1;
}
y = x;
At the point of y = x;
the value of
x
is 0
if the condition is true
and 1
if it is false. To express this in SSA, a
special phi
function is inserted.
┌───────────┐
│ condition │
└───────────┘
┌──────┴──────┐
┌────────┐ ┌────────┐
│ $0 = 0 │ │ $1 = 1 │
└────────┘ └────────┘
└──────┬──────┘
┌─────────────────┐
│ x = phi($0, $1) │
│ y = x │
└─────────────────┘
x = phi($0, $1)
means that the value of
x
is either $0
or
$1
depending on which branch is executed.
A phi
function can have any number of values,
if there are 3 branches with a different assignment in each,
the resulting function might look like
x = phi($0, $1, $2)
.
The Codegen
class encapsulates the logic for
generating LLVM IR using the LLVM
IRBuilder
API. First, an
LLVMContext
is created, which owns and manages
the data needed by the IRBuilder
and various
other LLVM APIs.
The generated IR is stored inside a
Module
, while the instructions are inserted
into the module by the IRBuilder
itself.
class Codegen {
llvm::LLVMContext context;
llvm::IRBuilder<> builder;
llvm::Module module;
};
The Codegen
class also stores the resolved tree
and takes the path of the source file in its constructor.
class Codegen {
std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree;
...
public:
Codegen(std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree,
std::string_view sourcePath);
llvm::Module *generateIR();
};
The stored Module
must be given a name upon
instantiation, so it's going to be called
"<translation_unit>"
. The path of the source
file is also inserted into the Module
as a
piece of additional information.
Codegen::Codegen(
std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree,
std::string_view sourcePath)
: resolvedTree(std::move(resolvedTree)),
builder(context),
module("<translation_unit>", context) {
module.setSourceFileName(sourcePath);
...
}
To help the LLVM backend determine how to generate the native executable, it is recommended to set the target triple of the module. This triple contains information about the target machine such as its CPU architecture and operating system.
Since the goal is to generate an executable native to the
machine the compiler runs on, the target triple is set to
the triple of the current platform, which can be retrieved
with the
llvm::sys::getDefaultTargetTriple()
helper.
Codegen::Codegen(...) ... {
...
module.setTargetTriple(llvm::sys::getDefaultTargetTriple());
}
The generation of the IR, similarly to the resolution happens in two passes. First, the function declarations are generated without their bodies, so when the bodies are generated in the next pass the module already knows about all of the functions, which makes them easier to call.
llvm::Module *Codegen::generateIR() {
for (auto &&function : resolvedTree)
generateFunctionDecl(*function);
for (auto &&function : resolvedTree)
generateFunctionBody(*function);
...
}
When both passes finish, a pointer to the module is returned
to the caller. Note that the owner of the
Module
cannot be changed, as the module also
depends on the LLVMContext
for information, so
the context and the module must be kept together.
llvm::Module *Codegen::generateIR() {
...
return &module;
}
To be able to generate accurate IR, the types of the
language need to be converted to types recognized by LLVM.
The
IRBuilder
interface provides various builtin
types including double
, which can be used for
the Number
type and void
, which
can be used for the corresponding type of the language.
llvm::Type *Codegen::generateType(Type type) {
if (type.kind == Type::Kind::Number)
return builder.getDoubleTy();
return builder.getVoidTy();
}
To generate a function, LLVM must know its type first. A function's type is composed of its return type and the list of types the function takes as parameters.
void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
auto *retType = generateType(functionDecl.type);
std::vector<llvm::Type *> paramTypes;
for (auto &¶m : functionDecl.params)
paramTypes.emplace_back(generateType(param->type));
...
}
With the return type and the type of the parameters known, a
FunctionType
can be retrieved, which is later
used to create the actual function. The last parameter
passed to FunctionType::get()
is a flag that
indicates if the function is variadic or not.
A variadic function is a kind of function that takes an
unknown amount of parameters, but since such functions are
not supported by the language, this flag is always set to
false
.
void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
...
auto *type = llvm::FunctionType::get(retType, paramTypes, false);
...
}
Besides the type of the function, LLVM also needs to know its linkage. Functions that are visible outside the module have external linkage, while functions visible only inside the module have internal linkage. Since this compiler only works with one module the linkage of a function has no visible effect. For simplicity, however, every function is created with external linkage.
With the type, the linkage and its identifier known, the function can be created inside the module. Since the module stores the function, it can be retrieved on demand at any time.
void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
...
llvm::Function::Create(type, llvm::Function::ExternalLinkage,
functionDecl.identifier, module);
}
In generateFunctionBody()
first, the function
is retrieved from the module.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
auto *function = module->getFunction(functionDecl.identifier);
...
}
In LLVM IR a set of instructions that execute together is
placed inside a basic block. The basic block representing
the body of the function is called
entry
.
After creating the
entry
block of the function, it's set as the
insertion point of the builder, which means that every
instruction the builder generates is now placed inside this
block.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
auto *entryBB = llvm::BasicBlock::Create(context, "entry", function);
builder.SetInsertPoint(entryBB);
...
}
Parameter declarations are not known by the module, unlike the functions they belong to. To be able to reference them throughout the IR generation, their declarations are mapped to their values in the IR.
class Codegen {
...
std::map<const ResolvedDecl *, llvm::Value *> declarations;
...
};
For each of the parameters, its name is set first.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
int idx = 0;
for (auto &&arg : function->args()) {
const auto *paramDecl = functionDecl.params[idx].get();
arg.setName(paramDecl->identifier);
...
++idx;
}
...
}
Then a variable on the stack is allocated for the parameter
and its value is stored inside this variable. The
ResolvedDecl
is also mapped to this variable on
the stack.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
for (auto &&arg : function->args()) {
...
llvm::Value *var = allocateStackVariable(function, paramDecl->identifier);
builder.CreateStore(&arg, var);
declarations[paramDecl] = var;
...
}
...
}
After the parameters are handled, the function body can be generated.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
generateBlock(*functionDecl.body);
}
Allocating variables on the stack happens inside
allocateStackVariable()
. The variables are
inserted at the beginning of the entry
block,
so they are visible to every instruction inside that block.
By default, the IRBuilder
can only insert
instructions before another instruction or at the end of the
block.
The problem with inserting at the end of the block is that the variables always appear after they are used, which is an error. A different approach might be to always insert before the first instruction, but in that case, it's always the latest variable appearing first, so the order is reversed.
The idea is to insert a placeholder instruction at the
beginning of the entry
block and always insert
the stack variables before this placeholder.
class Codegen {
...
llvm::Instruction *allocaInsertPoint;
...
};
This way the latest variable is inserted before the placeholder and directly after the previous variable.
entry:
entry: %var1
entry: %var1 %var2
%placeholder %placeholder %placeholder
The placeholder is inserted in
generateFunctionBody()
right after setting the
insertion point to the entry block. Any instruction without
side effects is sufficient, but to stick to what is seen in
clang, the bit casting of an undefined value is inserted.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
builder.SetInsertPoint(entryBB);
llvm::Value *undef = llvm::UndefValue::get(builder.getInt32Ty());
allocaInsertPoint = new llvm::BitCastInst(undef, undef->getType(),
"alloca.placeholder", entryBB);
...
}
Since this placeholder is no longer needed once the generation of the function body is done, it is removed at the end of the function.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
allocaInsertPoint->eraseFromParent();
allocaInsertPoint = nullptr;
}
With the placeholder in place,
allocateStackVariable()
can create a new
builder, set its insertion point to the placeholder and
allocate the space for the given variable.
llvm::AllocaInst *
Codegen::allocateStackVariable(llvm::Function *function,
const std::string_view identifier) {
llvm::IRBuilder<> tmpBuilder(context);
tmpBuilder.SetInsertPoint(allocaInsertPoint);
return tmpBuilder.CreateAlloca(tmpBuilder.getDoubleTy(), nullptr, identifier);
}
Inside a block, instructions need to be generated for each statement. Once the first return statement is encountered, code generation stops, as instructions after the return statement are unreachable anyway.
void Codegen::generateBlock(const ResolvedBlock &block) {
for (auto &&stmt : block.statements) {
generateStmt(*stmt);
if (dynamic_cast<const ResolvedReturnStmt *>(stmt.get())) {
builder.ClearInsertionPoint();
break;
}
}
}
The same pattern is followed in Codegen
which
can be seen in Sema
. The
generateStmt()
method acts as the driver for
statement generation.
llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
if (auto *expr = dynamic_cast<const ResolvedExpr *>(&stmt))
return generateExpr(*expr);
if (auto *returnStmt = dynamic_cast<const ResolvedReturnStmt *>(&stmt))
return generateReturnStmt(*returnStmt);
llvm_unreachable("unknown statement");
}
With return statements the execution of the function can come to an end at an arbitrary point inside the body. If the function is non-void, it can also return an arbitrary value.
The common pattern to handle this is to create a dedicated return variable on the stack and a dedicated return basic block.
class Codegen {
...
llvm::Value *retVal = nullptr;
llvm::BasicBlock *retBB = nullptr;
...
};
Both of these helpers are created in
generateFunctionBody()
before the arguments are
allocated so that the return variable is the first on the
stack. If the function doesn't return a value, there is no
need to allocate this variable.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
bool isVoid = functionDecl.type.kind == Type::Kind::Void;
if (!isVoid)
retVal = allocateStackVariable(function, "retval");
retBB = llvm::BasicBlock::Create(context, "return");
int idx = 0;
...
}
If there is at least one explicit return statement, the
return block needs to be inserted into the function.
Otherwise, nothing jumps to this block and the
ret
instructions can be inserted at the end of
the current block.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
if (retBB->hasNPredecessorsOrMore(1)) {
builder.CreateBr(retBB);
retBB->insertInto(function);
builder.SetInsertPoint(retBB);
}
...
}
If the function is void
, the corresponding
instruction is inserted. Otherwise, the function returns the
value stored in the dedicated return variable.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
if (isVoid) {
builder.CreateRetVoid();
return;
}
builder.CreateRet(builder.CreateLoad(builder.getDoubleTy(), retVal));
}
When the return statement is visited and it returns a value, the value is stored in the dedicated return variable. Finally, a jump to the return block is inserted.
llvm::Value *Codegen::generateReturnStmt(const ResolvedReturnStmt &stmt) {
if (stmt.expr)
builder.CreateStore(generateExpr(*stmt.expr), retVal);
return builder.CreateBr(retBB);
}
The generation of expressions is driven by
generateExpr()
. Since the method is expected to
handle every expression, its end is unreachable.
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
...
llvm_unreachable("unexpected expression");
}
From a NumberLiteral
a constant floating point
value is generated with the corresponding value.
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
if (auto *number = dynamic_cast<const ResolvedNumberLiteral *>(&expr))
return llvm::ConstantFP::get(builder.getDoubleTy(), number->value);
...
}
A DeclRefExpr
loads the value of the
corresponding declaration from memory.
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
...
if (auto *dre = dynamic_cast<const ResolvedDeclRefExpr *>(&expr))
return builder.CreateLoad(builder.getDoubleTy(), declarations[dre->decl]);
...
}
Generating IR for a CallExpr
takes a little bit
more code, so it gets its dedicated generator method.
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
...
if (auto *call = dynamic_cast<const ResolvedCallExpr *>(&expr))
return generateCallExpr(*call);
...
}
First, the callee is looked up from the module.
llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
llvm::Function *callee = module->getFunction(call.callee->identifier);
...
}
Then the arguments are generated and collected into a list.
llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
...
std::vector<llvm::Value *> args;
for (auto &&arg : call.arguments)
args.emplace_back(generateExpr(*arg));
...
}
Finally, the looked-up function is called with the generated arguments.
llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
...
return builder.CreateCall(callee, args);
}
To print values, the built-in
println()
function calls the
printf()
function from libc
.
printf()
takes a format string as its first
argument followed by an unknown amount of arguments.
int printf(const char* format, ...);
Before the function can be called from the module, it must
be declared. The return type of the function is
Int32
, while
const char *
translates to
Int8Ptr
because a char
is an 8-bit
integer value and constness doesn't matter in LLVM IR. Also
because printf()
takes an unknown amount of
arguments, it's a variadic function, which must be kept in
mind when getting the function type.
void Codegen::generateBuiltinPrintBody() {
auto *type = llvm::FunctionType::get(builder.getInt32Ty(),
{builder.getInt8PtrTy()}, true);
...
}
With the type known, the function can be created too.
Because printf()
lives inside libc, it must be
created with external linkage for the linker to find it.
void Codegen::generateBuiltinPrintBody() {
...
auto *printf = llvm::Function::Create(type, llvm::Function::ExternalLinkage,
"printf", *module);
...
}
The function is called with the
"%.15g\n"
format string, which means that the
value is either converted to a decimal notation if possible
or gets printed in a floating point notation with a 15-digit
precision followed by a new line at the end.
void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
...
auto *format = builder.CreateGlobalStringPtr("%.15g\n");
...
}
The first and only variadic argument passed to
printf()
is the first argument of the built-in
println()
function.
void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
...
llvm::Value *param = builder.CreateLoad(
builder.getDoubleTy(), declarations[println.params[0].get()]);
...
}
Finally, the call to printf()
with the
generated arguments is inserted.
void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
...
builder.CreateCall(printf, {format, param});
}
This body of println()
is generated when the
function is seen in generateFunctionBody()
.
void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
...
if (functionDecl.identifier == "println")
generateBuiltinPrintlnBody(functionDecl);
else
generateBlock(*functionDecl.body);
...
}
The last thing that needs special handling is the
main()
function. Right now it's a
void
function but the clang driver attempts to
link the generated platform-specific code against the C
runtime library, which expects main()
to return
an int
, so the linking fails.
To overcome the problem, the current
main()
function is renamed to
__builtin_main()
and a new
main()
wrapper is generated that calls
__builtin_main()
and returns 0
.
std::unique_ptr<llvm::Module> Codegen::generateIR() {
...
generateMainWrapper();
return &module;
}
First, the main()
function is looked up from
the module and gets renamed to
__builtin_main()
.
void Codegen::generateMainWrapper() {
auto *builtinMain = module->getFunction("main");
builtinMain->setName("__builtin_main");
...
}
A new main()
function is created with its
expected int main()
signature and external
linkage.
void Codegen::generateMainWrapper() {
...
auto *main = llvm::Function::Create(
llvm::FunctionType::get(builder.getInt32Ty(), {}, false),
llvm::Function::ExternalLinkage, "main", *module);
...
}
Finally, a basic block is created for the body of the
function and the call to __builtin_main()
is
inserted followed by a return statement, that returns
0
.
void Codegen::generateMainWrapper() {
...
auto *entry = llvm::BasicBlock::Create(context, "entry", main);
builder.SetInsertPoint(entry);
builder.CreateCall(builtinMain);
builder.CreateRet(llvm::ConstantInt::getSigned(builder.getInt32Ty(), 0));
}