Static Single Assignment Form

LLVM IR is an implementation of the static single assignment form, an intermediate representation where each variable is assigned exactly once.

In SSA complex expressions are broken down into a series of assignments to variables. Consider z = x + y; for example. $0-$2 mean temporary variables.

$0 = x      
$1 = y      
$2 = $0 + $1
z = $2

More than one assignment to the same variable is forbidden. One way to model snippets such as x = 0; x = 1; is to only assign the value of the last assignment to x.

$0 = 0
$1 = $0
$2 = 1
x = $2

From this representation, it is visible that $1, the result of x = 0; is not used on the right hands side of any other assignments, which tells the compiler that $1 and all the other assignments needed to compute $1 can be optimized away from the generated code.

There is also a special case when a variable has a different value depending on which execution path is taken.

if condition {
  x = 0;
} else {
  x = 1;
}

y = x;

At the point of y = x; the value of x is 0 if the condition is true and 1 if it is false. To express this in SSA, a special phi function is inserted.

     ┌───────────┐
     │ condition │
     └───────────┘
    ┌──────┴──────┐
┌────────┐   ┌────────┐
│ $0 = 0 │   │ $1 = 1 │
└────────┘   └────────┘
    └──────┬──────┘    
  ┌─────────────────┐
  │ x = phi($0, $1) │
  │ y = x           │
  └─────────────────┘

x = phi($0, $1) means that the value of x is either $0 or $1 depending on which branch is executed.

A phi function can have any number of values, if there are 3 branches with a different assignment in each, the resulting function might look like x = phi($0, $1, $2).

Codegen

The Codegen class encapsulates the logic for generating LLVM IR using the LLVM IRBuilder API. First, an LLVMContext is created, which owns and manages the data needed by the IRBuilder and various other LLVM APIs.

The generated IR is stored inside a Module, while the instructions are inserted into the module by the IRBuilder itself.

class Codegen {
  llvm::LLVMContext context;
  llvm::IRBuilder<> builder;
  llvm::Module module;
};

The Codegen class also stores the resolved tree and takes the path of the source file in its constructor.

class Codegen {
  std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree;

  ...

public:
  Codegen(std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree,
          std::string_view sourcePath);

  llvm::Module *generateIR();
};

The stored Module must be given a name upon instantiation, so it's going to be called "<translation_unit>". The path of the source file is also inserted into the Module as a piece of additional information.

Codegen::Codegen(
    std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree,
    std::string_view sourcePath)
    : resolvedTree(std::move(resolvedTree)),
      builder(context),
      module("<translation_unit>", context) {
  module.setSourceFileName(sourcePath);
  ...
}

To help the LLVM backend determine how to generate the native executable, it is recommended to set the target triple of the module. This triple contains information about the target machine such as its CPU architecture and operating system.

Since the goal is to generate an executable native to the machine the compiler runs on, the target triple is set to the triple of the current platform, which can be retrieved with the llvm::sys::getDefaultTargetTriple() helper.

Codegen::Codegen(...) ... {
  ...
  module.setTargetTriple(llvm::sys::getDefaultTargetTriple());
}

The generation of the IR, similarly to the resolution happens in two passes. First, the function declarations are generated without their bodies, so when the bodies are generated in the next pass the module already knows about all of the functions, which makes them easier to call.

llvm::Module *Codegen::generateIR() {
  for (auto &&function : resolvedTree)
    generateFunctionDecl(*function);

  for (auto &&function : resolvedTree)
    generateFunctionBody(*function);

  ...
}

When both passes finish, a pointer to the module is returned to the caller. Note that the owner of the Module cannot be changed, as the module also depends on the LLVMContext for information, so the context and the module must be kept together.

llvm::Module *Codegen::generateIR() {
  ...

  return &module;
}

To be able to generate accurate IR, the types of the language need to be converted to types recognized by LLVM. The IRBuilder interface provides various builtin types including double, which can be used for the Number type and void, which can be used for the corresponding type of the language.

llvm::Type *Codegen::generateType(Type type) {
  if (type.kind == Type::Kind::Number)
    return builder.getDoubleTy();

  return builder.getVoidTy();
}

To generate a function, LLVM must know its type first. A function's type is composed of its return type and the list of types the function takes as parameters.

void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
  auto *retType = generateType(functionDecl.type);

  std::vector<llvm::Type *> paramTypes;
  for (auto &&param : functionDecl.params)
    paramTypes.emplace_back(generateType(param->type));

  ...
}

With the return type and the type of the parameters known, a FunctionType can be retrieved, which is later used to create the actual function. The last parameter passed to FunctionType::get() is a flag that indicates if the function is variadic or not.

A variadic function is a kind of function that takes an unknown amount of parameters, but since such functions are not supported by the language, this flag is always set to false.

void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
  ...
  auto *type = llvm::FunctionType::get(retType, paramTypes, false);
  ...
}

Besides the type of the function, LLVM also needs to know its linkage. Functions that are visible outside the module have external linkage, while functions visible only inside the module have internal linkage. Since this compiler only works with one module the linkage of a function has no visible effect. For simplicity, however, every function is created with external linkage.

With the type, the linkage and its identifier known, the function can be created inside the module. Since the module stores the function, it can be retrieved on demand at any time.

void Codegen::generateFunctionDecl(const ResolvedFunctionDecl &functionDecl) {
  ...
  llvm::Function::Create(type, llvm::Function::ExternalLinkage,
                         functionDecl.identifier, module);
}

In generateFunctionBody() first, the function is retrieved from the module.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  auto *function = module->getFunction(functionDecl.identifier);

  ...
}

In LLVM IR a set of instructions that execute together is placed inside a basic block. The basic block representing the body of the function is called entry.

After creating the entry block of the function, it's set as the insertion point of the builder, which means that every instruction the builder generates is now placed inside this block.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  auto *entryBB = llvm::BasicBlock::Create(context, "entry", function);
  builder.SetInsertPoint(entryBB);

  ...
}

Parameter declarations are not known by the module, unlike the functions they belong to. To be able to reference them throughout the IR generation, their declarations are mapped to their values in the IR.

class Codegen {
  ...
  std::map<const ResolvedDecl *, llvm::Value *> declarations;
  ...
};

For each of the parameters, its name is set first.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  int idx = 0;
  for (auto &&arg : function->args()) {
    const auto *paramDecl = functionDecl.params[idx].get();
    arg.setName(paramDecl->identifier);

    ...
    ++idx;
  }

  ...
}

Then a variable on the stack is allocated for the parameter and its value is stored inside this variable. The ResolvedDecl is also mapped to this variable on the stack.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...
  for (auto &&arg : function->args()) {
    ...

    llvm::Value *var = allocateStackVariable(function, paramDecl->identifier);
    builder.CreateStore(&arg, var);

    declarations[paramDecl] = var;
    ...
  }
  ...
}

After the parameters are handled, the function body can be generated.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...
  generateBlock(*functionDecl.body);
}

Allocating Stack Variables

Allocating variables on the stack happens inside allocateStackVariable(). The variables are inserted at the beginning of the entry block, so they are visible to every instruction inside that block.

By default, the IRBuilder can only insert instructions before another instruction or at the end of the block.

Certain optimization passes only work if all of the variables are allocated in the entry block of the function, so always allocating them in the most recent block is missing out on a better optimized code. A different approach might be to always insert before the first instruction, but in that case, it's always the latest variable appearing first, so the order is reversed.

The idea is to insert a placeholder instruction at the beginning of the entry block and always insert the stack variables before this placeholder.

class Codegen {
  ...
  llvm::Instruction *allocaInsertPoint;
  ...
};

This way the latest variable is inserted before the placeholder and directly after the previous variable.

                                        entry:        
                    entry:                %var1
entry:                %var1               %var2
  %placeholder        %placeholder        %placeholder

The placeholder is inserted in generateFunctionBody() right after setting the insertion point to the entry block. Any instruction without side effects is sufficient, but to stick to what is seen in clang, the bit casting of an undefined value is inserted.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...
  builder.SetInsertPoint(entryBB);

  llvm::Value *undef = llvm::UndefValue::get(builder.getInt32Ty());
  allocaInsertPoint = new llvm::BitCastInst(undef, undef->getType(),
                                            "alloca.placeholder", entryBB);

  ...
}

Since this placeholder is no longer needed once the generation of the function body is done, it is removed at the end of the function.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  allocaInsertPoint->eraseFromParent();
  allocaInsertPoint = nullptr;
}

With the placeholder in place, allocateStackVariable() can create a new builder, set its insertion point to the placeholder and allocate the space for the given variable.

llvm::AllocaInst *
Codegen::allocateStackVariable(llvm::Function *function,
                               const std::string_view identifier) {
  llvm::IRBuilder<> tmpBuilder(context);
  tmpBuilder.SetInsertPoint(allocaInsertPoint);

  return tmpBuilder.CreateAlloca(tmpBuilder.getDoubleTy(), nullptr, identifier);
}

Generating Statements

Inside a block, instructions need to be generated for each statement. Once the first return statement is encountered, code generation stops, as instructions after the return statement are unreachable anyway.

void Codegen::generateBlock(const ResolvedBlock &block) {
  for (auto &&stmt : block.statements) {
    generateStmt(*stmt);

    if (dynamic_cast<const ResolvedReturnStmt *>(stmt.get())) {
      builder.ClearInsertionPoint();
      break;
    }
  }
}

The same pattern is followed in Codegen which can be seen in Sema. The generateStmt() method acts as the driver for statement generation.

llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
  if (auto *expr = dynamic_cast<const ResolvedExpr *>(&stmt))
    return generateExpr(*expr);

  if (auto *returnStmt = dynamic_cast<const ResolvedReturnStmt *>(&stmt))
    return generateReturnStmt(*returnStmt);

  llvm_unreachable("unknown statement");
}

With return statements the execution of the function can come to an end at an arbitrary point inside the body. If the function is non-void, it can also return an arbitrary value.

The common pattern to handle this is to create a dedicated return variable on the stack and a dedicated return basic block.

class Codegen {
  ...
  llvm::Value *retVal = nullptr;
  llvm::BasicBlock *retBB = nullptr;
  ...
};

Both of these helpers are created in generateFunctionBody() before the arguments are allocated so that the return variable is the first on the stack. If the function doesn't return a value, there is no need to allocate this variable.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  bool isVoid = functionDecl.type.kind == Type::Kind::Void;
  if (!isVoid)
    retVal = allocateStackVariable(function, "retval");
  retBB = llvm::BasicBlock::Create(context, "return");

  int idx = 0;
  ...
}

If there is at least one explicit return statement, the return block needs to be inserted into the function. Otherwise, nothing jumps to this block and the ret instructions can be inserted at the end of the current block.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  if (retBB->hasNPredecessorsOrMore(1)) {
    builder.CreateBr(retBB);
    retBB->insertInto(function);
    builder.SetInsertPoint(retBB);
  }

  ...
}

If the function is void, the corresponding instruction is inserted. Otherwise, the function returns the value stored in the dedicated return variable.

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  if (isVoid) {
    builder.CreateRetVoid();
    return;
  }

  builder.CreateRet(builder.CreateLoad(builder.getDoubleTy(), retVal));
}

When the return statement is visited and it returns a value, the value is stored in the dedicated return variable. Finally, a jump to the return block is inserted.

llvm::Value *Codegen::generateReturnStmt(const ResolvedReturnStmt &stmt) {
  if (stmt.expr)
    builder.CreateStore(generateExpr(*stmt.expr), retVal);

  return builder.CreateBr(retBB);
}

Generating Expressions

The generation of expressions is driven by generateExpr(). Since the method is expected to handle every expression, its end is unreachable.

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  ...

  llvm_unreachable("unexpected expression");
}

From a NumberLiteral a constant floating point value is generated with the corresponding value.

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  if (auto *number = dynamic_cast<const ResolvedNumberLiteral *>(&expr))
    return llvm::ConstantFP::get(builder.getDoubleTy(), number->value);

  ...
}

A DeclRefExpr loads the value of the corresponding declaration from memory.

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  ...

  if (auto *dre = dynamic_cast<const ResolvedDeclRefExpr *>(&expr))
    return builder.CreateLoad(builder.getDoubleTy(), declarations[dre->decl]);

 ...
}

Generating IR for a CallExpr takes a little bit more code, so it gets its dedicated generator method.

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  ...

  if (auto *call = dynamic_cast<const ResolvedCallExpr *>(&expr))
    return generateCallExpr(*call);

 ...
}

First, the callee is looked up from the module.

llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
  llvm::Function *callee = module->getFunction(call.callee->identifier);

  ...
}

Then the arguments are generated and collected into a list.

llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
  ...

  std::vector<llvm::Value *> args;
  for (auto &&arg : call.arguments)
    args.emplace_back(generateExpr(*arg));

  ...
}

Finally, the looked-up function is called with the generated arguments.

llvm::Value *Codegen::generateCallExpr(const ResolvedCallExpr &call) {
  ...

  return builder.CreateCall(callee, args);
}

Printing Values

To print values, the built-in println() function calls the printf() function from libc.

printf() takes a format string as its first argument followed by an unknown amount of arguments.

int printf(const char* format, ...);

Before the function can be called from the module, it must be declared. The return type of the function is Int32, while const char * translates to Int8Ptr because a char is an 8-bit integer value and constness doesn't matter in LLVM IR. Also because printf() takes an unknown amount of arguments, it's a variadic function, which must be kept in mind when getting the function type.

void Codegen::generateBuiltinPrintBody() {
  auto *type = llvm::FunctionType::get(builder.getInt32Ty(),
                                       {builder.getInt8PtrTy()}, true);
  ...
}

With the type known, the function can be created too. Because printf() lives inside libc, it must be created with external linkage for the linker to find it.

void Codegen::generateBuiltinPrintBody() {
  ...
  auto *printf = llvm::Function::Create(type, llvm::Function::ExternalLinkage,
                                        "printf", *module);
  ...
}

The function is called with the "%.15g\n" format string, which means that the value is either converted to a decimal notation if possible or gets printed in a floating point notation with a 15-digit precision followed by a new line at the end.

void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
  ...
  auto *format = builder.CreateGlobalStringPtr("%.15g\n");
  ...
}

The first and only variadic argument passed to printf() is the first argument of the built-in println() function.

void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
  ...
  llvm::Value *param = builder.CreateLoad(
      builder.getDoubleTy(), declarations[println.params[0].get()]);

  ...
}

Finally, the call to printf() with the generated arguments is inserted.

void Codegen::generateBuiltinPrintlnBody(const ResolvedFunctionDecl &println) {
  ...
  builder.CreateCall(printf, {format, param});
}

This body of println() is generated when the function is seen in generateFunctionBody().

void Codegen::generateFunctionBody(const ResolvedFunctionDecl &functionDecl) {
  ...

  if (functionDecl.identifier == "println")
    generateBuiltinPrintlnBody(functionDecl);
  else
    generateBlock(*functionDecl.body);

  ...
}

The Main Function

The last thing that needs special handling is the main() function. Right now it's a void function but the clang driver attempts to link the generated platform-specific code against the C runtime library, which expects main() to return an int, so the linking fails.

To overcome the problem, the current main() function is renamed to __builtin_main() and a new main() wrapper is generated that calls __builtin_main() and returns 0.

std::unique_ptr<llvm::Module> Codegen::generateIR() {
  ...

  generateMainWrapper();

  return &module;
}

First, the main() function is looked up from the module and gets renamed to __builtin_main().

void Codegen::generateMainWrapper() {
  auto *builtinMain = module->getFunction("main");
  builtinMain->setName("__builtin_main");

  ...
}

A new main() function is created with its expected int main() signature and external linkage.

void Codegen::generateMainWrapper() {
  ...

  auto *main = llvm::Function::Create(
      llvm::FunctionType::get(builder.getInt32Ty(), {}, false),
      llvm::Function::ExternalLinkage, "main", *module);

  ...
}

Finally, a basic block is created for the body of the function and the call to __builtin_main() is inserted followed by a return statement, that returns 0.

void Codegen::generateMainWrapper() {
  ...

  auto *entry = llvm::BasicBlock::Create(context, "entry", main);
  builder.SetInsertPoint(entry);

  builder.CreateCall(builtinMain);
  builder.CreateRet(llvm::ConstantInt::getSigned(builder.getInt32Ty(), 0));
}