Name Resolution

The last step of the compilation pipeline on the frontend is reasoning about the meaning of the source code. A snippet might be accepted by the parser, but it still has no meaning.

fn wrong(x: number): void {
  println(y);
}

The above snippet is correct according to the grammar, however, it's unknown what y refers to. Is it a parameter? Is it a function, which the developer forgot to call? The semantic analyzer of a compiler is searching for the answer to these questions.

The process of figuring out what a name in the source code refers to is called name resolution.

fn printX(x: number): void {
          ^
          │
  println(x); // 'x' refers to the parameter 'x'
}

The idea is to traverse the AST and record every identifier introduced by a declaration into a so-called symbol table. The identifiers can also be called symbols, hence the symbol table name.

When a reference to a symbol is encountered, it is checked whether it has already been recorded. If the symbol is not found in the symbol table it's assumed the developer forgot to declare it and the compiler reports an error.

fn function(  // symbols: ['function']
  x: number,  // symbols: ['function', 'x']
  y: number,  // symbols: ['function', 'x', 'y']
): void {
  x + z;
      ^ error: 'z' is not a known symbol
}

Design Note

The difficulty of this problem can vary based on how the language is designed. C and C++ require every symbol to be declared before it can be referenced, so in those languages, the source file can be traversed from top to bottom and when a symbol is not found an error can be safely reported.

Other languages such as Kotlin or JavaScript (and your language) however allow the use of a symbol before it is declared. The following snippet for example is a valid Kotlin code.

fun main() {
    foo()
}

fun foo() {}

Similarly, this other snippet is valid in JavaScript.

foo()

function foo() {}

JavaScript also introduces a mechanism called hoisting, when function and var declarations are moved to the top of the source file before the source file is executed.

Still even moving every declaration to the top of the source file doesn't work in case of 2 functions are referencing each other in their bodies.

function foo() { bar() }
function bar() { foo() }

C and C++ solve this problem with the use of forward declarations, as in these languages a function can be declared any number of times, but it is only allowed to be defined exactly once.

void foo();
void bar() { foo(); }
void foo() { bar(); }

One drawback of forward declarations is that the compiler has no way to figure out whether a function is defined or not. Catching that error happens after compilation in the linker.

In languages where forward declarations are not allowed, to properly tackle this problem, name resolution must be done in multiple passes.

Scopes

Languages also define scopes, which limit the visibility of symbols. For example in C++ or Kotlin variables defined in the body of a condition are not visible outside that body.

void foo() {
    if(true)
        int x = 0;
    x; // error: 'x' was not declared in this scope
}

In Python, however, the variable is visible until the end of the function.

def fun():
    if True:
        x = 1
    print(x) # 'x' is visible

In your language function symbols declared on the top level of the source file live inside the global scope, which makes them visible to every part of the source file.

┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴────────────┐
│ fn x(n: number): void { println(n); } │
│                                       │
│ fn y(x: number): void { println(x); } │
└───────────────────────────────────────┘

Parameters on the other hand are only visible inside the parameter list they are declared in and the body of the corresponding function. The parameter of a specific function is not visible in any other function.

┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴─────────────────┐
│      ┌────────────────────┐                │
│      │ Param Scope: ['n'] │                │
│      ├────────────────────┴──────────────┐ │
│ fn x │ (n: number): void { println(n); } │ │
│      └───────────────────────────────────┘ │
│                                            │
│      ┌────────────────────┐                │
│      │ Param Scope: ['x'] │                │
│      ├────────────────────┴──────────────┐ │
│ fn y │ (x: number): void { println(x); } │ │
│      └───────────────────────────────────┘ │
└────────────────────────────────────────────┘

Blocks also get their scope, so that everything that's declared inside a block is not visible outside of it.

┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴─────────────────────┐
│      ┌────────────────────┐                    │
│      │ Param Scope: ['n'] │                    │
│      ├────────────────────┴──────────────────┐ │
│      │                   ┌─────────────────┐ │ │
│      │                   │ Block scope: [] │ │ │
│      │                   ├─────────────────┤ │ │
│ fn x │ (n: number): void │ { println(n); } │ │ │
│      │                   └─────────────────┘ │ │
│      └───────────────────────────────────────┘ │
│                                                │
│      ┌────────────────────┐                    │
│      │ Param Scope: ['x'] │                    │
│      ├────────────────────┴──────────────────┐ │
│      │                   ┌─────────────────┐ │ │
│      │                   │ Block Scope: [] │ │ │
│      │                   ├─────────────────┤ │ │
│ fn y │ (x: number): void │ { println(x); } │ │ │
│      │                   └─────────────────┘ │ │
│      └───────────────────────────────────────┘ │
└────────────────────────────────────────────────┘

Because scopes are nested inside each other and each scope needs to be aware of the declarations it contains, they are usually modelled as symbol tables linked after each other.

['x', 'y'] <- ['n'] <- []
           ^ 
           └─ ['x'] <- []

During lookup, the scopes are scanned from the inner-most one to the outer-most one, so it's always the declaration in the deepest scope found first.

Lookup 'x' in the body of 'fn x()':
['x', 'y'] <- ['n'] <- [] <-- lookup from right to left
  ^

Lookup 'x' in the body of 'fn y()':
['x', 'y'] <- ['x'] <- [] <-- lookup from right to left
                ^

Design Note

The previously described scope is also called lexical scope as the visibility of the variables depends on the place where they are created in source code.

For this reason in the case of functions, symbol tables can be created and destroyed as the semantic analyzer traverses them from top to bottom, however, other elements such as classes require their symbol tables to be bound to the class itself. The reason for this is inheritance.

struct Base {
  int x;
  void foo();
};

struct Derived : public Base {};

Because Derived inherits from Base, it must have access to the symbols of Base. If the symbol table of Base is destroyed when the semantic analyzer finishes processing the class, the information is lost.

If a symbol at the point of a declaration is already inserted into the same scope, it's called a redeclaration. Inside the compiler, it is an error, as there is no way to know which of the symbols a reference points to.

// symbol tables: ['function'] <- ['x', 'x'] <- []

fn function(x: number, x: number): void {
            ^          ^
          ┌─┴──────────┘ // which 'x' is being printed?
  println(x);
}

When the same symbol is already found in a higher symbol table it's called shadowing, as the deeper symbol is found first by the compiler and it hides the higher symbol.

// symbol tables: ['x', 'y'] <- ['x'] <- []

fn x(): void {}
   ^
   └─┐ // the parameter shadows the function
fn y(x: number): void {
     ^
  ┌──┘ // the closest 'x' symbol is the parameter
  x;
}

Resolved Tree

After resolution, the compiler should have an intermediate representation that contains information about the meaning of the source code. For example, a DeclRefExpr should no longer store a std::string that represents the identifier of the declaration being referenced, but a pointer to the corresponding Decl node itself.

One approach to achieve this is to add additional fields to the current AST and fill those in during semantic analysis. The drawback is that nodes can be left in an invalid state by forgetting to set the correct information. Also, the AST becomes too heavy with multiple properties that are no longer needed after resolution.

Another approach is to build a dedicated representation that only contains the information needed after resolution. This is the approach being preferred in this compiler. The representation created is also a tree structure similar to the AST, in fact, it could also be called resolved AST or simply resolved tree.

The ResolvedStmt node is identical to the AST Stmt as it also acts as the base class for every resolved statement and stores the location of the node and a dump() method.

struct ResolvedStmt {
  SourceLocation location;

  ResolvedStmt(SourceLocation location)
      : location(location) {}

  virtual ~ResolvedStmt() = default;

  virtual void dump(size_t level = 0) const = 0;
};

The ResolvedExpr node is different as every resolved expression must also have its type set.

struct ResolvedExpr : public ResolvedStmt {
  Type type;

  ResolvedExpr(SourceLocation location, Type type)
      : ResolvedStmt(location),
        type(type) {}
};

Similarly to resolved expressions, resolved declarations must also have their types set compared to their non-resolved counterparts.

struct ResolvedDecl {
  SourceLocation location;
  std::string identifier;
  Type type;

  ResolvedDecl(SourceLocation location, std::string identifier, Type type)
      : location(location),
        identifier(std::move(identifier)),
        type(type) {}
  virtual ~ResolvedDecl() = default;

  virtual void dump(size_t level = 0) const = 0;
};

Since numbers are all double precision floating point numbers, ResolvedNumberLiteral stores a double instead of the textual representation of its value.

struct ResolvedNumberLiteral : public ResolvedExpr {
  double value;

  ResolvedNumberLiteral(SourceLocation location, double value)
      : ResolvedExpr(location, Type::builtinNumber()),
        value(value) {}

  void dump(size_t level = 0) const override;
};

Its textual representation also prints this value.

void ResolvedNumberLiteral::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedNumberLiteral: '" << value << "'\n";
}

ResolvedDeclRefExpr stores a pointer to the referenced ResolvedDecl instead of its identifier.

struct ResolvedDeclRefExpr : public ResolvedExpr {
  const ResolvedDecl *decl;

  ResolvedDeclRefExpr(SourceLocation location, ResolvedDecl &decl)
      : ResolvedExpr(location, decl.type),
        decl(&decl) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the node also contains the address of the referenced declaration besides its identifier.

void ResolvedDeclRefExpr::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedDeclRefExpr: @(" << decl << ") "
            << decl->identifier << '\n';
}

The callee of a ResolvedCallExpr is now a pointer to a ResolvedFunctionDecl instead of an arbitrary expression.

struct ResolvedCallExpr : public ResolvedExpr {
  const ResolvedFunctionDecl *callee;
  std::vector<std::unique_ptr<ResolvedExpr>> arguments;

  ResolvedCallExpr(SourceLocation location,
                   const ResolvedFunctionDecl &callee,
                   std::vector<std::unique_ptr<ResolvedExpr>> arguments)
      : ResolvedExpr(location, callee.type),
        callee(&callee),
        arguments(std::move(arguments)) {}

  void dump(size_t level = 0) const override;
};

In the textual representation of the node, the address of the function is printed too.

void ResolvedCallExpr::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedCallExpr: @(" << callee << ") "
            << callee->identifier << '\n';

  for (auto &&arg : arguments)
    arg->dump(level + 1);
}

The ResolvedBlock node doesn't contain any different information from its non-resolved counterpart.

struct ResolvedBlock {
  SourceLocation location;
  std::vector<std::unique_ptr<ResolvedStmt>> statements;

  ResolvedBlock(SourceLocation location,
                std::vector<std::unique_ptr<ResolvedStmt>> statements)
      : location(location),
        statements(std::move(statements)) {}

  void dump(size_t level = 0) const;
};

Its textual representation is also identical to the AST Block node.

void ResolvedBlock::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedBlock\n";

  for (auto &&stmt : statements)
    stmt->dump(level + 1);
}

With moving more information to its base class, ResolvedParamDecl no longer contains any unique fields.

struct ResolvedParamDecl : public ResolvedDecl {
  ResolvedParamDecl(SourceLocation location, std::string identifier, Type type)
      : ResolvedDecl{location, std::move(identifier), type} {}

  void dump(size_t level = 0) const override;
};

The textual representation of the node also contains its address.

void ResolvedParamDecl::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedParamDecl: @(" << this << ") "
            << identifier << ':' << '\n';
}

ResolvedFunctionDecl only stores a parameter list and the body of the function. The rest of the fields are stored in the ResolvedDecl base class.

struct ResolvedFunctionDecl : public ResolvedDecl {
  std::vector<std::unique_ptr<ResolvedParamDecl>> params;
  std::unique_ptr<ResolvedBlock> body;

  ResolvedFunctionDecl(SourceLocation location,
                       std::string identifier,
                       Type type,
                       std::vector<std::unique_ptr<ResolvedParamDecl>> params,
                       std::unique_ptr<ResolvedBlock> body)
      : ResolvedDecl(location, std::move(identifier), type),
        params(std::move(params)),
        body(std::move(body)) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the node contains its name, address, identifier, parameters and body.

void ResolvedFunctionDecl::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedFunctionDecl: @(" << this << ") "
            << identifier << ':' << '\n';

  for (auto &&param : params)
    param->dump(level + 1);

  body->dump(level + 1);
}

Identical to its non-resolved counterpart ResolvedReturnStmt stores only the optional expression whose value is returned.

struct ResolvedReturnStmt : public ResolvedStmt {
  std::unique_ptr<ResolvedExpr> expr;

  ResolvedReturnStmt(SourceLocation location,
                     std::unique_ptr<ResolvedExpr> expr = nullptr)
      : ResolvedStmt(location),
        expr(std::move(expr)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation is also almost identical to its non-resolved counterpart.

void ResolvedReturnStmt::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedReturnStmt\n";

  if (expr)
    expr->dump(level + 1);
}

Sema

The semantic analysis logic is encapsulated by the Sema class. It takes the AST returned by the parser and attempts to resolve it.

class Sema {
  std::vector<std::unique_ptr<FunctionDecl>> ast;

public:
  explicit Sema(std::vector<std::unique_ptr<FunctionDecl>> ast)
      : ast(std::move(ast)) {}

  std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolveAST();
};

A scope is represented by a vector of ResolvedDecl*, while the chain of scopes is represented by a vector of scopes.

class Sema {
  ...
  std::vector<std::vector<ResolvedDecl *>> scopes;
  ...
};

To make working with scopes more convenient, a helper object is introduced. The ScopeRAII makes managing the scopes natural by creating a new scope in its constructor and destroying it in its destructor.

class Sema {
  ...
  class ScopeRAII {
    Sema *sema;

  public:
    explicit ScopeRAII(Sema *sema)
        : sema(sema) {
      sema->scopes.emplace_back();
    }
    ~ScopeRAII() { sema->scopes.pop_back(); }
  };
  ...
};

The lookupDecl() method looks up a symbol based on its identifier and returns the ResolvedDecl and the index of the scope the declaration was found in. The index helps to determine whether an identifier collision is a redeclaration or a shadowing.

Design Note

With a more complex type system, it's also necessary to use the type of the symbol as well during the lookup, as a user-defined type might have the same name as a local variable.

class C {};
int C;

To look up a symbol, first, the symbol tables are iterated in reverse order while keeping track of the scope index, so the inner-most table is checked first.

std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
  int scopeIdx = 0;
  for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
    ...

    ++scopeIdx;
  }

  ...
}

For every scope, the declarations inside that scope are iterated and if the identifier of any declaration matches the searched identifier, the declaration and the index of the scope are returned.

std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
  ...
  for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
    for (auto &&decl : *it) {
      if (decl->identifier != id)
        continue;

      return {decl, scopeIdx};
    }

    ...
  }

  ...
}

If none of the scopes contains the declaration a nullptr is returned along with -1 as the index of the containing scope.

std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
  ...

  return {nullptr, -1};
}

The insertDeclToCurrentScope() method attempts to insert the declaration into the inner-most scope. To handle redeclarations, it first checks if a declaration with the same identifier is already inserted into the current scope chain.

bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
  const auto &[foundDecl, scopeIdx] = lookupDecl(decl.identifier);

  ...
}

If the declaration is found in the inner-most scope an error is reported for redeclaring the same symbol. The method also returns false in this case to indicate the failure.

bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
  ...

  if (foundDecl && scopeIdx == 0) {
    report(decl.location, "redeclaration of '" + decl.identifier + '\'');
    return false;
  }

  ...
}

If the declaration is not found or it shadows another declaration in an outer scope, it is inserted into the inner-most one and true is returned to indicate success.

bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
  ...

  scopes.back().emplace_back(&decl);
  return true;
}

The source file is resolved in two passes. Because the order of the function declarations doesn't matter, first they are resolved without their bodies. This way all of the valid function symbols are inserted into the global scope.

Before the functions are resolved, however, the built-in println() function needs to be created and inserted into the global scope. Creating this function first allows the semantic analyzer to easily report an error if the println() is redeclared in the source code.

The resolveAST() method creates the global scope, the root of the resolved tree and inserts println() into both of them.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
  std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree;
  auto println = createBuiltinPrintln();
                  
  ScopeRAII globalScope(this);
  insertDeclToCurrentScope(*resolvedTree.emplace_back(std::move(println)));
  ...
}

The println() function is created at the <builtin>:0:0 location, with one parameter, n: number.

std::unique_ptr<ResolvedFunctionDecl> Sema::createBuiltinPrintln() {
  SourceLocation loc{"<builtin>", 0, 0};

  auto param =
      std::make_unique<ResolvedParamDecl>(loc, "n", Type::builtinNumber());

  std::vector<std::unique_ptr<ResolvedParamDecl>> params;
  params.emplace_back(std::move(param));

  ...
};

The return type of the function is void and its body is empty. This function node in the resolved tree only acts as a placeholder.

std::unique_ptr<ResolvedFunctionDecl> Sema::createBuiltinPrintln() {
  ...

  auto block = std::make_unique<ResolvedBlock>(
      loc, std::vector<std::unique_ptr<ResolvedStmt>>());

  return std::make_unique<ResolvedFunctionDecl>(
      loc, "println", Type::builtinVoid(), std::move(params), std::move(block));
};

After the built-in println() function is inserted into the resolved tree, the functions in the AST are resolved one by one.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
  ...

  for (auto &&fn : ast) {
    auto resolvedFunctionDecl = resolveFunctionDeclaration(*fn);
    ...
  }

  ...
}

If an error happens during the resolution of one of the functions or a function with the same name is declared multiple times, the error is recorded but the semantic analyzer keeps checking the other functions. This way independent errors are stacked together.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
  ...

  bool error = false;
  for (auto &&fn : ast) {
    ...

    if (!resolvedFunctionDecl ||
        !insertDeclToCurrentScope(*resolvedFunctionDecl)) {
      error = true;
      continue;
    }

    ...
  }

  ...
}

If a function is successfully resolved, it is inserted into the resolved tree.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
  ...

  for (auto &&fn : ast) {
    ...

    resolvedTree.emplace_back(std::move(resolvedFunctionDecl));
  }

  ...
}

If any error happens during this pass, an empty resolved tree is returned.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
  ...

  if (error)
    return {};

  ...
}

In the second pass, the body of every function in the resolved tree is resolved. The first function in the tree is the built-in println(), which can be skipped.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
  ...
  for (size_t i = 1; i < resolvedTree.size(); ++i) {
    ...
  }

  ...
}

The current function is also recorded inside the Sema class because to reason about some statements the analyzer needs to know which function the particular statement is in.

class Sema {
  ...
  ResolvedFunctionDecl *currentFunction;
  ...
};

Next, the current function is set and a new scope is created for its parameters, so they are visible to any DeclRefExpr that references them in the body.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
  ...
  for (size_t i = 1; i < resolvedTree.size(); ++i) {
    currentFunction = resolvedTree[i].get();
    
    ScopeRAII paramScope(this);
    for (auto &&param : currentFunction->params)
      insertDeclToCurrentScope(*param);

    ...
  }

  ...
}

Then the body is resolved. This is the only place where a resolved node is modified after its creation, as its body has to be replaced with the resolved body. If any of the bodies cannot be resolved, the error is recorded.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
  ...
  for (size_t i = 1; i < resolvedTree.size(); ++i) {
    ...

    auto resolvedBody = resolveBlock(*ast[i - 1]->body);
    if (!resolvedBody) {
      error = true;
      continue;
    }

    currentFunction->body = std::move(resolvedBody);
  }

  ...
}

If an error happens during the resolution of any of the blocks an empty resolved tree is returned, otherwise the valid resolved tree is returned.

std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
  ...

  if (error)
    return {};

  return resolvedTree;
}

With a simple type system like this one, a type is automatically considered resolved if it is void or number.

std::optional<Type> Sema::resolveType(Type parsedType) {
  if (parsedType.kind == Type::Kind::Custom)
    return std::nullopt;

  return parsedType;
}

In the case of functions, the return type is resolved first. If the resolution fails, an error is reported.

std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
  std::optional<Type> type = resolveType(function.type);

  if (!type)
    return report(function.location, "function '" + function.identifier +
                                         "' has invalid '" +
                                         function.type.name + "' type");

  ...
};

If the current function is the main() function it is ensured that it is declared as void with no parameters.

std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
  ...

  if (function.identifier == "main") {
    if (type->kind != Type::Kind::Void)
      return report(function.location,
                    "'main' function is expected to have 'void' type");

    if (!function.params.empty())
      return report(function.location,
                    "'main' function is expected to take no arguments");
  }

  ...
};

Next, each of the parameters is resolved. A new scope is created for the parameters so that their redeclarations can be detected easily. If a parameter cannot be resolved, or is the redeclaration of a previous parameter the resolution of the function fails and a nullptr is returned.

std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
  ...

  std::vector<std::unique_ptr<ResolvedParamDecl>> resolvedParams;
  
  ScopeRAII paramScope(this);
  for (auto &&param : function.params) {
    auto resolvedParam = resolveParamDecl(*param);

    if (!resolvedParam || !insertDeclToCurrentScope(*resolvedParam))
      return nullptr;

    resolvedParams.emplace_back(std::move(resolvedParam));
  }

  ...
};

If everything is successful, a ResolvedFunctionDecl is returned without a body node, which is added later.

std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
  ...

  return std::make_unique<ResolvedFunctionDecl>(
      function.location, function.identifier, *type, std::move(resolvedParams),
      nullptr);
};

For a ParamDecl only its type needs to be resolved. If the resolution fails or the type is void an error is reported.

std::unique_ptr<ResolvedParamDecl>
Sema::resolveParamDecl(const ParamDecl &param) {
  std::optional<Type> type = resolveType(param.type);

  if (!type || type->kind == Type::Kind::Void)
    return report(param.location, "parameter '" + param.identifier +
                                      "' has invalid '" + param.type.name +
                                      "' type");

  return std::make_unique<ResolvedParamDecl>(param.location, param.identifier,
                                             *type);
}

Resolving a block is equivalent to resolving all the statements inside that block. For each block, a new scope is created and if an error happens during the resolution of one statement, the semantic analyzer keeps resolving the remaining statements. This allows the compiler to report multiple error messages together at the cost of some introduced false positives.

std::unique_ptr<ResolvedBlock> Sema::resolveBlock(const Block &block) {
  std::vector<std::unique_ptr<ResolvedStmt>> resolvedStatements;

  bool error = false;
  ...

  ScopeRAII blockScope(this);
  for (auto &&stmt : block.statements) {
    auto resolvedStmt = resolveStmt(*stmt);

    error |= !resolvedStatements.emplace_back(std::move(resolvedStmt));
    if (error)
      continue;

    ...
  }

  if (error)
    return nullptr;

  return std::make_unique<ResolvedBlock>(block.location,
                                         std::move(resolvedStatements));
}

This function is the place where unreachable statements can be detected too. After the first return statement in a block, every other statement is considered unreachable, but a warning should only be reported for the first such statement.

std::unique_ptr<ResolvedBlock> Sema::resolveBlock(const Block &block) {
  ...
  int reportUnreachableCount = 0;

  ...
  for (auto &&stmt : block.statements) {
    ...

    if (reportUnreachableCount == 1) {
      report(stmt->location, "unreachable statement", true);
      ++reportUnreachableCount;
    }

    if (dynamic_cast<ReturnStmt *>(stmt.get()))
      ++reportUnreachableCount;
  }

  ...
}

Design Note

Some languages like C or C++ allow the declaration of labels to which the program can jump. If a label appears after a return statement, only the statements after the return and before the label are unreachable.

void foo() {
    return;
    int unreachable;

label:
    int mightBeReachable;
}

Resolving Statements

The resolveStmt() method drives the resolution of the various statements. It checks what the current Stmt is, and dispatches the corresponding resolver method. Because this method is responsible for handling every statement, its end should be unreachable, which is marked with the llvm_unreachable() utility.

std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
  if (auto *expr = dynamic_cast<const Expr *>(&stmt))
    return resolveExpr(*expr);

  if (auto *returnStmt = dynamic_cast<const ReturnStmt *>(&stmt));
    return resolveReturnStmt(*returnStmt);
  
  llvm_unreachable("unexpected statement");
}

For a ReturnStmt it needs to be checked whether it's in the correct form. Void functions are not allowed to return values, while non-void functions are expected to return a value of the return type of the function.

std::unique_ptr<ResolvedReturnStmt>
Sema::resolveReturnStmt(const ReturnStmt &returnStmt) {
  if (currentFunction->type.kind == Type::Kind::Void && returnStmt.expr)
    return report(returnStmt.location,
                  "unexpected return value in void function");

  if (currentFunction->type.kind != Type::Kind::Void && !returnStmt.expr)
    return report(returnStmt.location, "expected a return value");

  ...
}

If the return statement is in the correct form, the expression, whose result is returned is resolved. If the expression cannot be resolved, or the type of the expression doesn't match the return type of the current function, an error is reported.

std::unique_ptr<ResolvedReturnStmt>
Sema::resolveReturnStmt(const ReturnStmt &returnStmt) {
  ...

  std::unique_ptr<ResolvedExpr> resolvedExpr;
  if (returnStmt.expr) {
    resolvedExpr = resolveExpr(*returnStmt.expr);
    if (!resolvedExpr)
      return nullptr;

    if (currentFunction->type.kind != resolvedExpr->type.kind)
      return report(resolvedExpr->location, "unexpected return type");
  }

  return std::make_unique<ResolvedReturnStmt>(returnStmt.location,
                                              std::move(resolvedExpr));
}

Resolving Expressions

For expression resolution resolveExpr() acts as the driver. It checks what the current expression is and decides how the semantic analyzer should proceed. As it is responsible for handling every expression, the end of the function is unreachable similar to resolveStmt().

std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
  ...

  llvm_unreachable("unexpected expression");
}

A NumberLiteral is resolved by its value being converted into a double. The grammar ensures that its value is always a correct double literal, so the conversion always succeeds.

std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
  if (const auto *number = dynamic_cast<const NumberLiteral *>(&expr))
    return std::make_unique<ResolvedNumberLiteral>(number->location,
                                                   std::stod(number->value));

  ...
}

Resolving a DeclRefExpr is a more complex process, so it gets a dedicated method that handles it.

std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
  ...

  if (const auto *declRefExpr = dynamic_cast<const DeclRefExpr *>(&expr))
    return resolveDeclRefExpr(*declRefExpr);

  ...
}

First resolveDeclRefExpr() looks up the referenced symbol. If the symbol is not found, it wasn't declared in the source code, so an error is reported.

std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, ...) {
  ResolvedDecl *decl = lookupDecl(declRefExpr.identifier).first;
  if (!decl)
    return report(declRefExpr.location,
                  "symbol '" + declRefExpr.identifier + "' not found");

  ...
}

The method also takes a flag with a default value set to false indicating if it's the callee of a function call.

class Sema {
  ...
  std::unique_ptr<ResolvedDeclRefExpr>
  resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee = false);
  ...
};

This flag allows the semantic analyzer to catch the case when a function declaration is referenced outside a call expression. In that case, the programmer must have forgotten to call it and this error can be reported.

std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee) {
  ...

  if (!isCallee && dynamic_cast<ResolvedFunctionDecl *>(decl))
    return report(declRefExpr.location,
                  "expected to call function '" + declRefExpr.identifier + "'");

  ...
}

If no problem is found, the ResolvedDeclRefExpr is returned storing a pointer to the resolved declaration.

std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee) {
  ...

  return std::make_unique<ResolvedDeclRefExpr>(declRefExpr.location, *decl);
}

For a CallExpr first, it is checked that the callee is indeed a reference to a declaration and it's not an arbitrary expression being called.

std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
  const auto *dre = dynamic_cast<const DeclRefExpr *>(call.callee.get());
  if (!dre)
    return report(call.location, "expression cannot be called as a function");

  ...
}

If the callee is a reference to a declaration, it is resolved first, so that a possible calling of an unknown symbol is caught and then it is checked if this resolved reference is for a function declaration.

std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
  ...

  varOrReturn(resolvedCallee, resolveDeclRefExpr(*call.identifier, true));

  const auto *resolvedFunctionDecl =
      dynamic_cast<const ResolvedFunctionDecl *>(resolvedCallee->decl);

  if (!resolvedFunctionDecl)
    return report(call.location, "calling non-function symbol");

  ...
}

Then it's checked if the number of arguments in the call is equivalent to the number of parameters the called function has. If the argument count doesn't match, the semantic analyzer errors out.

std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
  ...

  if (call.arguments.size() != resolvedFunctionDecl->params.size())
    return report(call.location, "argument count mismatch in function call");

  ...
}

When the argument count matches, each of the arguments is resolved and its type is compared against the type of the corresponding parameter in the called function. If one of the arguments has a mismatching type, an error is reported.

std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
  ...

  std::vector<std::unique_ptr<ResolvedExpr>> resolvedArguments;
  int idx = 0;
  for (auto &&arg : call.arguments) {
    varOrReturn(resolvedArg, resolveExpr(*arg));

    if (resolvedArg->type.kind != resolvedFunctionDecl->params[idx]->type.kind)
      return report(resolvedArg->location, "unexpected type of argument");

    ++idx;
    resolvedArguments.emplace_back(std::move(resolvedArg));
  }

  ...
}

If both the callee and the arguments are resolved successfully, a ResolvedCallExpr is returned.

std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
  ...

  return std::make_unique<ResolvedCallExpr>(
      call.location, *resolvedFunctionDecl, std::move(resolvedArguments));
}