The Abstract Syntax Tree

The previous section introduced how the source code is broken down into a list of tokens which are similar to the building blocks (nouns, verbs, etc.) of sentences in a spoken language. The This section talks about the parser. sentence is valid in the English language because the mentioned building blocks follow each other in the correct order. Similarly fn main(): void {} is a valid function declaration in your language for the same reason.

The parser validates if the tokens follow each other in a valid order and constructs the Abstract Syntax Tree (AST). The AST is an intermediate representation that encapsulates the structure of the source code and later helps the compiler to reason about what a piece of source code means.

So far the language only supports function declarations with no parameters, which can be modeled as a single AST node. To make extending the language easier, a Decl node is introduced first, which serves as the base class for all declarations. Each declaration has a source location and an identifier, so these values are stored in the base class.

struct Decl {
  SourceLocation location;
  std::string identifier;

  Decl(SourceLocation location, std::string identifier)
      : location(location),
        identifier(std::move(identifier)) {}
  virtual ~Decl() = default;

  virtual void dump(size_t level = 0) const = 0;
};

To make visualizing the AST easier, each node has a dump() method, that can be used to print the textual representation of that node.

struct Decl {
  ...

  virtual void dump(size_t level = 0) const = 0;
};

Currently, the only Decl in the language is the FunctionDecl, which in addition to what every declaration has in common, also has a return type and a body.

struct FunctionDecl : public Decl {
  Type type;
  std::unique_ptr<Block> body;

  FunctionDecl(SourceLocation location,
               std::string identifier,
               Type type,
               std::unique_ptr<Block> body)
      : Decl(location, std::move(identifier)),
        type(std::move(type)),
        body(std::move(body)) {}

  void dump(size_t level = 0) const override;
};

To make the dumping of the node easier the indent() helper is introduced, which returns the indentation of a given level. For the indentation of each level, 2 spaces are used.

std::string indent(size_t level) { return std::string(level * 2, ' '); }

The textual representation of a FunctionDecl includes the name of the node, the identifier that represents the function, the return type and the body of the function.

void FunctionDecl::dump(size_t level) const {
  std::cerr << indent(level) << "FunctionDecl: " << identifier << ':'
            << type.name << '\n';

  body->dump(level + 1);
}

The body of the function is represented by a Block node, which can currently only be empty, though this changes soon. A Block is a standalone node without a base class as there are no other elements in the language, that share the same functionality.

struct Block {
  SourceLocation location;

  Block(SourceLocation location)
      : location(location) {}

  void dump(size_t level = 0) const;
};

Because a Block doesn't have any child nodes, its textual representation only includes the name of the node.

void Block::dump(size_t level) const {
  std::cerr << indent(level) << "Block\n";
}

Design Note

Lately, some compiler engineers started using std::variant instead of inheritance to model the AST, where the variant acts as a union of nodes.

using Decl = std::variant<FunctionDecl, ParamDecl, VarDecl>;

The complexity of this design arises once the nodes need to reference each other, which will be unavoidable later with expressions like (1 + 2) * 3.

The ( ... ) is called a grouping expression and acts as a wrapper around other expressions, which is difficult to model using a variant.

using Expr = std::variant<..., GroupingExpr>;

struct GroupingExpr {
  Expr innerExpr;
};

The snippet above doesn't compile because the compiler needs to know the size of GroupingExpr. The size of the node however depends on the size of the std::variant, which contains the GroupingExpr too, so for the compiler to know the size of the std::variant, it has to know the size of the GroupingExpr first. The workaround is to use a pointer.

using Expr = std::variant<..., GroupingExpr>;

struct GroupingExpr {
  Expr *innerExpr;
};

In this case, the question is, who owns the memory for the innerExpr field? Who allocates it, who is responsible for freeing it, etc. The workaround for this problem is to use a std::unique_ptr.

struct GroupingExpr {
  std::unique_ptr<Expr> innerExpr;
};

Now it's clear that the node is the owner of its child node. However, to know the current type of the variant, innerExpr needs to be type-checked. The same type checking however could also be performed on the pointer itself if Expr was a polymorphic base class. To avoid complexities, this compiler keeps things simple and uses inheritance to model the AST.

Types

A type is a set of values with the same properties. The language only supports the void, number and custom types, which are encapsulated by the Type class. The actual kind of the type, that an instance of the Type class holds is represented by the Kind enum.

struct Type {
  enum class Kind { Void, Number, Custom };

  ...
};

An instance of the Type class holds one of these kinds and the textual representation of the corresponding type.

struct Type {
  ...

  Kind kind;
  std::string name;

  ...
};

To make the usage of the class safer and prevent accidental invalid type instantiation, the constructor of the class is private and a set of static member functions are provided that instantiate and return the correct Type.

struct Type {
  ...

  static Type builtinVoid() { return {Kind::Void, "void"}; }
  static Type builtinNumber() { return {Kind::Number, "number"}; }
  static Type custom(const std::string &name) { return {Kind::Custom, name}; }

private:
  Type(Kind kind, std::string name) : kind(kind), name(std::move(name)){};
};

The custom type acts as a placeholder for user-defined types, but it is not accepted as a valid type by the compiler for now.

Design Note

Theoretically, a function is also a separate type, so in a more complex language with a more complex type system, this should also be encapsulated somehow.

In C++ functions can also be stored in variables or passed as parameters, which calls for a dedicated function type. To be able to model the complexity of C++ types precisely, Clang uses a layer-based type system, where each layer is a different higher-level type.

An int * is represented using 2 layers, one for the int and one for the *.

int *x;

-PointerType 'int *'
 `-BuiltinType 'int'

A function with a complex parameter on the other hand can span across multiple layers.

int main(int argc, const char **argv)

FunctionProtoType 'int (int, const char **)'
|-BuiltinType 'int'
|-BuiltinType 'int'
`-PointerType 'const char **'
  `-PointerType 'const char *'
    `-QualType 'const char' const
      `-BuiltinType 'char'

The Parser

The parser iterates over the tokens returned by the lexer and constructs the AST of the program. As the lexer always points to the character that is to be processed next, the parser points to the token that is to be processed next.

The Parser takes the Lexer as a dependency and immediately invokes it upon instantiation to get the first token to process.

class Parser {
  Lexer *lexer;
  Token nextToken;

public:
  explicit Parser(Lexer &lexer)
      : lexer(&lexer),
        nextToken(lexer.getNextToken()) {}
};

Once the parser finishes processing the next token, it calls the eatNextToken() helper, which consumes it and calls the lexer for the following one.

void Parser::eatNextToken() { 
  nextToken = lexer->getNextToken(); 
}

The following figure visualizes how the parser processes the tokens and where the nextToken field points to.

'^' marks the content of the 'nextToken' field

Parser::Parser()
┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ 
│ fn │ │ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ ...
└────┘ └──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘ 
  ^

eatNextToken()
┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐
│ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ ...
└──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘ └───┘
  ^

eatNextToken()
┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐ ┌─────┐
│ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ │ EOF │
└───┘ └───┘ └───┘ └──────┘ └───┘ └───┘ └─────┘
  ^

The logic for processing the tokens lives in the parseSourceFile() method. The core of it is a loop that iterates until the Eof token is encountered.

// <sourceFile>
//  ::= <functionDecl>* EOF
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...

  while (nextToken.kind != TokenKind::Eof) {
    ...
  }

  ...
}

Because so far the language only supports function declarations, which begin with fn, the parser checks if the next token to be processed is KwFn and parses the FunctionDecl accordingly. If the next token is not what it expects an error is reported to the user.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...

  while (nextToken.kind != TokenKind::Eof) {
    if (nextToken.kind != TokenKind::KwFn) {
      report(nextToken.location,
             "only function declarations are allowed on the top level");
      ...
    }

    auto fn = parseFunctionDecl();
    ...
  }

  ...
}

Since the rest of the parser mostly works with pointers, the report() function always returns a nullptr to make stopping on errors more convenient. It also comes with an optional isWarning flag to denote that the message to report is not an error.

std::nullptr_t report(SourceLocation location, std::string_view message,
                  bool isWarning = false) {
const auto &[file, line, col] = location;

std::cerr << file << ':' << line << ':' << col << ':'
        << (isWarning ? " warning: " : " error: ") << message << '\n';

return nullptr;
}

The successfully parsed function declarations are then collected into a std::vector that also acts as the root of the AST.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  std::vector<std::unique_ptr<FunctionDecl>> functions;

  while (nextToken.kind != TokenKind::Eof) {
    ...

    if (!fn) {
      ...
    }

    functions.emplace_back(std::move(fn));
  }

  ...
}

The source code might be invalid and the parser fails to process it completely. In that case, the AST is incomplete, which is marked by the incompleteAST flag.

class Parser {
  ...
  bool incompleteAST = false;
  ...
};

Once the processing of the tokens has finished, the successfully parsed functions are returned along with the indicator of whether the AST is complete or not.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...

  return {std::move(functions), !incompleteAST};
}

Error recovery

Failing to parse a function declaration doesn't mean that the parsing of the whole file should be stopped. It's known that a function declaration always begins with the fn keyword and respectively, if the fn keyword is seen, it can only mean the beginning of a function. This information can be used in the parser to recover from errors.

┌────┐ ┌───┐ ┌───┐ ┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐
│ fn │ │ : │ │ : │ │ fn │ │ main │ │ ( │ │ ) │ │ : │ ...
└────┘ └───┘ └───┘ └────┘ └──────┘ └───┘ └───┘ └───┘ 
         ^           ^
         |           └ the next function begins here
         └ error: unexpected token 

If an error happens, every token can be skipped until the next KwFn token or Eof is seen. This process is called the synchronization of the parser on the KwFn token and it is handled by the synchronizeOn method. When synchronization happens, the AST is guaranteed to be incomplete, so the corresponding flag is set too.

void Parser::synchronizeOn(TokenKind kind) {
  incompleteAST = true;

  while (nextToken.kind != kind && nextToken.kind != TokenKind::Eof)
    eatNextToken();
}

The parser synchronizes in the parseSourceFile() method every time it encounters an unexpected token or it fails to parse a function declaration.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...

    if (nextToken.kind != TokenKind::KwFn) {
      ...
      synchronizeOn(TokenKind::KwFn);
      continue;
    }

    ...

    if (!fn) {
      synchronizeOn(TokenKind::KwFn);
      continue;
    }

  ...
}

Parsing Functions

A function declaration always starts with the KwFn token, which is followed by the Identifier, (, ), : tokens, a Type specifier and a Block at the end.

To make the work in the parser more convenient, the varOrReturn() and matchOrReturn() helper macros are introduced.

The former checks if the parsing of a node was successful and creates a variable that stores the result, or returns a nullptr on error.

#define varOrReturn(var, init)               \
  auto var = (init);                         \
  if (!var)                                  \
    return nullptr;

The latter checks if nextToken is of the expected kind and reports an error otherwise. The report() helper also returns a nullptr, so the parsing of the current node stops too.

#define matchOrReturn(tok, msg)              \
  if (nextToken.kind != tok)                 \
    return report(nextToken.location, msg);

The parseFunctionDecl() method expects the current token to be KwFn, saves its location as the beginning of the function and checks if the rest of the tokens are in the correct order.

// <functionDecl>
//  ::= 'fn' <identifier> '(' ')' ':' <type> <block>
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  SourceLocation location = nextToken.location;
  eatNextToken(); // eat fn

  ...
}

The fn keyword is followed by an identifier. For the Identifier token, the textual representation of the identifier is stored in the value field. This value is saved before the token is eaten.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  ...

  matchOrReturn(TokenKind::Identifier, "expected identifier");

  std::string functionIdentifier = *nextToken.value;
  eatNextToken(); // eat identifier

  ...
}

The next tokens denoting the start and end of the argument list are single-character tokens, which don't require any special handling.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  ...

  matchOrReturn(TokenKind::Lpar, "expected '('");
  eatNextToken(); // eat '('

  matchOrReturn(TokenKind::Rpar, "expected ')'");
  eatNextToken(); // eat ')'

  ...
}

The argument list is followed by a : and a type specifier, which is parsed by a dedicated parseType() helper.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  ...
  matchOrReturn(TokenKind::Colon, "expected ':'");
  eatNextToken(); // eat ':'
  
  varOrReturn(type, parseType());

  ...
}

Finally, the Block is parsed by the parseBlock() method. Similarly to the current method, parseBlock() also expects the first token to be the start of the block, so that token is checked before calling the function.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  ...

  matchOrReturn(TokenKind::Lbrace, "expected function body");
  varOrReturn(block, parseBlock());

  ...
}

If everything is successful, the FunctionDecl node is returned.

std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
  ...
  return std::make_unique<FunctionDecl>(location, functionIdentifier, *type,
                                        std::move(block));
}

Parsing the type has been extracted into a dedicated helper method so that it can be reused later when the language is extended. The number type is handled in a later chapter as so far no token can represent it.

This method checks if the current token is KwVoid or Identifier and returns the corresponding void or custom types and reports an error otherwise.

// <type>
//  ::= 'number'
//  |   'void'
//  |   <identifier>
std::optional<Type> Parser::parseType() {
  TokenKind kind = nextToken.kind;

  if (kind == TokenKind::KwVoid) {
    eatNextToken(); // eat 'void'
    return Type::builtinVoid();
  }

  if (kind == TokenKind::Identifier) {
    auto t = Type::custom(*nextToken.value);
    eatNextToken(); // eat identifier
    return t;
  }

  report(nextToken.location, "expected type specifier");
  return std::nullopt;
};

The parseBlock() method is also simple at this point. It checks for the correct token order and returns a Block upon success or a nullptr upon error.

// <block>
//  ::= '{' '}'
std::unique_ptr<Block> Parser::parseBlock() {
  SourceLocation location = nextToken.location;
  eatNextToken(); // eat '{'

  matchOrReturn(TokenKind::Rbrace, "expected '}' at the end of a block");
  eatNextToken(); // eat '}'

  return std::make_unique<Block>(location);
}

Every source file must contain a main() function. If no such function is found, an error should be reported on the Eof token. Because the parser is the last stage of the compilation pipeline that has access to the tokens, the error is emitted at the end of parseSourceFile().

To implement this, the parser first iterates over the successfully parsed functions and remembers if any of them is called main.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...
  while (...) {
    ...
  }
  
  bool hasMainFunction = false;
  for (auto &&fn : functions)
    hasMainFunction |= fn->identifier == "main";

  ...
}

If main() is not found and the AST is complete an error is reported. In the case of an incomplete AST it might have been parsing the main() function that caused the syntax error, so nothing is reported to avoid false positives.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
  ...

  if (!hasMainFunction && !incompleteAST)
    report(nextToken.location, "main function not found");

  ...
}

The returned indicator is also updated such that the AST is only considered complete if it contains the main() function too.

std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
  Parser::parseSourceFile() {
    ...
  
    return {..., !incompleteAST && hasMainFunction};
  }

Grammar

Like spoken languages, programming languages also have a grammar, which describes which order of tokens is valid in the language and which isn't. The parser is an automaton that recognizes this grammar and only accepts source files, which are valid according to it.

For example fn main(): void {} is a valid function declaration, while void main() {} is not recognized as a function declaration, because the type annotation is missing.

There are multiple formats for defining the grammar of a programming language, one of which is the Backus-Naur Form or BNF, which could already be seen as comments above the parser methods.

The BNF defines terminal and non-terminal symbols. Non-terminal symbols are usually enclosed by < and > and they can be extended further.

The conventional notation to describe symbols looks like <non-terminal> ::= expansion | expansion | expansion . An expansion can consist of multiple terminal and non-terminal symbols as well. To express the number of times a symbol can occur after each other, ? is used to denote 0-1 occurrence and * is used to denote any number of occurrences. If an element is not followed by one of these symbols, that means it occurs exactly once.

<sourceFile>
  ::= <digit>*
  |   <digit>? 'Hello World'

<digit>
  ::= '0' 
  |   '1'

A parser that recognizes the grammar defined by the BNF above can accept the following inputs.

input:                 'Hello World'
<sourceFile>: <digit>? 'Hello World'

input:           0     'Hello World'
<sourceFile>: <digit>? 'Hello World'
<digit>:        '0'            

input:           1     0     1
<sourceFile>: <digit>*
<digit>:        '1' 
<digit>:              '0'
<digit>:                    '1'

input:           (     )
<sourceFile>: no expansions can be applied

Language Design

Understanding how the parser works is crucial to designing the syntax of a language. It might be tempting to introduce a certain syntax, but it can easily increase the difficulty of parsing that language and can even make expanding a grammar rule dependent on the semantics of the source code.

As an example take a look at the function declaration syntax of C++ and why this syntax makes C++ difficult to parse.

int foo(int); declares a function named foo, which returns an int and accepts an int as a parameter. int foo(0); is also a valid C++ code, that declares an int variable and initializes it to 0.

The issue arises when int foo(x); is encountered by the parser. Since C++ allows the creation of user-defined types, x can either be a type or a value. If x is a type, the above sequence of tokens is a function declaration, if x is a value, it is a variable declaration.

struct x;
int foo(x); // declares a function

int x = 0;
int foo(x); // declares a variable

As a result, C++ cannot be parsed without also reasoning about what the source code that has already been parsed means.

When the same sequence of symbols can have a different meaning based on what context they appear in, the grammar is called ambiguous. C++ is known to have multiple ambiguities in its grammar, though some are inherited from C such as the pointer syntax.

typedef char a;
a * b; // declares 'b', a pointer to 'a'

int a = 0, b = 0;
a * b; // multiplies 'a' with 'b'

A similar inherited ambiguity is the cast syntax.

typedef char a;
int b;
(a) - b; // negate 'b' and cast it to 'a'
  
int a = 0, b = 0;
(a) - b; // subtract 'b' from 'a'

A well-known source of ambiguity in programming languages is the generic syntax. Consider the following generic function call, which can appear in both C++ and Kotlin function<type>(argument). For the parser, this is a sequence of Identifier, <, Identifier, >, (, Identifier and ). Is it really a generic function call or a series of comparisons?

template<typename T>
void f(T a) {}

int a;
using t = int;
f<t>(a); // call to template function

int f, t, a;
f<t>(a); // comparison

The source of the problem is that < can either mean the start of a generic argument list or the less-than operator. Rust resolved this ambiguity by introducing the turbofish (::<>). The Rust parser knows that < always means the less-than operator in confusing situations because a generic argument list must begin with :: followed by the <.

fn f<T>() {}

f::<i32>();