The previous section introduced how the source code is
broken down into a list of tokens which are similar to the
building blocks (nouns, verbs, etc.) of sentences in a
spoken language. The
This section talks about the parser.
sentence
is valid in the English language because the mentioned
building blocks follow each other in the correct order.
Similarly fn main(): void {}
is a valid
function declaration in your language for the same
reason.
The parser validates if the tokens follow each other in a valid order and constructs the Abstract Syntax Tree (AST). The AST is an intermediate representation that encapsulates the structure of the source code and later helps the compiler to reason about what a piece of source code means.
So far the language only supports function declarations with
no parameters, which can be modeled as a single AST node. To
make extending the language easier, a
Decl
node is introduced first, which serves as
the base class for all declarations. Each declaration has a
source location and an identifier, so these values are
stored in the base class.
struct Decl {
SourceLocation location;
std::string identifier;
Decl(SourceLocation location, std::string identifier)
: location(location),
identifier(std::move(identifier)) {}
virtual ~Decl() = default;
virtual void dump(size_t level = 0) const = 0;
};
To make visualizing the AST easier, each node has a
dump()
method, that can be used to print the
textual representation of that node.
struct Decl {
...
virtual void dump(size_t level = 0) const = 0;
};
Currently, the only Decl
in the language is the
FunctionDecl
, which in addition to what every
declaration has in common, also has a return type and a
body.
struct FunctionDecl : public Decl {
Type type;
std::unique_ptr<Block> body;
FunctionDecl(SourceLocation location,
std::string identifier,
Type type,
std::unique_ptr<Block> body)
: Decl(location, std::move(identifier)),
type(std::move(type)),
body(std::move(body)) {}
void dump(size_t level = 0) const override;
};
To make the dumping of the node easier the
indent()
helper is introduced, which returns
the indentation of a given level. For the indentation of
each level, 2 spaces are used.
std::string indent(size_t level) { return std::string(level * 2, ' '); }
The textual representation of a
FunctionDecl
includes the name of the node, the
identifier that represents the function, the return type and
the body of the function.
void FunctionDecl::dump(size_t level) const {
std::cerr << indent(level) << "FunctionDecl: " << identifier << ':'
<< type.name << '\n';
body->dump(level + 1);
}
The body of the function is represented by a
Block
node, which can currently only be empty,
though this changes soon. A Block
is a
standalone node without a base class as there are no other
elements in the language, that share the same functionality.
struct Block {
SourceLocation location;
Block(SourceLocation location)
: location(location) {}
void dump(size_t level = 0) const;
};
Because a Block
doesn't have any child nodes,
its textual representation only includes the name of the
node.
void Block::dump(size_t level) const {
std::cerr << indent(level) << "Block\n";
}
Design Note
Lately, some compiler engineers started using
std::variant
instead of inheritance to model the AST, where the variant acts as a union of nodes.using Decl = std::variant<FunctionDecl, ParamDecl, VarDecl>;
The complexity of this design arises once the nodes need to reference each other, which will be unavoidable later with expressions like
(1 + 2) * 3
.The
( ... )
is called a grouping expression and acts as a wrapper around other expressions, which is difficult to model using a variant.using Expr = std::variant<..., GroupingExpr>; struct GroupingExpr { Expr innerExpr; };
The snippet above doesn't compile because the compiler needs to know the size of
GroupingExpr
. The size of the node however depends on the size of thestd::variant
, which contains theGroupingExpr
too, so for the compiler to know the size of thestd::variant
, it has to know the size of theGroupingExpr
first. The workaround is to use a pointer.using Expr = std::variant<..., GroupingExpr>; struct GroupingExpr { Expr *innerExpr; };
In this case, the question is, who owns the memory for the
innerExpr
field? Who allocates it, who is responsible for freeing it, etc. The workaround for this problem is to use astd::unique_ptr
.struct GroupingExpr { std::unique_ptr<Expr> innerExpr; };
Now it's clear that the node is the owner of its child node. However, to know the current type of the variant,
innerExpr
needs to be type-checked. The same type checking however could also be performed on the pointer itself ifExpr
was a polymorphic base class. To avoid complexities, this compiler keeps things simple and uses inheritance to model the AST.
A type is a set of values with the same properties. The
language only supports the void
,
number
and custom
types, which are
encapsulated by the Type
class. The actual kind
of the type, that an instance of the Type
class
holds is represented by the Kind
enum.
struct Type {
enum class Kind { Void, Number, Custom };
...
};
An instance of the Type
class holds one of
these kinds and the textual representation of the
corresponding type.
struct Type {
...
Kind kind;
std::string name;
...
};
To make the usage of the class safer and prevent accidental
invalid type instantiation, the constructor of the class is
private
and a set of static
member
functions are provided that instantiate and return the
correct Type
.
struct Type {
...
static Type builtinVoid() { return {Kind::Void, "void"}; }
static Type builtinNumber() { return {Kind::Number, "number"}; }
static Type custom(const std::string &name) { return {Kind::Custom, name}; }
private:
Type(Kind kind, std::string name) : kind(kind), name(std::move(name)){};
};
The custom
type acts as a placeholder for
user-defined types, but it is not accepted as a valid type
by the compiler for now.
Design Note
Theoretically, a function is also a separate type, so in a more complex language with a more complex type system, this should also be encapsulated somehow.
In C++ functions can also be stored in variables or passed as parameters, which calls for a dedicated function type. To be able to model the complexity of C++ types precisely,
Clang
uses a layer-based type system, where each layer is a different higher-level type.An
int *
is represented using 2 layers, one for theint
and one for the*
.int *x; -PointerType 'int *' `-BuiltinType 'int'
A function with a complex parameter on the other hand can span across multiple layers.
int main(int argc, const char **argv) FunctionProtoType 'int (int, const char **)' |-BuiltinType 'int' |-BuiltinType 'int' `-PointerType 'const char **' `-PointerType 'const char *' `-QualType 'const char' const `-BuiltinType 'char'
The parser iterates over the tokens returned by the lexer and constructs the AST of the program. As the lexer always points to the character that is to be processed next, the parser points to the token that is to be processed next.
The Parser
takes the Lexer
as a
dependency and immediately invokes it upon instantiation to
get the first token to process.
class Parser {
Lexer *lexer;
Token nextToken;
public:
explicit Parser(Lexer &lexer)
: lexer(&lexer),
nextToken(lexer.getNextToken()) {}
};
Once the parser finishes processing the next token, it calls
the eatNextToken()
helper, which consumes it
and calls the lexer for the following one.
void Parser::eatNextToken() {
nextToken = lexer->getNextToken();
}
The following figure visualizes how the parser processes the
tokens and where the nextToken
field points to.
'^' marks the content of the 'nextToken' field
Parser::Parser()
┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐
│ fn │ │ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ ...
└────┘ └──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘
^
eatNextToken()
┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐
│ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ ...
└──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘ └───┘
^
eatNextToken()
┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐ ┌─────┐
│ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ │ EOF │
└───┘ └───┘ └───┘ └──────┘ └───┘ └───┘ └─────┘
^
The logic for processing the tokens lives in the
parseSourceFile()
method. The core of it is a
loop that iterates until the Eof
token is
encountered.
// <sourceFile>
// ::= <functionDecl>* EOF
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
while (nextToken.kind != TokenKind::Eof) {
...
}
...
}
Because so far the language only supports function
declarations, which begin with fn
, the parser
checks if the next token to be processed is
KwFn
and parses the
FunctionDecl
accordingly. If the next token is
not what it expects an error is reported to the user.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
while (nextToken.kind != TokenKind::Eof) {
if (nextToken.kind != TokenKind::KwFn) {
report(nextToken.location,
"only function declarations are allowed on the top level");
...
}
auto fn = parseFunctionDecl();
...
}
...
}
Since the rest of the parser mostly works with pointers, the
report()
function always returns a
nullptr
to make stopping on errors more
convenient. It also comes with an optional
isWarning
flag to denote that the message to
report is not an error.
std::nullptr_t report(SourceLocation location, std::string_view message,
bool isWarning = false) {
const auto &[file, line, col] = location;
std::cerr << file << ':' << line << ':' << col << ':'
<< (isWarning ? " warning: " : " error: ") << message << '\n';
return nullptr;
}
The successfully parsed function declarations are then
collected into a
std::vector
that also acts as the root of the
AST.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
std::vector<std::unique_ptr<FunctionDecl>> functions;
while (nextToken.kind != TokenKind::Eof) {
...
if (!fn) {
...
}
functions.emplace_back(std::move(fn));
}
...
}
The source code might be invalid and the parser fails to
process it completely. In that case, the AST is incomplete,
which is marked by the
incompleteAST
flag.
class Parser {
...
bool incompleteAST = false;
...
};
Once the processing of the tokens has finished, the successfully parsed functions are returned along with the indicator of whether the AST is complete or not.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
return {std::move(functions), !incompleteAST};
}
Failing to parse a function declaration doesn't mean that
the parsing of the whole file should be stopped. It's known
that a function declaration always begins with the
fn
keyword and respectively, if the
fn
keyword is seen, it can only mean the
beginning of a function. This information can be used in the
parser to recover from errors.
┌────┐ ┌───┐ ┌───┐ ┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐
│ fn │ │ : │ │ : │ │ fn │ │ main │ │ ( │ │ ) │ │ : │ ...
└────┘ └───┘ └───┘ └────┘ └──────┘ └───┘ └───┘ └───┘
^ ^
| └ the next function begins here
└ error: unexpected token
If an error happens, every token can be skipped until the
next KwFn
token or Eof
is seen.
This process is called the synchronization of the parser on
the KwFn
token and it is handled by the
synchronizeOn
method. When synchronization
happens, the AST is guaranteed to be incomplete, so the
corresponding flag is set too.
void Parser::synchronizeOn(TokenKind kind) {
incompleteAST = true;
while (nextToken.kind != kind && nextToken.kind != TokenKind::Eof)
eatNextToken();
}
The parser synchronizes in the
parseSourceFile()
method every time it
encounters an unexpected token or it fails to parse a
function declaration.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
if (nextToken.kind != TokenKind::KwFn) {
...
synchronizeOn(TokenKind::KwFn);
continue;
}
...
if (!fn) {
synchronizeOn(TokenKind::KwFn);
continue;
}
...
}
A function declaration always starts with the
KwFn
token, which is followed by the
Identifier
, (
, )
,
:
tokens, a Type
specifier and a
Block
at the end.
To make the work in the parser more convenient, the
varOrReturn()
and
matchOrReturn()
helper macros are introduced.
The former checks if the parsing of a node was successful
and creates a variable that stores the result, or returns a
nullptr
on error.
#define varOrReturn(var, init) \
auto var = (init); \
if (!var) \
return nullptr;
The latter checks if nextToken
is of the
expected kind and reports an error otherwise. The
report()
helper also returns a
nullptr
, so the parsing of the current node
stops too.
#define matchOrReturn(tok, msg) \
if (nextToken.kind != tok) \
return report(nextToken.location, msg);
The parseFunctionDecl()
method expects the
current token to be KwFn
, saves its location as
the beginning of the function and checks if the rest of the
tokens are in the correct order.
// <functionDecl>
// ::= 'fn' <identifier> '(' ')' ':' <type> <block>
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat fn
...
}
The fn
keyword is followed by an identifier.
For the Identifier
token, the textual
representation of the identifier is stored in the
value
field. This value is saved before the
token is eaten.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
matchOrReturn(TokenKind::Identifier, "expected identifier");
std::string functionIdentifier = *nextToken.value;
eatNextToken(); // eat identifier
...
}
The next tokens denoting the start and end of the argument list are single-character tokens, which don't require any special handling.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
matchOrReturn(TokenKind::Lpar, "expected '('");
eatNextToken(); // eat '('
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
...
}
The argument list is followed by a :
and a type
specifier, which is parsed by a dedicated
parseType()
helper.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
matchOrReturn(TokenKind::Colon, "expected ':'");
eatNextToken(); // eat ':'
varOrReturn(type, parseType());
...
}
Finally, the Block
is parsed by the
parseBlock()
method. Similarly to the current
method, parseBlock()
also expects the first
token to be the start of the block, so that token is checked
before calling the function.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
matchOrReturn(TokenKind::Lbrace, "expected function body");
varOrReturn(block, parseBlock());
...
}
If everything is successful, the
FunctionDecl
node is returned.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
return std::make_unique<FunctionDecl>(location, functionIdentifier, *type,
std::move(block));
}
Parsing the type has been extracted into a dedicated helper
method so that it can be reused later when the language is
extended. The number
type is handled in a later
chapter as so far no token can represent it.
This method checks if the current token is
KwVoid
or Identifier
and returns
the corresponding void
or
custom
types and reports an error otherwise.
// <type>
// ::= 'number'
// | 'void'
// | <identifier>
std::optional<Type> Parser::parseType() {
TokenKind kind = nextToken.kind;
if (kind == TokenKind::KwVoid) {
eatNextToken(); // eat 'void'
return Type::builtinVoid();
}
if (kind == TokenKind::Identifier) {
auto t = Type::custom(*nextToken.value);
eatNextToken(); // eat identifier
return t;
}
report(nextToken.location, "expected type specifier");
return std::nullopt;
};
The parseBlock()
method is also simple at this
point. It checks for the correct token order and returns a
Block
upon success or a
nullptr
upon error.
// <block>
// ::= '{' '}'
std::unique_ptr<Block> Parser::parseBlock() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat '{'
matchOrReturn(TokenKind::Rbrace, "expected '}' at the end of a block");
eatNextToken(); // eat '}'
return std::make_unique<Block>(location);
}
Every source file must contain a
main()
function. If no such function is found,
an error should be reported on the Eof
token.
Because the parser is the last stage of the compilation
pipeline that has access to the tokens, the error is emitted
at the end of parseSourceFile()
.
To implement this, the parser first iterates over the successfully parsed functions and remembers if any of them is called main.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
while (...) {
...
}
bool hasMainFunction = false;
for (auto &&fn : functions)
hasMainFunction |= fn->identifier == "main";
...
}
If main()
is not found and the AST is complete
an error is reported. In the case of an incomplete AST it
might have been parsing the main()
function
that caused the syntax error, so nothing is reported to
avoid false positives.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
if (!hasMainFunction && !incompleteAST)
report(nextToken.location, "main function not found");
...
}
The returned indicator is also updated such that the AST is
only considered complete if it contains the
main()
function too.
std::pair<std::vector<std::unique_ptr<FunctionDecl>>, bool>
Parser::parseSourceFile() {
...
return {..., !incompleteAST && hasMainFunction};
}
Like spoken languages, programming languages also have a grammar, which describes which order of tokens is valid in the language and which isn't. The parser is an automaton that recognizes this grammar and only accepts source files, which are valid according to it.
For example fn main(): void {}
is a valid
function declaration, while void main() {}
is
not recognized as a function declaration, because the type
annotation is missing.
There are multiple formats for defining the grammar of a programming language, one of which is the Backus-Naur Form or BNF, which could already be seen as comments above the parser methods.
The BNF defines terminal and non-terminal symbols. Non-terminal symbols are usually enclosed by < and > and they can be extended further.
The conventional notation to describe symbols looks like
<non-terminal> ::= expansion | expansion | expansion
. An expansion can consist of multiple terminal and
non-terminal symbols as well. To express the number of times
a symbol can occur after each other, ?
is used
to denote 0-1 occurrence and *
is used to
denote any number of occurrences. If an element is not
followed by one of these symbols, that means it occurs
exactly once.
<sourceFile>
::= <digit>*
| <digit>? 'Hello World'
<digit>
::= '0'
| '1'
A parser that recognizes the grammar defined by the BNF above can accept the following inputs.
input: 'Hello World'
<sourceFile>: <digit>? 'Hello World'
input: 0 'Hello World'
<sourceFile>: <digit>? 'Hello World'
<digit>: '0'
input: 1 0 1
<sourceFile>: <digit>*
<digit>: '1'
<digit>: '0'
<digit>: '1'
input: ( )
<sourceFile>: no expansions can be applied
Understanding how the parser works is crucial to designing the syntax of a language. It might be tempting to introduce a certain syntax, but it can easily increase the difficulty of parsing that language and can even make expanding a grammar rule dependent on the semantics of the source code.
As an example take a look at the function declaration syntax of C++ and why this syntax makes C++ difficult to parse.
int foo(int);
declares a function named
foo
, which returns an int
and
accepts an int
as a parameter.
int foo(0);
is also a valid C++ code, that
declares an int
variable and initializes it to
0
.
The issue arises when
int foo(x);
is encountered by the parser. Since
C++ allows the creation of user-defined types,
x
can either be a type or a value. If
x
is a type, the above sequence of tokens is a
function declaration, if x
is a value, it is a
variable declaration.
struct x;
int foo(x); // declares a function
int x = 0;
int foo(x); // declares a variable
As a result, C++ cannot be parsed without also reasoning about what the source code that has already been parsed means.
When the same sequence of symbols can have a different meaning based on what context they appear in, the grammar is called ambiguous. C++ is known to have multiple ambiguities in its grammar, though some are inherited from C such as the pointer syntax.
typedef char a;
a * b; // declares 'b', a pointer to 'a'
int a = 0, b = 0;
a * b; // multiplies 'a' with 'b'
A similar inherited ambiguity is the cast syntax.
typedef char a;
int b;
(a) - b; // negate 'b' and cast it to 'a'
int a = 0, b = 0;
(a) - b; // subtract 'b' from 'a'
A well-known source of ambiguity in programming languages is
the generic syntax. Consider the following generic function
call, which can appear in both C++ and Kotlin
function<type>(argument)
. For the parser,
this is a sequence of Identifier
,
<
, Identifier
, >
,
(
, Identifier
and )
.
Is it really a generic function call or a series of
comparisons?
template<typename T>
void f(T a) {}
int a;
using t = int;
f<t>(a); // call to template function
int f, t, a;
f<t>(a); // comparison
The source of the problem is that <
can
either mean the start of a generic argument list or the
less-than operator. Rust resolved this ambiguity by
introducing the turbofish (::<>
). The Rust
parser knows that <
always means the
less-than operator in confusing situations because a generic
argument list must begin with ::
followed by
the <
.
fn f<T>() {}
f::<i32>();