There is not much to be achieved in a language that only allows function declarations with no parameters and an empty body on the top level of the source file. To be able to create an executable with a series of instructions, the language is extended with a few other elements such as function parameters, a built-in print function, number literals and return statements.
fn wrapper(n: number): number {
return n;
}
fn main(): void {
println(wrapper(12.34));
}
In a programming language, an element that evaluates to some
value is called an expression. For example,
1 + 1 is an expression, as it evaluates to
2 however, return n doesn't
evaluate to anything, so it's not an expression. It only
changes the state of the program as it overrides where the
execution should continue, hence it is called a statement.
Every statement is derived from the Stmt
node, which stores a
SourceLocation and provides a
dump() method that the derived nodes must
override to print their textual representation.
struct Stmt {
SourceLocation location;
Stmt(SourceLocation location)
: location(location) {}
virtual ~Stmt() = default;
virtual void dump(size_t level = 0) const = 0;
};
By definition, every expression is also a statement. When
1 + 1 is evaluated, the result needs to be
stored either in memory or in a register before it can be
processed further, which is also a modification of the
program state. The Expr node acts as the
placeholder node for every expression in the language.
struct Expr : public Stmt {
Expr(SourceLocation location)
: Stmt(location) {}
};
A statement is allowed to span across multiple lines, which
can make it difficult for the parser to figure out where it
ends. To make parsing statements easier, a
';' must be placed at the end of each
statement.
Design Note
If a language doesn't require a
';'at the end of a statement, like Python, Scala, etc. it's the responsibility of the parser to figure out where that statement ends. In some cases, however, it is not a trivial problem. Is the following snippet a function call likefoo(bar);or two separate expressions likefoo; (bar)?foo (bar)Python treats every
'\n'as an end of a statement except for some cases like anifstatement, which only ends afterstmt3and not right after the:.if cond: stmt1 stmt2 stmt3In Scala, on the other hand, the grammar defines a complex set of rules, when a
'\n'terminates a statement and when it is ignored. These rules can be found in the lexical syntax specification of Scala.
The grammar is updated such that statements are only allowed inside blocks and every block is allowed to contain 0 or more statements.
<block>
::= '{' <statement>* '}'
The Block node is also updated to store a list
of statements.
struct Block {
...
std::vector<std::unique_ptr<Stmt>> statements;
Block(..., std::vector<std::unique_ptr<Stmt>> statements)
: ...,
statements(std::move(statements)) {}
...
};
In the textual representation of a block, each of these statements is listed too.
void Block::dump(size_t level) const {
...
for (auto &&stmt : statements)
stmt->dump(level + 1);
}
Parsing the list of statements within the block is done
using a while loop, which iterates until it
sees the } token, which indicates the end of
the block.
std::unique_ptr<Block> Parser::parseBlock() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat '{'
while (true) {
if (nextToken.kind == TokenKind::Rbrace)
break;
...
}
...
}
If the Eof or KwFn tokens are seen
the block is most likely not closed by a }, so
an error is reported and a nullptr is returned.
std::unique_ptr<Block> Parser::parseBlock() {
...
while (true) {
...
if (nextToken.kind == TokenKind::Eof || nextToken.kind == TokenKind::KwFn)
return report(nextToken.location, "expected '}' at the end of a block");
...
}
...
}
If none of the previous cases matches, the parser attempts
to parse a Stmt. If it fails the parser is left
in an invalid state and immediately stops parsing the
Block. Otherwise, the Stmt is
inserted into a statement list.
std::unique_ptr<Block> Parser::parseBlock() {
...
std::vector<std::unique_ptr<Stmt>> statements;
while (true) {
...
varOrReturn(stmt, parseStmt());
statements.emplace_back(std::move(stmt));
}
...
}
If everything is successful, the } at the end
of the block is consumed and a Block node is
returned.
std::unique_ptr<Block> Parser::parseBlock() {
...
eatNextToken(); // eat '}'
return std::make_unique<Block>(location, std::move(statements));
}
So far the language only supports two kinds of statements, the return statement and expressions.
<statement>
::= <returnStmt>
| <expr> ';'
A return statement can either appear inside a void function or a non-void one. If it's in the latter, the return keyword is also followed by an expression, whose result is returned.
<returnStmt>
::= 'return' <expr>? ';'
These new rules introduce the ; token as well
as the return keyword which are currently
unknown to the lexer. To proceed further, the lexer needs to
be extended to handle these tokens.
constexpr char singleCharTokens[] = {..., ';'};
enum class TokenKind : char {
...
KwReturn,
...
Semi = singleCharTokens[6],
};
const std::unordered_map<std::string_view, TokenKind> keywords = {
...
{"return", TokenKind::KwReturn}};
The AST node of the ReturnStmt inherits from
the Stmt class, and contains an
Expr child node.
struct ReturnStmt : public Stmt {
std::unique_ptr<Expr> expr;
ReturnStmt(SourceLocation location, std::unique_ptr<Expr> expr = nullptr)
: Stmt(location),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
The textual representation of this node includes the name of the statement and its contained expression if there is one.
void ReturnStmt::dump(size_t level) const {
std::cerr << indent(level) << "ReturnStmt\n";
if (expr)
expr->dump(level + 1);
}
The parseReturnStmt() method expects the first
token to be KwReturn and stores its location to
mark the beginning of the statement before eating it.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat 'return'
...
}
If the token after KwReturn is not a
; it must be the start of an expression. If the
parsing of that expression fails, the return statement is
invalid and the parser errors out.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
...
std::unique_ptr<Expr> expr;
if (nextToken.kind != TokenKind::Semi) {
expr = parseExpr();
if (!expr)
return nullptr;
}
...
}
Otherwise, the upcoming token is expected to be the
; marking the end of the return statement. If
the token matches, everything is successful and the
ReturnStmt node is returned.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
...
matchOrReturn(TokenKind::Semi,
"expected ';' at the end of a return statement");
eatNextToken(); // eat ';'
return std::make_unique<ReturnStmt>(location, std::move(expr));
}
The driver code for parsing statements lives in the
parseStmt() method. This is the method that
checks the next token and decides how the parser should
proceed. If it sees that the next token is
KwReturn it dispatches
parseReturnStmt() to return the corresponding
node.
std::unique_ptr<Stmt> Parser::parseStmt() {
if (nextToken.kind == TokenKind::KwReturn)
return parseReturnStmt();
...
}
At the moment the language has support for 3 expressions.
numberLiteral for numeric values written in
the source code.
declRefExpr for referencing declarations.
callExpr for calling functions.
The numberLiteral and the
declRefExpr are considered primary expressions.
<primaryExpression>
::= <numberLiteral>
| <declRefExpr>
Primary means that these are the expressions that every other expression is composed of.
1 + x
^ ^
| └ DeclRefExpr
└ NumberLiteral
A numberLiteral in the source code is a
number, which is a list of digits optionally
separated by a '.' and another list of digits.
<numberLiteral>
::= <number>
<number>
::= ('0'..'9')+ ('.' ('0'..'9')+)?
A number in the compiler is represented by a
dedicated Number token.
enum class TokenKind : char {
...
Number,
...
};
When the lexer encounters a numeric character it keeps concatenating the upcoming characters until it sees a non-numeric one.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
std::string value{currentChar};
while (isNum(peekNextChar()))
value += eatNextChar();
...
}
...
}
If the seen non-numeric character is not a '.',
a Number token is returned with the eaten
digits as its value. Otherwise, the lexer proceeds with
concatenating the decimal point to the number.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
...
if (peekNextChar() != '.')
return Token{tokenStartLocation, TokenKind::Number, value};
value += eatNextChar();
...
}
...
}
If the character immediately following the
'.' is not a number, an Unk token
is returned. Otherwise, the fractional part of the number is
consumed and the lexer returns the corresponding
Number token.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
...
if (!isNum(peekNextChar()))
return Token{tokenStartLocation, TokenKind::Unk};
while (isNum(peekNextChar()))
value += eatNextChar();
return Token{tokenStartLocation, TokenKind::Number, value};
}
...
}
In the AST a NumberLiteral is an
Expr, which holds the textual value of the
number.
struct NumberLiteral : public Expr {
std::string value;
NumberLiteral(SourceLocation location, std::string value)
: Expr(location),
value(value) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node is the name of the expression followed by the contained value.
void NumberLiteral::dump(size_t level) const {
std::cerr << indent(level) << "NumberLiteral: '" << value << "'\n";
}
To keep the parser simple, this node doesn't have a
dedicated parser method, instead it is parsed directly in
parsePrimary().
std::unique_ptr<Expr> Parser::parsePrimary() {
SourceLocation location = nextToken.location;
if (nextToken.kind == TokenKind::Number) {
auto literal = std::make_unique<NumberLiteral>(location, *nextToken.value);
eatNextToken(); // eat number
return literal;
}
...
}
A declRefExpr is a simple addition as it is
only a wrapper around an identifier. Its rule
doesn't introduce any tokens, the lexer can't already
handle.
<declRefExpr>
::= <identifier>
The DeclRefExpr node is derived from
Expr and stores the identifier that is being
referenced.
struct DeclRefExpr : public Expr {
std::string identifier;
DeclRefExpr(SourceLocation location, std::string identifier)
: Expr(location),
identifier(identifier) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node contains its name along with the referenced identifier.
void DeclRefExpr::dump(size_t level) const {
std::cerr << indent(level) << "DeclRefExpr: " << identifier << '\n';
}
Because the parser only needs to eat one token, this node is
also parsed in parsePrimary().
std::unique_ptr<Expr> Parser::parsePrimary() {
...
if (nextToken.kind == TokenKind::Identifier) {
auto declRefExpr =
std::make_unique<DeclRefExpr>(location, *nextToken.value);
eatNextToken(); // eat identifier
return declRefExpr;
}
...
}
There are no other primary expressions for now, so if
parsePrimary() sees any other token it reports
an error.
std::unique_ptr<Expr> Parser::parsePrimary() {
...
return report(location, "expected expression");
}
The callExpr is categorized as a postfix
expression because, in the source code it is an argument
list standing behind a primary expression.
<postfixExpression>
::= <primaryExpression> <argumentList>
<argumentList>
::= '(' (<expr> (',' <expr>)* ','?)? ')'
The arguments of a function call are expressions, so
variable references and values returned by other function
calls can also be passed on to a different function like
foo(0, x, bar());.
Design Note
A question might arise, why is there an optional
','at the end of the argument list? That is called a trailing comma and allows developers to write snippets likefoo(1.0, x,);.This feature can come in handy when the argument list spans across multiple lines and a new argument needs to be added later. Because the previous line doesn't have to be modified, the version control diff is smaller.
foo( 0, 1, + 2, );It also makes it easier to reorder the arguments or generate source files with function calls if needed.
At the end of the day using the trailing comma is up to personal preferences, but since they have more than one pros and no cons, there is no reason not to support them.
The argumentList rule also introduces a new
single character token, the , which needs to be
taught to the lexer.
constexpr char singleCharTokens[] = {..., ','};
enum class TokenKind : char {
...
Comma = singleCharTokens[7],
};
The CallExpr node stores the callee and a
vector of expressions for the arguments.
struct CallExpr : public Expr {
std::unique_ptr<Expr> callee;
std::vector<std::unique_ptr<Expr>> arguments;
CallExpr(SourceLocation location,
std::unique_ptr<Expr> callee,
std::vector<std::unique_ptr<Expr>> arguments)
: Expr(location),
callee(std::move(callee)),
arguments(std::move(arguments)) {}
void dump(size_t level = 0) const override;
};
The textual representation of a
CallExpr includes its name, the callee and its
arguments.
void CallExpr::dump(size_t level) const {
std::cerr << indent(level) << "CallExpr:\n";
callee->dump(level + 1);
for (auto &&arg : arguments)
arg->dump(level + 1);
}
Being the only postfix expression in the language, this node
is parsed in the
parsePostfixExpr() method, which starts with
parsing a primary expression and returning it if it's not
followed by a (.
std::unique_ptr<Expr> Parser::parsePostfixExpr() {
varOrReturn(expr, parsePrimary());
if (nextToken.kind != TokenKind::Lpar)
return expr;
...
}
If the next token is a (, the source location
of the call is the location of that token while the
argumentList is parsed by a separate helper
method and the previously parsed primary acts as the callee.
std::unique_ptr<Expr> Parser::parsePostfixExpr() {
...
SourceLocation location = nextToken.location;
varOrReturn(argumentList, parseArgumentList());
return std::make_unique<CallExpr>(location, std::move(expr),
std::move(*argumentList));
}
The parseArgumentList() helper is responsible
for handling the argument list, which doesn't have an AST
node. Unlike the core parser functions, this helper doesn't
expect the next token to already be the first one in the
rule, instead, it validates the token itself. To keep the
interface of the parser uniform, this helper also returns a
std::unique_ptr<>.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
matchOrReturn(TokenKind::Lpar, "expected '('");
eatNextToken(); // eat '('
...
}
After consuming the (, the tokens are processed
until a ) is encountered, which indicates the
end of the argument list.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
...
while (true) {
if (nextToken.kind == TokenKind::Rpar)
break;
...
}
...
}
If the next token is not a ), the parser keeps
gathering the expressions in an argument list until one of
them is not followed by a ,. Since arguments
are separated by commas, if the , is missing
after one of them, it is assumed to be the last expression
in the list.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
...
std::vector<std::unique_ptr<Expr>> argumentList;
while (true) {
...
varOrReturn(expr, parseExpr());
argumentList.emplace_back(std::move(expr));
if (nextToken.kind != TokenKind::Comma)
break;
eatNextToken(); // eat ','
}
...
}
Finally, the end of the argument list is validated and if everything is successful, the parsed arguments are returned.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() Parser::parseArgumentList() {
...
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
return std::make_unique<std::vector<std::unique_ptr<Expr>>>(std::move(argumentList));
}
For now, in the grammar, the expr rule acts as
a wrapper around postfixExpression.
<expr>
::= <postfixExpression>
The parser attempts to model the grammar accurately, so
parseExpr() also acts as a wrapper around
parsePostfixExpr().
std::unique_ptr<Expr> Parser::parseExpr() {
return parsePostfixExpr();
}
This function is called from parseStmt() which
also ensures that a ; is placed at the end of
the expression.
std::unique_ptr<Stmt> Parser::parseStmt() {
...
varOrReturn(expr, parseExpr());
matchOrReturn(TokenKind::Semi, "expected ';' at the end of expression");
eatNextToken(); // eat ';'
return expr;
}
With the added support for CallExpr, functions
should also be extended with parameters. A parameter in the
grammar is simply an identifier followed by a
: and a type.
<paramDecl>
::= <identifier> ':' <type>
A parameter is represented by a ParamDecl node,
which in addition to the fields in the
Decl base class, also stores the type of the
parameter.
struct ParamDecl : public Decl {
Type type;
ParamDecl(SourceLocation location, std::string identifier, Type type)
: Decl(location, std::move(identifier)),
type(std::move(type)) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node is the name of the node followed by the identifier and the type of the declared parameter.
void ParamDecl::dump(size_t level) const {
std::cerr << indent(level) << "ParamDecl: " << identifier << ':' << type.name
<< '\n';
}
Parsing this declaration is just checking if the three tokens follow each other in the correct order.
std::unique_ptr<ParamDecl> Parser::parseParamDecl() {
SourceLocation location = nextToken.location;
std::string identifier = *nextToken.value;
eatNextToken(); // eat identifier
matchOrReturn(TokenKind::Colon, "expected ':'");
eatNextToken(); // eat :
varOrReturn(type, parseType());
return std::make_unique<ParamDecl>(location, std::move(identifier),
std::move(*type));
}
With the introduction of the Number token, the
number type can also be parsed, which is the
only supported type for function parameters.
std::optional<Type> Parser::parseType() {
...
if (kind == TokenKind::KwNumber) {
eatNextToken(); // eat 'number'
return Type::builtinNumber();
}
...
};
The grammar of the function declaration is updated to
accommodate the parameter declarations. The
'(' and ')' characters are moved
from the functionDecl to the
parameterList rule. Trailing commas are allowed
at the end of a parameterList too.
<functionDecl>
::= 'fn' <identifier> <parameterList> ':' <type> <block>
<parameterList>
::= '(' (<paramDecl> (',' <paramDecl>)* ','?)? ')'
In the FunctionDecl node, the parameters are
stored in a std::vector.
struct FunctionDecl : public Decl {
...
std::vector<std::unique_ptr<ParamDecl>> params;
...
FunctionDecl(...,
std::vector<std::unique_ptr<ParamDecl>> params,
...)
: ...,
params(std::move(params)),
... {}
...
}
They are also included in the textual representation of the node before the body.
void FunctionDecl::dump(size_t level) const {
...
for (auto &¶m : params)
param->dump(level + 1);
body->dump(level + 1);
}
Instead of matching ( and ), the
parseFunctionDecl() method parses a parameter
list, which is later used in the creation of the
FunctionDecl node.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
eatNextToken(); // eat identifier
varOrReturn(parameterList, parseParameterList());
matchOrReturn(TokenKind::Colon, "expected ':'");
...
return std::make_unique<FunctionDecl>(...,
std::move(*parameterList),
...);
}
The content of parseParameterList() is almost
identical to parseArgumentList(). First, it
ensures that the next token at the time of calling the
method is a (.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList( Parser::parseParameterList() {
matchOrReturn(TokenKind::Lpar, "expected '('");
eatNextToken(); // eat '('
...
}
Then similarly to expressions in argument lists, every
ParamDecl is collected into a list.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList() {
...
std::vector<std::unique_ptr<ParamDecl>> parameterList;
while (true) {
if (nextToken.kind == TokenKind::Rpar)
break;
matchOrReturn(TokenKind::Identifier, "expected parameter declaration");
varOrReturn(paramDecl, parseParamDecl());
parameterList.emplace_back(std::move(paramDecl));
if (nextToken.kind != TokenKind::Comma)
break;
eatNextToken(); // eat ','
}
...
}
Finally the closing ) is matched and the
parameter list is returned.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList() {
...
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
return std::make_unique<std::vector<std::unique_ptr<ParamDecl>>>(std::move(parameterList));
}
Now that the language has more elements besides function declarations, it's worth dedicating some time to make the error recovery smarter.
With the current modification, there are 4 tokens, that can act as synchronization points in case of an error.
KwFn};Eof
The parser synchronizes on these points after failing to
parse a statement inside a Block. In that case,
the token on which the parser is synchronized is
automatically selected by the
synchronize() method.
std::unique_ptr<Block> Parser::parseBlock() {
...
while (true) {
...
if (!stmt) {
synchronize();
continue;
}
...
}
...
}
The synchronize() method flags the current AST
as incomplete and keeps eating tokens until one of the
previously listed synchronization points is encountered.
void Parser::synchronize() {
incompleteAST = true;
...
while (true) {
...
eatNextToken();
}
}
If the parser fails inside a Block, the
synchronization point is either the end of that block or the
token after the end of the first direct nested block.
Indirect nested blocks are ignored.
fn foo(): void { fail } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐
│ ... │ │ { │ │ ... │ │ } │ │ ... │
└─────┘ └───┘ └─────┘ └───┘ └─────┘
^ ^
| └ synchronize here
└ fail here
fn foo(): void { if fail { stmt; } stmt; } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐ ┌───┐
│ ... │ │ { │ │ ... │ │ { │ │ ... │ │ } │ │ ... │ │ } │
└─────┘ └───┘ └─────┘ └───┘ └─────┘ └───┘ └─────┘ └───┘
^ ^
| synchronize here ┘
└ fail here
fn foo(): void { if fail { if cond {} } stmt; } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌───┐ ┌───┐ ┌─────┐ ┌───┐
│ ... │ │ { │ │ ... │ │ { │ │ } │ │ } │ │ ... │ │ } │
└─────┘ └───┘ └─────┘ └───┘ └───┘ └───┘ └─────┘ └───┘
^ ^ ^
| ignore ┘ |
└ fail here synchronize here ┘
To achieve this, the parser keeps track of how many
{ it has seen so far.
void Parser::synchronize() {
...
int braces = 0;
while (true) {
TokenKind kind = nextToken.kind;
if (kind == TokenKind::Lbrace) {
++braces;
}
...
}
}
Once a } is encountered and the number of
braces is 0, the parser is at the end of the
block, so it's synchronized.
void Parser::synchronize() {
...
while (true) {
...
else if (kind == TokenKind::Rbrace) {
if (braces == 0)
break;
...
}
...
}
}
If the number of braces is
1, it is the end of the first direct nested
block, so the token is eaten and the parser is at the start
of the next statement.
void Parser::synchronize() {
...
else if (kind == TokenKind::Rbrace) {
...
if (braces == 1) {
eatNextToken(); // eat '}'
break;
}
...
}
...
}
Otherwise, the parser is inside an indirect nested block, so
it skips it by decreasing braces by one.
void Parser::synchronize() {
...
else if (kind == TokenKind::Rbrace) {
...
--braces;
}
...
}
If a statement fails, synchronization happens after the next
; unless the ; is in a nested
block.
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌───┐ ┌─────┐
│ ... │ │ { │ │ ... │ │ ; │ │ } │ │ ... │
└─────┘ └───┘ └─────┘ └───┘ └───┘ └─────┘
^ ^
| synchronize here ┘
└ fail here
┌─────┐ ┌───┐ ┌─────┐
│ ... │ │ ; │ │ ... │
└─────┘ └───┘ └─────┘
^ ^
| └ synchronize here
└ fail here
In other words, if the next token is ; and the
number of { seen is 0.
void Parser::synchronize() {
...
while (true) {
...
} else if (kind == TokenKind::Semi && braces == 0) {
eatNextToken(); // eat ';'
break;
}
...
}
If neither of the above conditions are met the parser
synchronizes on KwFn or Eof.
void Parser::synchronize() {
...
while (true) {
...
} else if (kind == TokenKind::KwFn || kind == TokenKind::Eof)
break;
...
}