There is not much to be achieved in a language that only allows function declarations with no parameters and an empty body on the top level of the source file. To be able to create an executable with a series of instructions, the language is extended with a few other elements such as function parameters, a built-in print function, number literals and return statements.
fn wrapper(n: number): number {
return n;
}
fn main(): void {
println(wrapper(12.34));
}
In a programming language, an element that evaluates to some
value is called an expression. For example,
1 + 1
is an expression, as it evaluates to
2
however, return n
doesn't
evaluate to anything, so it's not an expression. It only
changes the state of the program as it overrides where the
execution should continue, hence it is called a statement.
Every statement is derived from the Stmt
node, which stores a
SourceLocation
and provides a
dump()
method that the derived nodes must
override to print their textual representation.
struct Stmt {
SourceLocation location;
Stmt(SourceLocation location)
: location(location) {}
virtual ~Stmt() = default;
virtual void dump(size_t level = 0) const = 0;
};
By definition, every expression is also a statement. When
1 + 1
is evaluated, the result needs to be
stored either in memory or in a register before it can be
processed further, which is also a modification of the
program state. The Expr
node acts as the
placeholder node for every expression in the language.
struct Expr : public Stmt {
Expr(SourceLocation location)
: Stmt(location) {}
};
A statement is allowed to span across multiple lines, which
can make it difficult for the parser to figure out where it
ends. To make parsing statements easier, a
';'
must be placed at the end of each
statement.
Design Note
If a language doesn't require a
';'
at the end of a statement, like Python, Scala, etc. it's the responsibility of the parser to figure out where that statement ends. In some cases, however, it is not a trivial problem. Is the following snippet a function call likefoo(bar);
or two separate expressions likefoo; (bar)
?foo (bar)
Python treats every
'\n'
as an end of a statement except for some cases like anif
statement, which only ends afterstmt3
and not right after the:
.if cond: stmt1 stmt2 stmt3
In Scala, on the other hand, the grammar defines a complex set of rules, when a
'\n'
terminates a statement and when it is ignored. These rules can be found in the lexical syntax specification of Scala.
The grammar is updated such that statements are only allowed inside blocks and every block is allowed to contain 0 or more statements.
<block>
::= '{' <statement>* '}'
The Block
node is also updated to store a list
of statements.
struct Block {
...
std::vector<std::unique_ptr<Stmt>> statements;
Block(..., std::vector<std::unique_ptr<Stmt>> statements)
: ...,
statements(std::move(statements)) {}
...
};
In the textual representation of a block, each of these statements is listed too.
void Block::dump(size_t level) const {
...
for (auto &&stmt : statements)
stmt->dump(level + 1);
}
Parsing the list of statements within the block is done
using a while
loop, which iterates until it
sees the }
token, which indicates the end of
the block.
std::unique_ptr<Block> Parser::parseBlock() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat '{'
while (true) {
if (nextToken.kind == TokenKind::Rbrace)
break;
...
}
...
}
If the Eof
or KwFn
tokens are seen
the block is most likely not closed by a }
, so
an error is reported and a nullptr
is returned.
std::unique_ptr<Block> Parser::parseBlock() {
...
while (true) {
...
if (nextToken.kind == TokenKind::Eof || nextToken.kind == TokenKind::KwFn)
return report(nextToken.location, "expected '}' at the end of a block");
...
}
...
}
If none of the previous cases matches, the parser attempts
to parse a Stmt
. If it fails the parser is left
in an invalid state and immediately stops parsing the
Block
. Otherwise, the Stmt
is
inserted into a statement list.
std::unique_ptr<Block> Parser::parseBlock() {
...
std::vector<std::unique_ptr<Stmt>> statements;
while (true) {
...
varOrReturn(stmt, parseStmt());
statements.emplace_back(std::move(stmt));
}
...
}
If everything is successful, the }
at the end
of the block is consumed and a Block
node is
returned.
std::unique_ptr<Block> Parser::parseBlock() {
...
eatNextToken(); // eat '}'
return std::make_unique<Block>(location, std::move(statements));
}
So far the language only supports two kinds of statements, the return statement and expressions.
<statement>
::= <returnStmt>
| <expr> ';'
A return statement can either appear inside a void function or a non-void one. If it's in the latter, the return keyword is also followed by an expression, whose result is returned.
<returnStmt>
::= 'return' <expr>? ';'
These new rules introduce the ;
token as well
as the return
keyword which are currently
unknown to the lexer. To proceed further, the lexer needs to
be extended to handle these tokens.
constexpr char singleCharTokens[] = {..., ';'};
enum class TokenKind : char {
...
KwReturn,
...
Semi = singleCharTokens[6],
};
const std::unordered_map<std::string_view, TokenKind> keywords = {
...
{"return", TokenKind::KwReturn}};
The AST node of the ReturnStmt
inherits from
the Stmt
class, and contains an
Expr
child node.
struct ReturnStmt : public Stmt {
std::unique_ptr<Expr> expr;
ReturnStmt(SourceLocation location, std::unique_ptr<Expr> expr = nullptr)
: Stmt(location),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
The textual representation of this node includes the name of the statement and its contained expression if there is one.
void ReturnStmt::dump(size_t level) const {
std::cerr << indent(level) << "ReturnStmt\n";
if (expr)
expr->dump(level + 1);
}
The parseReturnStmt()
method expects the first
token to be KwReturn
and stores its location to
mark the beginning of the statement before eating it.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
SourceLocation location = nextToken.location;
eatNextToken(); // eat 'return'
...
}
If the token after KwReturn
is not a
;
it must be the start of an expression. If the
parsing of that expression fails, the return statement is
invalid and the parser errors out.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
...
std::unique_ptr<Expr> expr;
if (nextToken.kind != TokenKind::Semi) {
expr = parseExpr();
if (!expr)
return nullptr;
}
...
}
Otherwise, the upcoming token is expected to be the
;
marking the end of the return statement. If
the token matches, everything is successful and the
ReturnStmt
node is returned.
std::unique_ptr<ReturnStmt> Parser::parseReturnStmt() {
...
matchOrReturn(TokenKind::Semi,
"expected ';' at the end of a return statement");
eatNextToken(); // eat ';'
return std::make_unique<ReturnStmt>(location, std::move(expr));
}
The driver code for parsing statements lives in the
parseStmt()
method. This is the method that
checks the next token and decides how the parser should
proceed. If it sees that the next token is
KwReturn
it dispatches
parseReturnStmt()
to return the corresponding
node.
std::unique_ptr<Stmt> Parser::parseStmt() {
if (nextToken.kind == TokenKind::KwReturn)
return parseReturnStmt();
...
}
At the moment the language has support for 3 expressions.
numberLiteral
for numeric values written in
the source code.
declRefExpr
for referencing declarations.
callExpr
for calling functions.
The numberLiteral
and the
declRefExpr
are considered primary expressions.
<primaryExpression>
::= <numberLiteral>
| <declRefExpr>
Primary means that these are the expressions that every other expression is composed of.
1 + x
^ ^
| └ DeclRefExpr
└ NumberLiteral
A numberLiteral
in the source code is a
number
, which is a list of digits optionally
separated by a '.'
and another list of digits.
<numberLiteral>
::= <number>
<number>
::= ('0'..'9')+ ('.' ('0'..'9')+)?
A number
in the compiler is represented by a
dedicated Number
token.
enum class TokenKind : char {
...
Number,
...
};
When the lexer encounters a numeric character it keeps concatenating the upcoming characters until it sees a non-numeric one.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
std::string value{currentChar};
while (isNum(peekNextChar()))
value += eatNextChar();
...
}
...
}
If the seen non-numeric character is not a '.'
,
a Number
token is returned with the eaten
digits as its value. Otherwise, the lexer proceeds with
concatenating the decimal point to the number.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
...
if (peekNextChar() != '.')
return Token{tokenStartLocation, TokenKind::Number, value};
value += eatNextChar();
...
}
...
}
If the character immediately following the
'.'
is not a number, an Unk
token
is returned. Otherwise, the fractional part of the number is
consumed and the lexer returns the corresponding
Number
token.
Token TheLexer::getNextToken() {
...
if (isNum(currentChar)) {
...
if (!isNum(peekNextChar()))
return Token{tokenStartLocation, TokenKind::Unk};
while (isNum(peekNextChar()))
value += eatNextChar();
return Token{tokenStartLocation, TokenKind::Number, value};
}
...
}
In the AST a NumberLiteral
is an
Expr
, which holds the textual value of the
number.
struct NumberLiteral : public Expr {
std::string value;
NumberLiteral(SourceLocation location, std::string value)
: Expr(location),
value(value) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node is the name of the expression followed by the contained value.
void NumberLiteral::dump(size_t level) const {
std::cerr << indent(level) << "NumberLiteral: '" << value << "'\n";
}
To keep the parser simple, this node doesn't have a
dedicated parser method, instead it is parsed directly in
parsePrimary()
.
std::unique_ptr<Expr> Parser::parsePrimary() {
SourceLocation location = nextToken.location;
if (nextToken.kind == TokenKind::Number) {
auto literal = std::make_unique<NumberLiteral>(location, *nextToken.value);
eatNextToken(); // eat number
return literal;
}
...
}
A declRefExpr
is a simple addition as it is
only a wrapper around an identifier
. Its rule
doesn't introduce any tokens, the lexer can't already
handle.
<declRefExpr>
::= <identifier>
The DeclRefExpr
node is derived from
Expr
and stores the identifier that is being
referenced.
struct DeclRefExpr : public Expr {
std::string identifier;
DeclRefExpr(SourceLocation location, std::string identifier)
: Expr(location),
identifier(identifier) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node contains its name along with the referenced identifier.
void DeclRefExpr::dump(size_t level) const {
std::cerr << indent(level) << "DeclRefExpr: " << identifier << '\n';
}
Because the parser only needs to eat one token, this node is
also parsed in parsePrimary()
.
std::unique_ptr<Expr> Parser::parsePrimary() {
...
if (nextToken.kind == TokenKind::Identifier) {
auto declRefExpr =
std::make_unique<DeclRefExpr>(location, *nextToken.value);
eatNextToken(); // eat identifier
return declRefExpr;
}
...
}
There are no other primary expressions for now, so if
parsePrimary()
sees any other token it reports
an error.
std::unique_ptr<Expr> Parser::parsePrimary() {
...
return report(location, "expected expression");
}
The callExpr
is categorized as a postfix
expression because, in the source code it is an argument
list standing behind a primary expression.
<postfixExpression>
::= <primaryExpression> <argumentList>
<argumentList>
::= '(' (<expr> (',' <expr>)* ','?)? ')'
The arguments of a function call are expressions, so
variable references and values returned by other function
calls can also be passed on to a different function like
foo(0, x, bar());
.
Design Note
A question might arise, why is there an optional
','
at the end of the argument list? That is called a trailing comma and allows developers to write snippets likefoo(1.0, x,);
.This feature can come in handy when the argument list spans across multiple lines and a new argument needs to be added later. Because the previous line doesn't have to be modified, the version control diff is smaller.
foo( 0, 1, + 2, );
It also makes it easier to reorder the arguments or generate source files with function calls if needed.
At the end of the day using the trailing comma is up to personal preferences, but since they have more than one pros and no cons, there is no reason not to support them.
The argumentList
rule also introduces a new
single character token, the ,
which needs to be
taught to the lexer.
constexpr char singleCharTokens[] = {..., ','};
enum class TokenKind : char {
...
Comma = singleCharTokens[7],
};
The CallExpr
node stores the callee and a
vector of expressions for the arguments.
struct CallExpr : public Expr {
std::unique_ptr<Expr> callee;
std::vector<std::unique_ptr<Expr>> arguments;
CallExpr(SourceLocation location,
std::unique_ptr<Expr> callee,
std::vector<std::unique_ptr<Expr>> arguments)
: Expr(location),
callee(std::move(callee)),
arguments(std::move(arguments)) {}
void dump(size_t level = 0) const override;
};
The textual representation of a
CallExpr
includes its name, the callee and its
arguments.
void CallExpr::dump(size_t level) const {
std::cerr << indent(level) << "CallExpr:\n";
callee->dump(level + 1);
for (auto &&arg : arguments)
arg->dump(level + 1);
}
Being the only postfix expression in the language, this node
is parsed in the
parsePostfixExpr()
method, which starts with
parsing a primary expression and returning it if it's not
followed by a (
.
std::unique_ptr<Expr> Parser::parsePostfixExpr() {
varOrReturn(expr, parsePrimary());
if (nextToken.kind != TokenKind::Lpar)
return expr;
...
}
If the next token is a (
, the source location
of the call is the location of that token while the
argumentList
is parsed by a separate helper
method and the previously parsed primary acts as the callee.
std::unique_ptr<Expr> Parser::parsePostfixExpr() {
...
SourceLocation location = nextToken.location;
varOrReturn(argumentList, parseArgumentList());
return std::make_unique<CallExpr>(location, std::move(expr),
std::move(*argumentList));
}
The parseArgumentList()
helper is responsible
for handling the argument list, which doesn't have an AST
node. Unlike the core parser functions, this helper doesn't
expect the next token to already be the first one in the
rule, instead, it validates the token itself. To keep the
interface of the parser uniform, this helper also returns a
std::unique_ptr<>
.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
matchOrReturn(TokenKind::Lpar, "expected '('");
eatNextToken(); // eat '('
...
}
After consuming the (
, the tokens are processed
until a )
is encountered, which indicates the
end of the argument list.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
...
while (true) {
if (nextToken.kind == TokenKind::Rpar)
break;
...
}
...
}
If the next token is not a )
, the parser keeps
gathering the expressions in an argument list until one of
them is not followed by a ,
. Since arguments
are separated by commas, if the ,
is missing
after one of them, it is assumed to be the last expression
in the list.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() {
...
std::vector<std::unique_ptr<Expr>> argumentList;
while (true) {
...
varOrReturn(expr, parseExpr());
argumentList.emplace_back(std::move(expr));
if (nextToken.kind != TokenKind::Comma)
break;
eatNextToken(); // eat ','
}
...
}
Finally, the end of the argument list is validated and if everything is successful, the parsed arguments are returned.
std::unique_ptr<std::vector<std::unique_ptr<Expr>>>
Parser::parseArgumentList() Parser::parseArgumentList() {
...
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
return std::make_unique<std::vector<std::unique_ptr<Expr>>>(std::move(argumentList));
}
For now, in the grammar, the expr
rule acts as
a wrapper around postfixExpression
.
<expr>
::= <postfixExpression>
The parser attempts to model the grammar accurately, so
parseExpr()
also acts as a wrapper around
parsePostfixExpr()
.
std::unique_ptr<Expr> Parser::parseExpr() {
return parsePostfixExpr();
}
This function is called from parseStmt()
which
also ensures that a ;
is placed at the end of
the expression.
std::unique_ptr<Stmt> Parser::parseStmt() {
...
varOrReturn(expr, parseExpr());
matchOrReturn(TokenKind::Semi, "expected ';' at the end of expression");
eatNextToken(); // eat ';'
return expr;
}
With the added support for CallExpr
, functions
should also be extended with parameters. A parameter in the
grammar is simply an identifier
followed by a
:
and a type
.
<paramDecl>
::= <identifier> ':' <type>
A parameter is represented by a ParamDecl
node,
which in addition to the fields in the
Decl
base class, also stores the type of the
parameter.
struct ParamDecl : public Decl {
Type type;
ParamDecl(SourceLocation location, std::string identifier, Type type)
: Decl(location, std::move(identifier)),
type(std::move(type)) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node is the name of the node followed by the identifier and the type of the declared parameter.
void ParamDecl::dump(size_t level) const {
std::cerr << indent(level) << "ParamDecl: " << identifier << ':' << type.name
<< '\n';
}
Parsing this declaration is just checking if the three tokens follow each other in the correct order.
std::unique_ptr<ParamDecl> Parser::parseParamDecl() {
SourceLocation location = nextToken.location;
std::string identifier = *nextToken.value;
eatNextToken(); // eat identifier
matchOrReturn(TokenKind::Colon, "expected ':'");
eatNextToken(); // eat :
varOrReturn(type, parseType());
return std::make_unique<ParamDecl>(location, std::move(identifier),
std::move(*type));
}
With the introduction of the Number
token, the
number
type can also be parsed, which is the
only supported type for function parameters.
std::optional<Type> Parser::parseType() {
...
if (kind == TokenKind::KwNumber) {
eatNextToken(); // eat 'number'
return Type::builtinNumber();
}
...
};
The grammar of the function declaration is updated to
accommodate the parameter declarations. The
'('
and ')'
characters are moved
from the functionDecl
to the
parameterList
rule. Trailing commas are allowed
at the end of a parameterList
too.
<functionDecl>
::= 'fn' <identifier> <parameterList> ':' <type> <block>
<parameterList>
::= '(' (<paramDecl> (',' <paramDecl>)* ','?)? ')'
In the FunctionDecl
node, the parameters are
stored in a std::vector
.
struct FunctionDecl : public Decl {
...
std::vector<std::unique_ptr<ParamDecl>> params;
...
FunctionDecl(...,
std::vector<std::unique_ptr<ParamDecl>> params,
...)
: ...,
params(std::move(params)),
... {}
...
}
They are also included in the textual representation of the node before the body.
void FunctionDecl::dump(size_t level) const {
...
for (auto &¶m : params)
param->dump(level + 1);
body->dump(level + 1);
}
Instead of matching (
and )
, the
parseFunctionDecl()
method parses a parameter
list, which is later used in the creation of the
FunctionDecl
node.
std::unique_ptr<FunctionDecl> Parser::parseFunctionDecl() {
...
eatNextToken(); // eat identifier
varOrReturn(parameterList, parseParameterList());
matchOrReturn(TokenKind::Colon, "expected ':'");
...
return std::make_unique<FunctionDecl>(...,
std::move(*parameterList),
...);
}
The content of parseParameterList()
is almost
identical to parseArgumentList()
. First, it
ensures that the next token at the time of calling the
method is a (
.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList( Parser::parseParameterList() {
matchOrReturn(TokenKind::Lpar, "expected '('");
eatNextToken(); // eat '('
...
}
Then similarly to expressions in argument lists, every
ParamDecl
is collected into a list.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList() {
...
std::vector<std::unique_ptr<ParamDecl>> parameterList;
while (true) {
if (nextToken.kind == TokenKind::Rpar)
break;
matchOrReturn(TokenKind::Identifier, "expected parameter declaration");
varOrReturn(paramDecl, parseParamDecl());
parameterList.emplace_back(std::move(paramDecl));
if (nextToken.kind != TokenKind::Comma)
break;
eatNextToken(); // eat ','
}
...
}
Finally the closing )
is matched and the
parameter list is returned.
std::unique_ptr<std::vector<std::unique_ptr<ParamDecl>>>
Parser::parseParameterList() {
...
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
return std::make_unique<std::vector<std::unique_ptr<ParamDecl>>>(std::move(parameterList));
}
Now that the language has more elements besides function declarations, it's worth dedicating some time to make the error recovery smarter.
With the current modification, there are 4 tokens, that can act as synchronization points in case of an error.
KwFn
}
;
Eof
The parser synchronizes on these points after failing to
parse a statement inside a Block
. In that case,
the token on which the parser is synchronized is
automatically selected by the
synchronize()
method.
std::unique_ptr<Block> Parser::parseBlock() {
...
while (true) {
...
if (!stmt) {
synchronize();
continue;
}
...
}
...
}
The synchronize()
method flags the current AST
as incomplete and keeps eating tokens until one of the
previously listed synchronization points is encountered.
void Parser::synchronize() {
incompleteAST = true;
...
while (true) {
...
eatNextToken();
}
}
If the parser fails inside a Block
, the
synchronization point is either the end of that block or the
token after the end of the first direct nested block.
Indirect nested blocks are ignored.
fn foo(): void { fail } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐
│ ... │ │ { │ │ ... │ │ } │ │ ... │
└─────┘ └───┘ └─────┘ └───┘ └─────┘
^ ^
| └ synchronize here
└ fail here
fn foo(): void { if fail { stmt; } stmt; } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌─────┐ ┌───┐
│ ... │ │ { │ │ ... │ │ { │ │ ... │ │ } │ │ ... │ │ } │
└─────┘ └───┘ └─────┘ └───┘ └─────┘ └───┘ └─────┘ └───┘
^ ^
| synchronize here ┘
└ fail here
fn foo(): void { if fail { if cond {} } stmt; } fn ...
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌───┐ ┌───┐ ┌─────┐ ┌───┐
│ ... │ │ { │ │ ... │ │ { │ │ } │ │ } │ │ ... │ │ } │
└─────┘ └───┘ └─────┘ └───┘ └───┘ └───┘ └─────┘ └───┘
^ ^ ^
| ignore ┘ |
└ fail here synchronize here ┘
To achieve this, the parser keeps track of how many
{
it has seen so far.
void Parser::synchronize() {
...
int braces = 0;
while (true) {
TokenKind kind = nextToken.kind;
if (kind == TokenKind::Lbrace) {
++braces;
}
...
}
}
Once a }
is encountered and the number of
braces is 0
, the parser is at the end of the
block, so it's synchronized.
void Parser::synchronize() {
...
while (true) {
...
else if (kind == TokenKind::Rbrace) {
if (braces == 0)
break;
...
}
...
}
}
If the number of braces is
1
, it is the end of the first direct nested
block, so the token is eaten and the parser is at the start
of the next statement.
void Parser::synchronize() {
...
else if (kind == TokenKind::Rbrace) {
...
if (braces == 1) {
eatNextToken(); // eat '}'
break;
}
...
}
...
}
Otherwise, the parser is inside an indirect nested block, so
it skips it by decreasing braces
by one.
void Parser::synchronize() {
...
else if (kind == TokenKind::Rbrace) {
...
--braces;
}
...
}
If a statement fails, synchronization happens after the next
;
unless the ;
is in a nested
block.
┌─────┐ ┌───┐ ┌─────┐ ┌───┐ ┌───┐ ┌─────┐
│ ... │ │ { │ │ ... │ │ ; │ │ } │ │ ... │
└─────┘ └───┘ └─────┘ └───┘ └───┘ └─────┘
^ ^
| synchronize here ┘
└ fail here
┌─────┐ ┌───┐ ┌─────┐
│ ... │ │ ; │ │ ... │
└─────┘ └───┘ └─────┘
^ ^
| └ synchronize here
└ fail here
In other words, if the next token is ;
and the
number of {
seen is 0
.
void Parser::synchronize() {
...
while (true) {
...
} else if (kind == TokenKind::Semi && braces == 0) {
eatNextToken(); // eat ';'
break;
}
...
}
If neither of the above conditions are met the parser
synchronizes on KwFn
or Eof
.
void Parser::synchronize() {
...
while (true) {
...
} else if (kind == TokenKind::KwFn || kind == TokenKind::Eof)
break;
...
}