The last step of the compilation pipeline on the frontend is reasoning about the meaning of the source code. A snippet might be accepted by the parser, but it still has no meaning.
fn wrong(x: number): void {
println(y);
}
The above snippet is correct according to the grammar,
however, it's unknown what y
refers to. Is it a
parameter? Is it a function, which the developer forgot to
call? The semantic analyzer of a compiler is searching for
the answer to these questions.
The process of figuring out what a name in the source code refers to is called name resolution.
fn printX(x: number): void {
^
│
println(x); // 'x' refers to the parameter 'x'
}
The idea is to traverse the AST and record every identifier introduced by a declaration into a so-called symbol table. The identifiers can also be called symbols, hence the symbol table name.
When a reference to a symbol is encountered, it is checked whether it has already been recorded. If the symbol is not found in the symbol table it's assumed the developer forgot to declare it and the compiler reports an error.
fn function( // symbols: ['function']
x: number, // symbols: ['function', 'x']
y: number, // symbols: ['function', 'x', 'y']
): void {
x + z;
^ error: 'z' is not a known symbol
}
Design Note
The difficulty of this problem can vary based on how the language is designed. C and C++ require every symbol to be declared before it can be referenced, so in those languages, the source file can be traversed from top to bottom and when a symbol is not found an error can be safely reported.
Other languages such as Kotlin or JavaScript (and your language) however allow the use of a symbol before it is declared. The following snippet for example is a valid Kotlin code.
fun main() { foo() } fun foo() {}
Similarly, this other snippet is valid in JavaScript.
foo() function foo() {}
JavaScript also introduces a mechanism called hoisting, when
function
andvar
declarations are moved to the top of the source file before the source file is executed.Still even moving every declaration to the top of the source file doesn't work in case of 2 functions are referencing each other in their bodies.
function foo() { bar() } function bar() { foo() }
C and C++ solve this problem with the use of forward declarations, as in these languages a function can be declared any number of times, but it is only allowed to be defined exactly once.
void foo(); void bar() { foo(); } void foo() { bar(); }
One drawback of forward declarations is that the compiler has no way to figure out whether a function is defined or not. Catching that error happens after compilation in the linker.
In languages where forward declarations are not allowed, to properly tackle this problem, name resolution must be done in multiple passes.
Languages also define scopes, which limit the visibility of symbols. For example in C++ or Kotlin variables defined in the body of a condition are not visible outside that body.
void foo() {
if(true)
int x = 0;
x; // error: 'x' was not declared in this scope
}
In Python, however, the variable is visible until the end of the function.
def fun():
if True:
x = 1
print(x) # 'x' is visible
In your language function symbols declared on the top level of the source file live inside the global scope, which makes them visible to every part of the source file.
┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴────────────┐
│ fn x(n: number): void { println(n); } │
│ │
│ fn y(x: number): void { println(x); } │
└───────────────────────────────────────┘
Parameters on the other hand are only visible inside the parameter list they are declared in and the body of the corresponding function. The parameter of a specific function is not visible in any other function.
┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴─────────────────┐
│ ┌────────────────────┐ │
│ │ Param Scope: ['n'] │ │
│ ├────────────────────┴──────────────┐ │
│ fn x │ (n: number): void { println(n); } │ │
│ └───────────────────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ Param Scope: ['x'] │ │
│ ├────────────────────┴──────────────┐ │
│ fn y │ (x: number): void { println(x); } │ │
│ └───────────────────────────────────┘ │
└────────────────────────────────────────────┘
Blocks also get their scope, so that everything that's declared inside a block is not visible outside of it.
┌──────────────────────────┐
│ Global Scope: ['x', 'y'] │
├──────────────────────────┴─────────────────────┐
│ ┌────────────────────┐ │
│ │ Param Scope: ['n'] │ │
│ ├────────────────────┴──────────────────┐ │
│ │ ┌─────────────────┐ │ │
│ │ │ Block scope: [] │ │ │
│ │ ├─────────────────┤ │ │
│ fn x │ (n: number): void │ { println(n); } │ │ │
│ │ └─────────────────┘ │ │
│ └───────────────────────────────────────┘ │
│ │
│ ┌────────────────────┐ │
│ │ Param Scope: ['x'] │ │
│ ├────────────────────┴──────────────────┐ │
│ │ ┌─────────────────┐ │ │
│ │ │ Block Scope: [] │ │ │
│ │ ├─────────────────┤ │ │
│ fn y │ (x: number): void │ { println(x); } │ │ │
│ │ └─────────────────┘ │ │
│ └───────────────────────────────────────┘ │
└────────────────────────────────────────────────┘
Because scopes are nested inside each other and each scope needs to be aware of the declarations it contains, they are usually modelled as symbol tables linked after each other.
['x', 'y'] <- ['n'] <- []
^
└─ ['x'] <- []
During lookup, the scopes are scanned from the inner-most one to the outer-most one, so it's always the declaration in the deepest scope found first.
Lookup 'x' in the body of 'fn x()':
['x', 'y'] <- ['n'] <- [] <-- lookup from right to left
^
Lookup 'x' in the body of 'fn y()':
['x', 'y'] <- ['x'] <- [] <-- lookup from right to left
^
Design Note
The previously described scope is also called lexical scope as the visibility of the variables depends on the place where they are created in source code.
For this reason in the case of functions, symbol tables can be created and destroyed as the semantic analyzer traverses them from top to bottom, however, other elements such as classes require their symbol tables to be bound to the class itself. The reason for this is inheritance.
struct Base { int x; void foo(); }; struct Derived : public Base {};
Because
Derived
inherits fromBase
, it must have access to the symbols ofBase
. If the symbol table ofBase
is destroyed when the semantic analyzer finishes processing the class, the information is lost.
If a symbol at the point of a declaration is already inserted into the same scope, it's called a redeclaration. Inside the compiler, it is an error, as there is no way to know which of the symbols a reference points to.
// symbol tables: ['function'] <- ['x', 'x'] <- []
fn function(x: number, x: number): void {
^ ^
┌─┴──────────┘ // which 'x' is being printed?
println(x);
}
When the same symbol is already found in a higher symbol table it's called shadowing, as the deeper symbol is found first by the compiler and it hides the higher symbol.
// symbol tables: ['x', 'y'] <- ['x'] <- []
fn x(): void {}
^
└─┐ // the parameter shadows the function
fn y(x: number): void {
^
┌──┘ // the closest 'x' symbol is the parameter
x;
}
After resolution, the compiler should have an intermediate
representation that contains information about the meaning
of the source code. For example, a
DeclRefExpr
should no longer store a
std::string
that represents the identifier of
the declaration being referenced, but a pointer to the
corresponding Decl
node itself.
One approach to achieve this is to add additional fields to the current AST and fill those in during semantic analysis. The drawback is that nodes can be left in an invalid state by forgetting to set the correct information. Also, the AST becomes too heavy with multiple properties that are no longer needed after resolution.
Another approach is to build a dedicated representation that only contains the information needed after resolution. This is the approach being preferred in this compiler. The representation created is also a tree structure similar to the AST, in fact, it could also be called resolved AST or simply resolved tree.
The ResolvedStmt
node is identical to the AST
Stmt
as it also acts as the base class for
every resolved statement and stores the location of the node
and a dump()
method.
struct ResolvedStmt {
SourceLocation location;
ResolvedStmt(SourceLocation location)
: location(location) {}
virtual ~ResolvedStmt() = default;
virtual void dump(size_t level = 0) const = 0;
};
The ResolvedExpr
node is different as every
resolved expression must also have its type set.
struct ResolvedExpr : public ResolvedStmt {
Type type;
ResolvedExpr(SourceLocation location, Type type)
: ResolvedStmt(location),
type(type) {}
};
Similarly to resolved expressions, resolved declarations must also have their types set compared to their non-resolved counterparts.
struct ResolvedDecl {
SourceLocation location;
std::string identifier;
Type type;
ResolvedDecl(SourceLocation location, std::string identifier, Type type)
: location(location),
identifier(std::move(identifier)),
type(type) {}
virtual ~ResolvedDecl() = default;
virtual void dump(size_t level = 0) const = 0;
};
Since numbers are all double precision floating point
numbers, ResolvedNumberLiteral
stores a
double
instead of the textual representation of
its value.
struct ResolvedNumberLiteral : public ResolvedExpr {
double value;
ResolvedNumberLiteral(SourceLocation location, double value)
: ResolvedExpr(location, Type::builtinNumber()),
value(value) {}
void dump(size_t level = 0) const override;
};
Its textual representation also prints this value.
void ResolvedNumberLiteral::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedNumberLiteral: '" << value << "'\n";
}
ResolvedDeclRefExpr
stores a pointer to the
referenced ResolvedDecl
instead of its
identifier.
struct ResolvedDeclRefExpr : public ResolvedExpr {
const ResolvedDecl *decl;
ResolvedDeclRefExpr(SourceLocation location, ResolvedDecl &decl)
: ResolvedExpr(location, decl.type),
decl(&decl) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node also contains the address of the referenced declaration besides its identifier.
void ResolvedDeclRefExpr::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedDeclRefExpr: @(" << decl << ") "
<< decl->identifier << '\n';
}
The callee of a ResolvedCallExpr
is now a
pointer to a ResolvedFunctionDecl
instead of an
arbitrary expression.
struct ResolvedCallExpr : public ResolvedExpr {
const ResolvedFunctionDecl *callee;
std::vector<std::unique_ptr<ResolvedExpr>> arguments;
ResolvedCallExpr(SourceLocation location,
const ResolvedFunctionDecl &callee,
std::vector<std::unique_ptr<ResolvedExpr>> arguments)
: ResolvedExpr(location, callee.type),
callee(&callee),
arguments(std::move(arguments)) {}
void dump(size_t level = 0) const override;
};
In the textual representation of the node, the address of the function is printed too.
void ResolvedCallExpr::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedCallExpr: @(" << callee << ") "
<< callee->identifier << '\n';
for (auto &&arg : arguments)
arg->dump(level + 1);
}
The ResolvedBlock
node doesn't contain any
different information from its non-resolved counterpart.
struct ResolvedBlock {
SourceLocation location;
std::vector<std::unique_ptr<ResolvedStmt>> statements;
ResolvedBlock(SourceLocation location,
std::vector<std::unique_ptr<ResolvedStmt>> statements)
: location(location),
statements(std::move(statements)) {}
void dump(size_t level = 0) const;
};
Its textual representation is also identical to the AST
Block
node.
void ResolvedBlock::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedBlock\n";
for (auto &&stmt : statements)
stmt->dump(level + 1);
}
With moving more information to its base class,
ResolvedParamDecl
no longer contains any unique
fields.
struct ResolvedParamDecl : public ResolvedDecl {
ResolvedParamDecl(SourceLocation location, std::string identifier, Type type)
: ResolvedDecl{location, std::move(identifier), type} {}
void dump(size_t level = 0) const override;
};
The textual representation of the node also contains its address.
void ResolvedParamDecl::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedParamDecl: @(" << this << ") "
<< identifier << ':' << '\n';
}
ResolvedFunctionDecl
only stores a parameter
list and the body of the function. The rest of the fields
are stored in the ResolvedDecl
base class.
struct ResolvedFunctionDecl : public ResolvedDecl {
std::vector<std::unique_ptr<ResolvedParamDecl>> params;
std::unique_ptr<ResolvedBlock> body;
ResolvedFunctionDecl(SourceLocation location,
std::string identifier,
Type type,
std::vector<std::unique_ptr<ResolvedParamDecl>> params,
std::unique_ptr<ResolvedBlock> body)
: ResolvedDecl(location, std::move(identifier), type),
params(std::move(params)),
body(std::move(body)) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node contains its name, address, identifier, parameters and body.
void ResolvedFunctionDecl::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedFunctionDecl: @(" << this << ") "
<< identifier << ':' << '\n';
for (auto &¶m : params)
param->dump(level + 1);
body->dump(level + 1);
}
Identical to its non-resolved counterpart
ResolvedReturnStmt
stores only the optional
expression whose value is returned.
struct ResolvedReturnStmt : public ResolvedStmt {
std::unique_ptr<ResolvedExpr> expr;
ResolvedReturnStmt(SourceLocation location,
std::unique_ptr<ResolvedExpr> expr = nullptr)
: ResolvedStmt(location),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
Its textual representation is also almost identical to its non-resolved counterpart.
void ResolvedReturnStmt::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedReturnStmt\n";
if (expr)
expr->dump(level + 1);
}
The semantic analysis logic is encapsulated by the
Sema
class. It takes the AST returned by the
parser and attempts to resolve it.
class Sema {
std::vector<std::unique_ptr<FunctionDecl>> ast;
public:
explicit Sema(std::vector<std::unique_ptr<FunctionDecl>> ast)
: ast(std::move(ast)) {}
std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolveAST();
};
A scope is represented by a vector of
ResolvedDecl*
, while the chain of scopes is
represented by a vector of scopes.
class Sema {
...
std::vector<std::vector<ResolvedDecl *>> scopes;
...
};
To make working with scopes more convenient, a helper object
is introduced. The
ScopeRAII
makes managing the scopes natural by
creating a new scope in its constructor and destroying it in
its destructor.
class Sema {
...
class ScopeRAII {
Sema *sema;
public:
explicit ScopeRAII(Sema *sema)
: sema(sema) {
sema->scopes.emplace_back();
}
~ScopeRAII() { sema->scopes.pop_back(); }
};
...
};
The lookupDecl()
method looks up a symbol based
on its identifier and returns the
ResolvedDecl
and the index of the scope the
declaration was found in. The index helps to determine
whether an identifier collision is a redeclaration or a
shadowing.
Design Note
With a more complex type system, it's also necessary to use the type of the symbol as well during the lookup, as a user-defined type might have the same name as a local variable.
class C {}; int C;
To look up a symbol, first, the symbol tables are iterated in reverse order while keeping track of the scope index, so the inner-most table is checked first.
std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
int scopeIdx = 0;
for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
...
++scopeIdx;
}
...
}
For every scope, the declarations inside that scope are iterated and if the identifier of any declaration matches the searched identifier, the declaration and the index of the scope are returned.
std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
...
for (auto it = scopes.rbegin(); it != scopes.rend(); ++it) {
for (auto &&decl : *it) {
if (decl->identifier != id)
continue;
return {decl, scopeIdx};
}
...
}
...
}
If none of the scopes contains the declaration a
nullptr
is returned along with
-1
as the index of the containing scope.
std::pair<ResolvedDecl *, int> Sema::lookupDecl(const std::string id) {
...
return {nullptr, -1};
}
The insertDeclToCurrentScope()
method attempts
to insert the declaration into the inner-most scope. To
handle redeclarations, it first checks if a declaration with
the same identifier is already inserted into the current
scope chain.
bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
const auto &[foundDecl, scopeIdx] = lookupDecl(decl.identifier);
...
}
If the declaration is found in the inner-most scope an error
is reported for redeclaring the same symbol. The method also
returns false
in this case to indicate the
failure.
bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
...
if (foundDecl && scopeIdx == 0) {
report(decl.location, "redeclaration of '" + decl.identifier + '\'');
return false;
}
...
}
If the declaration is not found or it shadows another
declaration in an outer scope, it is inserted into the
inner-most one and true
is returned to indicate
success.
bool Sema::insertDeclToCurrentScope(ResolvedDecl &decl) {
...
scopes.back().emplace_back(&decl);
return true;
}
The source file is resolved in two passes. Because the order of the function declarations doesn't matter, first they are resolved without their bodies. This way all of the valid function symbols are inserted into the global scope.
Before the functions are resolved, however, the built-in
println()
function needs to be created and
inserted into the global scope. Creating this function first
allows the semantic analyzer to easily report an error if
the println()
is redeclared in the source code.
The resolveAST()
method creates the global
scope, the root of the resolved tree and inserts
println()
into both of them.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
std::vector<std::unique_ptr<ResolvedFunctionDecl>> resolvedTree;
auto println = createBuiltinPrintln();
ScopeRAII globalScope(this);
insertDeclToCurrentScope(*resolvedTree.emplace_back(std::move(println)));
...
}
The println()
function is created at the
<builtin>:0:0
location, with one parameter,
n: number
.
std::unique_ptr<ResolvedFunctionDecl> Sema::createBuiltinPrintln() {
SourceLocation loc{"<builtin>", 0, 0};
auto param =
std::make_unique<ResolvedParamDecl>(loc, "n", Type::builtinNumber());
std::vector<std::unique_ptr<ResolvedParamDecl>> params;
params.emplace_back(std::move(param));
...
};
The return type of the function is void
and its
body is empty. This function node in the resolved tree only
acts as a placeholder.
std::unique_ptr<ResolvedFunctionDecl> Sema::createBuiltinPrintln() {
...
auto block = std::make_unique<ResolvedBlock>(
loc, std::vector<std::unique_ptr<ResolvedStmt>>());
return std::make_unique<ResolvedFunctionDecl>(
loc, "println", Type::builtinVoid(), std::move(params), std::move(block));
};
After the built-in println()
function is
inserted into the resolved tree, the functions in the AST
are resolved one by one.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
...
for (auto &&fn : ast) {
auto resolvedFunctionDecl = resolveFunctionDeclaration(*fn);
...
}
...
}
If an error happens during the resolution of one of the functions or a function with the same name is declared multiple times, the error is recorded but the semantic analyzer keeps checking the other functions. This way independent errors are stacked together.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
...
bool error = false;
for (auto &&fn : ast) {
...
if (!resolvedFunctionDecl ||
!insertDeclToCurrentScope(*resolvedFunctionDecl)) {
error = true;
continue;
}
...
}
...
}
If a function is successfully resolved, it is inserted into the resolved tree.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
...
for (auto &&fn : ast) {
...
resolvedTree.emplace_back(std::move(resolvedFunctionDecl));
}
...
}
If any error happens during this pass, an empty resolved tree is returned.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveAST() {
...
if (error)
return {};
...
}
In the second pass, the body of every function in the
resolved tree is resolved. The first function in the tree is
the built-in
println()
, which can be skipped.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
...
for (size_t i = 1; i < resolvedTree.size(); ++i) {
...
}
...
}
The current function is also recorded inside the
Sema
class because to reason about some
statements the analyzer needs to know which function the
particular statement is in.
class Sema {
...
ResolvedFunctionDecl *currentFunction;
...
};
Next, the current function is set and a new scope is created
for its parameters, so they are visible to any
DeclRefExpr
that references them in the body.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
...
for (size_t i = 1; i < resolvedTree.size(); ++i) {
currentFunction = resolvedTree[i].get();
ScopeRAII paramScope(this);
for (auto &¶m : currentFunction->params)
insertDeclToCurrentScope(*param);
...
}
...
}
Then the body is resolved. This is the only place where a resolved node is modified after its creation, as its body has to be replaced with the resolved body. If any of the bodies cannot be resolved, the error is recorded.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
...
for (size_t i = 1; i < resolvedTree.size(); ++i) {
...
auto resolvedBody = resolveBlock(*ast[i - 1]->body);
if (!resolvedBody) {
error = true;
continue;
}
currentFunction->body = std::move(resolvedBody);
}
...
}
If an error happens during the resolution of any of the blocks an empty resolved tree is returned, otherwise the valid resolved tree is returned.
std::vector<std::unique_ptr<ResolvedFunctionDecl>> Sema::resolveSourceFile() {
...
if (error)
return {};
return resolvedTree;
}
With a simple type system like this one, a type is
automatically considered resolved if it is
void
or number
.
std::optional<Type> Sema::resolveType(Type parsedType) {
if (parsedType.kind == Type::Kind::Custom)
return std::nullopt;
return parsedType;
}
In the case of functions, the return type is resolved first. If the resolution fails, an error is reported.
std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
std::optional<Type> type = resolveType(function.type);
if (!type)
return report(function.location, "function '" + function.identifier +
"' has invalid '" +
function.type.name + "' type");
...
};
If the current function is the main()
function
it is ensured that it is declared as void
with
no parameters.
std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
...
if (function.identifier == "main") {
if (type->kind != Type::Kind::Void)
return report(function.location,
"'main' function is expected to have 'void' type");
if (!function.params.empty())
return report(function.location,
"'main' function is expected to take no arguments");
}
...
};
Next, each of the parameters is resolved. A new scope is
created for the parameters so that their redeclarations can
be detected easily. If a parameter cannot be resolved, or is
the redeclaration of a previous parameter the resolution of
the function fails and a nullptr
is returned.
std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
...
std::vector<std::unique_ptr<ResolvedParamDecl>> resolvedParams;
ScopeRAII paramScope(this);
for (auto &¶m : function.params) {
auto resolvedParam = resolveParamDecl(*param);
if (!resolvedParam || !insertDeclToCurrentScope(*resolvedParam))
return nullptr;
resolvedParams.emplace_back(std::move(resolvedParam));
}
...
};
If everything is successful, a
ResolvedFunctionDecl
is returned without a body
node, which is added later.
std::unique_ptr<ResolvedFunctionDecl>
Sema::resolveFunctionDeclaration(const FunctionDecl &function) {
...
return std::make_unique<ResolvedFunctionDecl>(
function.location, function.identifier, *type, std::move(resolvedParams),
nullptr);
};
For a ParamDecl
only its type needs to be
resolved. If the resolution fails or the type is
void
an error is reported.
std::unique_ptr<ResolvedParamDecl>
Sema::resolveParamDecl(const ParamDecl ¶m) {
std::optional<Type> type = resolveType(param.type);
if (!type || type->kind == Type::Kind::Void)
return report(param.location, "parameter '" + param.identifier +
"' has invalid '" + param.type.name +
"' type");
return std::make_unique<ResolvedParamDecl>(param.location, param.identifier,
*type);
}
Resolving a block is equivalent to resolving all the statements inside that block. For each block, a new scope is created and if an error happens during the resolution of one statement, the semantic analyzer keeps resolving the remaining statements. This allows the compiler to report multiple error messages together at the cost of some introduced false positives.
std::unique_ptr<ResolvedBlock> Sema::resolveBlock(const Block &block) {
std::vector<std::unique_ptr<ResolvedStmt>> resolvedStatements;
bool error = false;
...
ScopeRAII blockScope(this);
for (auto &&stmt : block.statements) {
auto resolvedStmt = resolveStmt(*stmt);
error |= !resolvedStatements.emplace_back(std::move(resolvedStmt));
if (error)
continue;
...
}
if (error)
return nullptr;
return std::make_unique<ResolvedBlock>(block.location,
std::move(resolvedStatements));
}
This function is the place where unreachable statements can be detected too. After the first return statement in a block, every other statement is considered unreachable, but a warning should only be reported for the first such statement.
std::unique_ptr<ResolvedBlock> Sema::resolveBlock(const Block &block) {
...
int reportUnreachableCount = 0;
...
for (auto &&stmt : block.statements) {
...
if (reportUnreachableCount == 1) {
report(stmt->location, "unreachable statement", true);
++reportUnreachableCount;
}
if (dynamic_cast<ReturnStmt *>(stmt.get()))
++reportUnreachableCount;
}
...
}
Design Note
Some languages like C or C++ allow the declaration of labels to which the program can jump. If a label appears after a return statement, only the statements after the return and before the label are unreachable.
void foo() { return; int unreachable; label: int mightBeReachable; }
The resolveStmt()
method drives the resolution
of the various statements. It checks what the current
Stmt
is, and dispatches the corresponding
resolver method. Because this method is responsible for
handling every statement, its end should be unreachable,
which is marked with the
llvm_unreachable()
utility.
std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
if (auto *expr = dynamic_cast<const Expr *>(&stmt))
return resolveExpr(*expr);
if (auto *returnStmt = dynamic_cast<const ReturnStmt *>(&stmt));
return resolveReturnStmt(*returnStmt);
llvm_unreachable("unexpected statement");
}
For a ReturnStmt
it needs to be checked whether
it's in the correct form. Void functions are not allowed to
return values, while non-void functions are expected to
return a value of the return type of the function.
std::unique_ptr<ResolvedReturnStmt>
Sema::resolveReturnStmt(const ReturnStmt &returnStmt) {
if (currentFunction->type.kind == Type::Kind::Void && returnStmt.expr)
return report(returnStmt.location,
"unexpected return value in void function");
if (currentFunction->type.kind != Type::Kind::Void && !returnStmt.expr)
return report(returnStmt.location, "expected a return value");
...
}
If the return statement is in the correct form, the expression, whose result is returned is resolved. If the expression cannot be resolved, or the type of the expression doesn't match the return type of the current function, an error is reported.
std::unique_ptr<ResolvedReturnStmt>
Sema::resolveReturnStmt(const ReturnStmt &returnStmt) {
...
std::unique_ptr<ResolvedExpr> resolvedExpr;
if (returnStmt.expr) {
resolvedExpr = resolveExpr(*returnStmt.expr);
if (!resolvedExpr)
return nullptr;
if (currentFunction->type.kind != resolvedExpr->type.kind)
return report(resolvedExpr->location, "unexpected return type");
}
return std::make_unique<ResolvedReturnStmt>(returnStmt.location,
std::move(resolvedExpr));
}
For expression resolution resolveExpr()
acts as
the driver. It checks what the current expression is and
decides how the semantic analyzer should proceed. As it is
responsible for handling every expression, the end of the
function is unreachable similar to
resolveStmt()
.
std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
...
llvm_unreachable("unexpected expression");
}
A NumberLiteral
is resolved by its value being
converted into a double
. The grammar ensures
that its value is always a correct double literal, so the
conversion always succeeds.
std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
if (const auto *number = dynamic_cast<const NumberLiteral *>(&expr))
return std::make_unique<ResolvedNumberLiteral>(number->location,
std::stod(number->value));
...
}
Resolving a DeclRefExpr
is a more complex
process, so it gets a dedicated method that handles it.
std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
...
if (const auto *declRefExpr = dynamic_cast<const DeclRefExpr *>(&expr))
return resolveDeclRefExpr(*declRefExpr);
...
}
First resolveDeclRefExpr()
looks up the
referenced symbol. If the symbol is not found, it wasn't
declared in the source code, so an error is reported.
std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, ...) {
ResolvedDecl *decl = lookupDecl(declRefExpr.identifier).first;
if (!decl)
return report(declRefExpr.location,
"symbol '" + declRefExpr.identifier + "' not found");
...
}
The method also takes a flag with a default value set to
false
indicating if it's the callee of a
function call.
class Sema {
...
std::unique_ptr<ResolvedDeclRefExpr>
resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee = false);
...
};
This flag allows the semantic analyzer to catch the case when a function declaration is referenced outside a call expression. In that case, the programmer must have forgotten to call it and this error can be reported.
std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee) {
...
if (!isCallee && dynamic_cast<ResolvedFunctionDecl *>(decl))
return report(declRefExpr.location,
"expected to call function '" + declRefExpr.identifier + "'");
...
}
If no problem is found, the
ResolvedDeclRefExpr
is returned storing a
pointer to the resolved declaration.
std::unique_ptr<ResolvedDeclRefExpr>
Sema::resolveDeclRefExpr(const DeclRefExpr &declRefExpr, bool isCallee) {
...
return std::make_unique<ResolvedDeclRefExpr>(declRefExpr.location, *decl);
}
For a CallExpr
first, it is checked that the
callee is indeed a reference to a declaration and it's not
an arbitrary expression being called.
std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
const auto *dre = dynamic_cast<const DeclRefExpr *>(call.callee.get());
if (!dre)
return report(call.location, "expression cannot be called as a function");
...
}
If the callee is a reference to a declaration, it is resolved first, so that a possible calling of an unknown symbol is caught and then it is checked if this resolved reference is for a function declaration.
std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
...
varOrReturn(resolvedCallee, resolveDeclRefExpr(*call.identifier, true));
const auto *resolvedFunctionDecl =
dynamic_cast<const ResolvedFunctionDecl *>(resolvedCallee->decl);
if (!resolvedFunctionDecl)
return report(call.location, "calling non-function symbol");
...
}
Then it's checked if the number of arguments in the call is equivalent to the number of parameters the called function has. If the argument count doesn't match, the semantic analyzer errors out.
std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
...
if (call.arguments.size() != resolvedFunctionDecl->params.size())
return report(call.location, "argument count mismatch in function call");
...
}
When the argument count matches, each of the arguments is resolved and its type is compared against the type of the corresponding parameter in the called function. If one of the arguments has a mismatching type, an error is reported.
std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
...
std::vector<std::unique_ptr<ResolvedExpr>> resolvedArguments;
int idx = 0;
for (auto &&arg : call.arguments) {
varOrReturn(resolvedArg, resolveExpr(*arg));
if (resolvedArg->type.kind != resolvedFunctionDecl->params[idx]->type.kind)
return report(resolvedArg->location, "unexpected type of argument");
++idx;
resolvedArguments.emplace_back(std::move(resolvedArg));
}
...
}
If both the callee and the arguments are resolved
successfully, a ResolvedCallExpr
is returned.
std::unique_ptr<ResolvedCallExpr> Sema::resolveCallExpr(const CallExpr &call) {
...
return std::make_unique<ResolvedCallExpr>(
call.location, *resolvedFunctionDecl, std::move(resolvedArguments));
}