User-defined variables allow the programmer to store and manipulate values in memory. They are also essential for some other constructs to work properly. Right now the language doesn't have any way to modify a value, so for example loops can only be infinitely running or not running at all.
The language supports two kinds of variables, mutable and immutable. Immutable variables cannot be changed over time, once they are assigned a value, the value remains the same for the entire lifetime of the variable. This allows the compiler to do further optimization and reason about values better.
fn main(): void {
let x = 3;
...
// 'x' is immutable, the condition is always true
if x < 4 {
....
}
}
Design Note
Reasoning about immutable values is only possible if there really is no way to mutate them. For example, in C++ it is possible to have
const
variables, which means that the value of the variable cannot be changed, but the language also providesconst_cast<>()
, which allows the mutation of such values.const int x = 3; const int *ptr = &x; *const_cast<int*>(ptr) = 2; assert(ptr == &x && *ptr == 2);
Even though the result of the above snippet is undefined behaviour, the compiler cannot assume that the value never changes, as it's simply not true.
Immutable variables can be defined with the
let
keyword, while mutable variables are
defined with the var
keyword.
let immutable ...;
var mutable ...;
The statement above is called a declaration statement, as it's a statement that declares a variable.
<declStmt>
::= ('let' | 'var') <varDecl> ';'
In a variable declaration, the identifier is followed by a type annotation and an optional initializer. If an initializer is given, however, the type annotation is not necessary, because the compiler can figure out the type from the initializer.
let x: number; // type annotation necessary
var y = 1; // type inferred from the initializer
In the grammar, this is described by the following rule.
<varDecl>
::= <identifier> (':' <type>)? ('=' <expr>)?
The previously seen grammar rules introduce the
let
and var
keywords and the
=
token, which need to be added to the lexer.
enum class TokenKind : char {
...
Equal,
KwLet,
KwVar,
...
};
const std::unordered_map<std::string_view, TokenKind> keywords = {
...,
{"let", TokenKind::KwLet}, {"var", TokenKind::KwVar}};
The =
token changes the logic of
getNextToken()
too because so far it can only
lex ==
. It is modified such that if the current
character is '='
, it peeks the next one and if
that is not an '='
the lexer returns the
=
token, otherwise, it returns ==
.
Token Lexer::getNextToken() {
...
if (currentChar == '=') {
if (peekNextChar() != '=')
return Token{tokenStartLocation, TokenKind::Equal};
eatNextChar();
return Token{tokenStartLocation, TokenKind::EqualEqual};
}
...
}
The AST of a variable declaration encapsulates its identifier, the optional type annotation and initializer and whether the variable is mutable or not.
struct VarDecl : public Decl {
std::optional<Type> type;
std::unique_ptr<Expr> initializer;
bool isMutable;
VarDecl(SourceLocation location,
std::string identifier,
std::optional<Type> type,
bool isMutable,
std::unique_ptr<Expr> initializer = nullptr)
: Decl(location, std::move(identifier)),
type(std::move(type)),
initializer(std::move(initializer)),
isMutable(isMutable) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node contains its identifier and the type annotation along with the initializer if any of them are present.
void VarDecl::dump(size_t level) const {
std::cerr << indent(level) << "VarDecl: " << identifier;
if (type)
std::cerr << ':' << type->name;
std::cerr << '\n';
if (initializer)
initializer->dump(level + 1);
}
The AST of the declaration statement is just a wrapper
around a VarDecl
.
struct DeclStmt : public Stmt {
std::unique_ptr<VarDecl> varDecl;
DeclStmt(SourceLocation location, std::unique_ptr<VarDecl> varDecl)
: Stmt(location),
varDecl(std::move(varDecl)) {}
void dump(size_t level = 0) const override;
};
Its textual representation also contains the wrapped
VarDecl
only.
void DeclStmt::dump(size_t level) const {
std::cerr << indent(level) << "DeclStmt:\n";
varDecl->dump(level + 1);
}
The parsing of the DeclStmt
is driven by the
parseStmt()
method.
std::unique_ptr<Stmt> Parser::parseStmt() {
...
if (nextToken.kind == TokenKind::KwLet || nextToken.kind == TokenKind::KwVar)
return parseDeclStmt();
...
}
First, the current token is stored, because it is needed to figure out if the declared variable is mutable or not.
std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
Token tok = nextToken;
...
}
The leading let
or var
keyword is
eaten and it's checked if the upcoming token is an
identifier before the VarDecl
is parsed. The
parseVarDecl()
method needs to be provided with
the information if it's inside a let
statement.
std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
...
eatNextToken(); // eat 'let' | 'var'
matchOrReturn(TokenKind::Identifier, "expected identifier");
varOrReturn(varDecl, parseVarDecl(tok.kind == TokenKind::KwLet));
...
}
After parsing the variable declaration, the trailing
;
is matched and if everything is successful, a
DeclStmt
node is returned.
std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
...
matchOrReturn(TokenKind::Semi, "expected ';' after declaration");
eatNextToken(); // eat ';'
return std::make_unique<DeclStmt>(tok.location, std::move(varDecl));
}
The parseVarDecl()
method takes a flag as a
parameter that shows if the variable is inside a
let
statement. Parsing the variable declaration
is a bit more complex since both the type annotation and the
initializer are optional. It is known however that the first
token is an identifier.
std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
SourceLocation location = nextToken.location;
std::string identifier = *nextToken.value;
eatNextToken(); // eat identifier
...
}
Then the parser checks if the identifier is followed by a
:
, which indicates a type annotation. If this
is the case, the type annotation is parsed.
std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
...
std::optional<Type> type;
if (nextToken.kind == TokenKind::Colon) {
eatNextToken(); // eat ':'
type = parseType();
if (!type)
return nullptr;
}
...
}
Once the type is handled, the parser checks if there is an
=
token, which indicates an initializer. If it
doesn't find such a token, it returns the AST node,
otherwise, it parses the initializer and returns a node
including that too.
std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
...
if (nextToken.kind != TokenKind::Equal)
return std::make_unique<VarDecl>(location, identifier, type, !isLet);
eatNextToken(); // eat '='
varOrReturn(initializer, parseExpr());
return std::make_unique<VarDecl>(location, identifier, type, !isLet,
std::move(initializer));
}
The resolved nodes are identical to their non-resolved counterparts, except for containing other resolved nodes.
The ResolvedVarDecl
node no longer stores an
optional type, as it must have its type set by the semantic
analyzer.
struct ResolvedVarDecl : public ResolvedDecl {
std::unique_ptr<ResolvedExpr> initializer;
bool isMutable;
ResolvedVarDecl(SourceLocation location,
std::string identifier,
Type type,
bool isMutable,
std::unique_ptr<ResolvedExpr> initializer = nullptr)
: ResolvedDecl(location, std::move(identifier), type),
initializer(std::move(initializer)),
isMutable(isMutable) {}
void dump(size_t level = 0) const override;
};
Since it's a declaration, its textual representation contains its address as well as its identifier and optional initializer.
void ResolvedVarDecl::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedVarDecl: @(" << this << ") "
<< identifier << ':' << '\n';
if (initializer)
initializer->dump(level + 1);
}
The resolved node of a DeclStmt
is also just a
wrapper around a VarDecl
as its AST node.
struct ResolvedDeclStmt : public ResolvedStmt {
std::unique_ptr<ResolvedVarDecl> varDecl;
ResolvedDeclStmt(SourceLocation location,
std::unique_ptr<ResolvedVarDecl> varDecl)
: ResolvedStmt(location),
varDecl(std::move(varDecl)) {}
void dump(size_t level = 0) const override;
};
Their textual representations are also identical.
void ResolvedDeclStmt::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedDeclStmt:\n";
varDecl->dump(level + 1);
}
The resolution of the DeclStmt
is driven by
resolveStmt()
.
std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
...
if (auto *declStmt = dynamic_cast<const DeclStmt *>(&stmt))
return resolveDeclStmt(*declStmt);
...
}
First, the VarDecl
is resolved, and if the
resolution is successful, the variable is inserted into the
current scope.
std::unique_ptr<ResolvedDeclStmt>
Sema::resolveDeclStmt(const DeclStmt &declStmt) {
varOrReturn(resolvedVarDecl, resolveVarDecl(*declStmt.varDecl));
if (!insertDeclToCurrentScope(*resolvedVarDecl))
return nullptr;
return std::make_unique<ResolvedDeclStmt>(declStmt.location,
std::move(resolvedVarDecl));
}
A VarDecl
must either have a type annotation or
an initializer. If both of them are missing, an error is
reported.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
if (!varDecl.type && !varDecl.initializer)
return report(
varDecl.location,
"an uninitialized variable is expected to have a type specifier");
...
}
To be able to infer the type, if there is an initializer it has to be resolved first.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
...
std::unique_ptr<ResolvedExpr> resolvedInitializer = nullptr;
if (varDecl.initializer) {
resolvedInitializer = resolveExpr(*varDecl.initializer);
if (!resolvedInitializer)
return nullptr;
}
...
}
The type that needs to be resolved is either the type of the variable or if it's not specified, the type of the initializer.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
...
Type resolvableType = varDecl.type.value_or(resolvedInitializer->type);
std::optional<Type> type = resolveType(resolvableType);
...
}
If the type cannot be resolved, or it is
void
an error is reported to the user.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
...
if (!type || type->kind == Type::Kind::Void)
return report(varDecl.location, "variable '" + varDecl.identifier +
"' has invalid '" +
resolvableType.name + "' type");
...
}
Finally, the case is handled when there is both an initializer and a type specifier, but they are of different types. The initializer is also evaluated by the constant expression evaluator.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
...
if (resolvedInitializer) {
if (resolvedInitializer->type.kind != type->kind)
return report(resolvedInitializer->location, "initializer type mismatch");
resolvedInitializer->setConstantValue(cee.evaluate(*resolvedInitializer, false));
}
...
}
If everything is successful, the resolved node is returned.
std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
...
return std::make_unique<ResolvedVarDecl>(varDecl.location, varDecl.identifier,
*type, varDecl.isMutable,
std::move(resolvedInitializer));
}
With the introduction of variables, the components
responsible for code analysis can also be extended. In the
tree walk interpreter, it now makes sense to handle
DeclRefExpr
too. Before the introduction of
variables, this expression could only point to parameters
and functions, but none of those can be evaluated in compile
time.
std::optional<double>
ConstantExpressionEvaluator::evaluate(const ResolvedExpr &expr,
bool allowSideEffects) {
...
if (const auto *declRefExpr =
dynamic_cast<const ResolvedDeclRefExpr *>(&expr))
return evaluateDeclRefExpr(*declRefExpr, allowSideEffects);
...
}
If a variable is immutable and initialized upon declaration, its value is set by the initializer and doesn't change over time. Because of this, the value of the variable is known in compile time.
std::optional<double>
ConstantExpressionEvaluator::evaluateDeclRefExpr(const ResolvedDeclRefExpr &dre,
bool allowSideEffects) {
const auto *rvd = dynamic_cast<const ResolvedVarDecl *>(dre.decl);
if (!rvd || rvd->isMutable || !rvd->initializer)
return std::nullopt;
return evaluate(*rvd->initializer, allowSideEffects);
}
The control flow graph can also be extended with the new variable declarations, which extension becomes necessary soon.
int CFGBuilder::insertStmt(const ResolvedStmt &stmt, int block) {
...
if (auto *declStmt = dynamic_cast<const ResolvedDeclStmt *>(&stmt))
return insertDeclStmt(*declStmt, block);
...
}
Because the CFG is constructed in reverse order, from bottom
to top, first the DeclStmt
is inserted, then
the initializer if there's one.
int CFGBuilder::insertDeclStmt(const ResolvedDeclStmt &stmt, int block) {
cfg.insertStmt(&stmt, block);
if (const auto &init = stmt.varDecl->initializer)
return insertExpr(*init, block);
return block;
}
The LLVM IR generation is driven by
generateStmt()
.
llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
...
if (auto *declStmt = dynamic_cast<const ResolvedDeclStmt *>(&stmt))
return generateDeclStmt(*declStmt);
...
}
When a variable is declared, first some space on the stack has to be allocated for it. If it's initialized, the value of the initializer is stored in the variable's stack space. Finally, the stack space of the variable is mapped to its declaration.
llvm::Value *Codegen::generateDeclStmt(const ResolvedDeclStmt &stmt) {
llvm::Function *function = getCurrentFunction();
const auto *decl = stmt.varDecl.get();
llvm::AllocaInst *var = allocateStackVariable(function, decl->identifier);
if (const auto &init = decl->initializer)
builder.CreateStore(generateExpr(*init), var);
declarations[decl] = var;
return nullptr;
}
Now variables can be created and initialized, but their values cannot be changed even if they are mutable. To address this issue, the language must support assignments too.
<assignment>
::= <declRefExpr> '=' <expr>
The grammar is constructed such that assignments are easy to
parse and analyze later. Since only declared values can be
assigned, the LHS is a DeclRefExpr
.
Design Note
In more complex languages the LHS of an assignment can be an arbitrary expression as even
a.b().c.d = 1;
can mean a valid assignment. In this case, it's the responsibility of the semantic analyzer to ensure that the LHS is indeed a variable that can be assigned a new value.
This grammar rule doesn't introduce any new tokens, so the lexer doesn't need to be modified before the assignment can be parsed.
The Assignment
node encapsulated the
DeclRefExpr
on the left-hand side and the
arbitrary expression on the right-hand side.
struct Assignment : public Stmt {
std::unique_ptr<DeclRefExpr> variable;
std::unique_ptr<Expr> expr;
Assignment(SourceLocation location,
std::unique_ptr<DeclRefExpr> variable,
std::unique_ptr<Expr> expr)
: Stmt(location),
variable(std::move(variable)),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
The textual representation of the
Assignment
includes the assigned variable and
the expression whose result is being assigned.
void Assignment::dump(size_t level) const {
std::cerr << indent(level) << "Assignment:\n";
variable->dump(level + 1);
expr->dump(level + 1);
}
Since both an expression and an assignment can start with a
DeclRefExpr
on the LHS, a dedicated
parseAssignmentOrExpr()
method is introduced
that helps the parser determine which of the two statements
to parse. It is called in parseStmt()
when the
next token doesn't match any of the conditions.
std::unique_ptr<Stmt> Parser::parseStmt() {
...
return parseAssignmentOrExpr();
}
This method first parses a prefix expression and if it's not
followed by an =
token, it treats this prefix
expression as the LHS of a more complex expression and
proceeds accordingly. When the complex expression is parsed,
the ;
at the end is matched and the AST node is
returned.
std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
varOrReturn(lhs, parsePrefixExpr());
if (nextToken.kind != TokenKind::Equal) {
varOrReturn(expr, parseExprRHS(std::move(lhs), 0));
matchOrReturn(TokenKind::Semi, "expected ';' at the end of expression");
eatNextToken(); // eat ';'
return expr;
}
...
}
If the next token is =
, the parsed expression
must be the LHS of an assignment, so it is checked to be a
DeclRefExpr
.
std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
...
auto *dre = dynamic_cast<DeclRefExpr *>(lhs.get());
if (!dre)
return report(lhs->location,
"expected variable on the LHS of an assignment");
...
}
If the expression is successfully cast to a
DeclRefExpr
, it's treated as the LHS of an
assignment and handed over to
parseAssignmentRHS()
.
std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
...
std::ignore = lhs.release();
varOrReturn(assignment,
parseAssignmentRHS(std::unique_ptr<DeclRefExpr>(dre)));
...
}
Design Note
The general pattern to
dynamic_cast<>
astd::unique_ptr
without causing a memory leak is to first cast the raw pointer of the managed object, and then transfer the ownership to a newstd::unique_ptr
of the target type.The initial casting of the raw pointer ensures that the pointed object is of the correct type. If the cast fails, the
std::unique_ptr
still keeps its ownership over the pointed object.auto *dre = dynamic_cast<DeclRefExpr *>(lhs.get());
If the cast succeeds, the ownership of the object must be transferred to a new
std::unique_ptr
of the target type, which is done by first releasing the object from the old pointer and passing it to the new one.if (dre) { lhs.release(); std::unique_ptr<DeclRefExpr> ptr(dre); }
This way the object is always managed by a
std::unique_ptr
and is not going to cause a memory leak.The common mistake is to release the ownership of the object before casting.
auto *dre = dynamic_cast<DeclRefExpr *>(lhs.release()); std::unique_ptr<DeclRefExpr> ptr(dre);
This way the object is released from
lhs
before the cast is performed and if the cast fails, there is no way to get back the pointer to the object, so the object is lost in memory.
After the parsing of the assignment is successful, the
trailing ;
is matched and the node is returned.
std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
...
matchOrReturn(TokenKind::Semi, "expected ';' at the end of assignment");
eatNextToken(); // eat ';'
return assignment;
}
The parseAssignmentRHS()
method is similar to
parseExpressionRHS()
in that it also takes the
LHS as a parameter. It then proceeds by eating the
=
token, parsing the arbitrary expression on
the RHS and returning the assignment node.
std::unique_ptr<Assignment>
Parser::parseAssignmentRHS(std::unique_ptr<DeclRefExpr> lhs) {
SourceLocation location = nextToken.location;
eatNextToken(); // eat '='
varOrReturn(rhs, parseExpr());
return std::make_unique<Assignment>(location, std::move(lhs), std::move(rhs));
}
The resolved assignment node captures the resolved LHS and resolved RHS.
struct ResolvedAssignment : public ResolvedStmt {
std::unique_ptr<ResolvedDeclRefExpr> variable;
std::unique_ptr<ResolvedExpr> expr;
ResolvedAssignment(SourceLocation location,
std::unique_ptr<ResolvedDeclRefExpr> variable,
std::unique_ptr<ResolvedExpr> expr)
: ResolvedStmt(location),
variable(std::move(variable)),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
Its textual representation is identical to its non-resolved counterpart.
void ResolvedAssignment::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedAssignment:\n";
variable->dump(level + 1);
expr->dump(level + 1);
}
The resolution of the statement is driven by the
resolveStmt()
method.
std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
...
if (auto *assignment = dynamic_cast<const Assignment *>(&stmt))
return resolveAssignment(*assignment);
...
}
First, the LHS and the RHS of the node is resolved.
std::unique_ptr<ResolvedAssignment>
Sema::resolveAssignment(const Assignment &assignment) {
varOrReturn(resolvedLHS, resolveDeclRefExpr(*assignment.variable));
varOrReturn(resolvedRHS, resolveExpr(*assignment.expr));
...
}
It's known that the referenced declaration on the LHS cannot
be void
, as only function declarations are
allowed to have a void
type, however, if the
LHS is a reference to a function declaration, it couldn't
have been resolved. Also, function parameters are immutable,
so assigning a value to any of them is an error.
std::unique_ptr<ResolvedAssignment>
Sema::resolveAssignment(const Assignment &assignment) {
...
if (dynamic_cast<const ResolvedParamDecl *>(resolvedLHS->decl))
return report(resolvedLHS->location,
"parameters are immutable and cannot be assigned");
...
}
If the LHS is not a parameter, it must be a variable. If the type of the variable doesn't match the type of the assigned value, an error is reported. Otherwise, the RHS is tried to be evaluated as a constant expression and the resolved node is returned.
std::unique_ptr<ResolvedAssignment>
Sema::resolveAssignment(const Assignment &assignment) {
...
auto *var = dynamic_cast<const ResolvedVarDecl *>(resolvedLHS->decl);
if (resolvedRHS->type.kind != resolvedLHS->type.kind)
return report(resolvedRHS->location,
"assigned value type doesn't match variable type");
resolvedRHS->setConstantValue(cee.evaluate(*resolvedRHS, false));
return std::make_unique<ResolvedAssignment>(
assignment.location, std::move(resolvedLHS), std::move(resolvedRHS));
}
Assignments need to be introduced to the CFG builder for flow-sensitive analysis to work properly.
int CFGBuilder::insertStmt(const ResolvedStmt &stmt, int block) {
...
if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(&stmt))
return insertAssignment(*assignment, block);
...
}
Only the Assignment
itself and the assigned
expression are inserted into the CFG to avoid ambiguities.
int CFGBuilder::insertAssignment(const ResolvedAssignment &stmt, int block) {
cfg.insertStmt(&stmt, block);
return insertExpr(*stmt.expr, block);
}
Design Note
Consider the CFG of
x = x;
if theDeclRefExpr
on the LHS is also inserted.ResolvedDeclRefExpr: x ResolvedDeclRefExpr: x ResolvedAssignment: ...
By traversing the CFG both the LHS and the RHS look the same. This means that in an arbitrary situation, there is no way to know whether a
ResolvedDeclRefExpr
is reading the value or writing it without checking the next statement. This is also an issue in the Clang CFG, where both sides of an assignment are inserted like this.
As the IR generation of every other statement, the
generation of the assignment is also driven by
generateStmt()
.
llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
...
if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(&stmt))
return generateAssignment(*assignment);
...
}
An assignment is simply a store instruction that places the value of the RHS expression into the stack space of the referenced variable.
llvm::Value *Codegen::generateAssignment(const ResolvedAssignment &stmt) {
return builder.CreateStore(generateExpr(*stmt.expr),
declarations[stmt.variable->decl]);
}
Lazy initialization allows the programmer to declare a variable and initialize it later. This can come in handy if the value of an immutable variable depends on some other variable.
fn calculateAddress(base: number, offset: number, useLegacyMode: number): number {
let pointerSize: number;
if useLegacyMode {
pointerSize = 4;
} else {
pointerSize = 8;
}
...
}
This feature can however quickly lead to error-prone code if the compiler doesn't perform proper analysis to prevent it. For example, the value might not be initialized before it is used.
fn calculateAddress(base: number, offset: number, useLegacyMode: number): number {
let pointerSize: number;
if useLegacyMode {
pointerSize = 4;
}
// uninitialized if 'useLegacyMode' is false
println(pointerSize);
...
}
Another problem that can happen is the mutation of an immutable variable in a loop.
fn mutateImmutable(): void {
let lazy: number;
var i = 2;
while i > 0 {
// the first iteration initializes,
// the second mutates the value
lazy = i;
i = i - 1;
}
}
Intuitively if a variable is not initialized, the first time it appears on the left side of an assignment, it gets initialized, and the second time it gets mutated. Detecting the first and second assignments, however, is not a trivial task as every path of execution needs to be considered.
fn noMutation(): void {
let lazy: number;
// seemingly infinite loop
while 1 {
lazy = 1;
// exits after the first iteration, no mutation
return;
}
}
fn mutation(condition: number): void {
let lazy: number;
if condition {
lazy = 1;
}
// initialization if the condition is false
// mutation otherwise
lazy = 2;
}
Performing the analysis with regular AST traversal would prove to be difficult, as there are a lot of variables to keep in mind, like if the loop iterates more than once or branches of nested conditions. There is however a technique called data flow analysis, which can easily reason about this problem.
Data flow analysis is a technique that collects information about how certain values change over the execution of the program without actually executing it.
Its core is the traversal of the control flow graph of a function until the result of the analysis changes. More formally, data flow analysis is a fixpoint iteration over the CFG of a function.
The most important data structure in data flow analysis is
the Lattice
. A lattice is an ordered data
structure, that stores the information about the program in
each basic block. In the case of lazy initialization, it is
modelled as a mapping of a declaration to its assignment
state.
At a certain point in the program, a variable is either
Assigned
, Unassigned
, both of
them, or neither of them. A variable can be both assigned
and unassigned after an if statement, where only the true
branch assigns a value to it. This state is called
Top
in the mapping. The neither case happens
when a variable is not declared or not found in the lattice,
so the analyzer has no information about it. This state is
called Bottom
.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
enum class State { Bottom, Unassigned, Assigned, Top };
using Lattice = std::map<const ResolvedVarDecl *, State>;
...
}
Notice how the first value in State
is
Bottom
. If a state is not found in the map, one
will automatically be default constructed. The default
constructor of an enum yields its first value, so
Bottom
is returned.
At a certain point during the analysis, multiple lattices need to be combined (joined) with each other. In such cases, the states are combined in the following way.
State
has the same value, their
joined value is the same too. E.g.:
join(Assigned, Assigned) -> Assigned
State
is Bottom
, the
result is the other state. E.g.:
join(Bottom, Unassigned) -> Unassigned
Assigned
and the other is
Unassigned
, the result is Top
.
E.g.: joint(Assigned, Unassigned) -> Top
Top
, the result
is Top
too. E.g.:
join(Unassigned, Top) -> Top
As a result, after performing multiple joins over time, the
result either stays the same or reaches the
Top
state and then keeps staying
Top
. Formally the result reaches a fixpoint
over time.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
auto joinStates = [](State s1, State s2) {
if (s1 == s2)
return s1;
if (s1 == State::Bottom)
return s2;
if (s2 == State::Bottom)
return s1;
return State::Top;
};
...
}
Initially, every CFG block is assigned an empty lattice, which is filled in during the traversal.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
std::vector<Lattice> curLattices(cfg.basicBlocks.size());
...
}
Because the analysis iterates over the CFG multiple times, any reported error can appear more than once. To tackle this problem, instead of directly reporting an error, it is first put into a container and gets processed only after the analysis finishes.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
std::vector<std::pair<SourceLocation, std::string>> pendingErrors;
...
}
As mentioned before the core of the analysis is a fixpoint
iteration over the CFG. At the start of each iteration, the
changed
flag is set to false and the pending
errors are cleared.
Then each basic block is visited starting with the entry
block and a new temporary lattice is created for the current
block. After the statements of the block have been
processed, it's checked if the temporary block is different
from the previously known block, stored in
curLattices
.
If the block changed, the changed
flag is set
to true, so that one more iteration is performed and the
temporary block becomes the new known block.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
bool changed = true;
while (changed) {
changed = false;
pendingErrors.clear();
for (int bb = cfg.entry; bb != cfg.exit; --bb) {
Lattice tmp;
...
if (curLattices[bb] != tmp) {
curLattices[bb] = tmp;
changed = true;
}
}
}
...
}
After the creation of the temporary lattice, the lattices of the predecessor blocks are joined together in it. This way the information at the end of each predecessor block is transferred into the temporary lattice.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (int bb = cfg.entry; bb != cfg.exit; --bb) {
...
const auto &[preds, succs, stmts] = cfg.basicBlocks[bb];
for (auto &&pred : preds)
for (auto &&[decl, state] : curLattices[pred.first])
tmp[decl] = joinStates(tmp[decl], state);
...
}
...
}
Once the temporary lattice is prepared, the processing of statements in the current block can begin. Since the CFG was built from bottom to top, the statements need to be visited in reverse order.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (int bb = cfg.entry; bb != cfg.exit; --bb) {
...
for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
const ResolvedStmt *stmt = *it;
...
}
...
}
...
}
Upon seeing a DeclStmt
, the variable is added
to the lattice and the state is set to
Assigned
or Unassigned
depending
on whether the variable is initialized or not.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
...
if (const auto *decl = dynamic_cast<const ResolvedDeclStmt *>(stmt)) {
tmp[decl->varDecl.get()] =
decl->varDecl->initializer ? State::Assigned : State::Unassigned;
continue;
}
...
}
...
}
In the case of an Assignment
if the LHS is an
immutable variable and the current state is not
Unassigned
, a pending error is created, as this
is the mutation of an immutable value. In every other case,
the state of the LHS is set to Assigned
.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
...
if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(stmt)) {
const auto *var =
dynamic_cast<const ResolvedVarDecl *>(assignment->variable->decl);
if (!var->isMutable && tmp[var] != State::Unassigned) {
std::string msg = '\'' + var->identifier + "' cannot be mutated";
pendingErrors.emplace_back(assignment->location, std::move(msg));
}
tmp[var] = State::Assigned;
continue;
}
...
}
...
}
The third and last statement the analyzer cares about is the
DeclRefExpr
. This statement reads the value of
the variable, so if the referenced variable is not in the
Assigned
state, a pending error is created.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
...
if (const auto *dre = dynamic_cast<const ResolvedDeclRefExpr *>(stmt)) {
const auto *var = dynamic_cast<const ResolvedVarDecl *>(dre->decl);
if (var && tmp[var] != State::Assigned) {
std::string msg = '\'' + var->identifier + "' is not initialized";
pendingErrors.emplace_back(dre->location, std::move(msg));
}
continue;
}
}
...
}
Finally when the lattices no longer change, the pending errors are reported and the function returns if it found any errors.
bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
const CFG &cfg) {
...
for (auto &&[loc, msg] : pendingErrors)
report(loc, msg);
return !pendingErrors.empty();
}
Along with the other flow-sensitive check, this analysis is
also invoked in runFlowSensitiveChecks()
.
bool Sema::runFlowSensitiveChecks(const ResolvedFunctionDecl &fn) {
...
error |= checkVariableInitialization(fn, cfg);
...
};