Variables

User-defined variables allow the programmer to store and manipulate values in memory. They are also essential for some other constructs to work properly. Right now the language doesn't have any way to modify a value, so for example loops can only be infinitely running or not running at all.

Immutability

The language supports two kinds of variables, mutable and immutable. Immutable variables cannot be changed over time, once they are assigned a value, the value remains the same for the entire lifetime of the variable. This allows the compiler to do further optimization and reason about values better.

fn main(): void {
    let x = 3;
    
    ...

    // 'x' is immutable, the condition is always true
    if x < 4 {
        ....
    }
}

Design Note

Reasoning about immutable values is only possible if there really is no way to mutate them. For example, in C++ it is possible to have const variables, which means that the value of the variable cannot be changed, but the language also provides const_cast<>(), which allows the mutation of such values.
const int x = 3;

const int *ptr = &x;
*const_cast<int*>(ptr) = 2;

assert(ptr == &x && *ptr == 2);
Even though the result of the above snippet is undefined behaviour, the compiler cannot assume that the value never changes, as it's simply not true.

Variables

Immutable variables can be defined with the let keyword, while mutable variables are defined with the var keyword.

let immutable ...;
var mutable ...;

The statement above is called a declaration statement, as it's a statement that declares a variable.

<declStmt>
  ::= ('let' | 'var') <varDecl> ';'

In a variable declaration, the identifier is followed by a type annotation and an optional initializer. If an initializer is given, however, the type annotation is not necessary, because the compiler can figure out the type from the initializer.

let x: number; // type annotation necessary
var y = 1;     // type inferred from the initializer

In the grammar, this is described by the following rule.

<varDecl>
  ::= <identifier> (':' <type>)? ('=' <expr>)?

The previously seen grammar rules introduce the let and var keywords and the = token, which need to be added to the lexer.

enum class TokenKind : char {
  ...
  Equal,
  KwLet,
  KwVar,
  ...
};

const std::unordered_map<std::string_view, TokenKind> keywords = {
    ...,    
    {"let", TokenKind::KwLet}, {"var", TokenKind::KwVar}};

The = token changes the logic of getNextToken() too because so far it can only lex ==. It is modified such that if the current character is '=', it peeks the next one and if that is not an '=' the lexer returns the = token, otherwise, it returns ==.

Token Lexer::getNextToken() {
  ...

  if (currentChar == '=') {
    if (peekNextChar() != '=')
      return Token{tokenStartLocation, TokenKind::Equal};

    eatNextChar();
    return Token{tokenStartLocation, TokenKind::EqualEqual};
  }

  ...
}

The AST of a variable declaration encapsulates its identifier, the optional type annotation and initializer and whether the variable is mutable or not.

struct VarDecl : public Decl {
  std::optional<Type> type;
  std::unique_ptr<Expr> initializer;
  bool isMutable;

  VarDecl(SourceLocation location,
          std::string identifier,
          std::optional<Type> type,
          bool isMutable,
          std::unique_ptr<Expr> initializer = nullptr)
      : Decl(location, std::move(identifier)),
        type(std::move(type)),
        initializer(std::move(initializer)),
        isMutable(isMutable) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the node contains its identifier and the type annotation along with the initializer if any of them are present.

void VarDecl::dump(size_t level) const {
  std::cerr << indent(level) << "VarDecl: " << identifier;
  if (type)
    std::cerr << ':' << type->name;
  std::cerr << '\n';

  if (initializer)
    initializer->dump(level + 1);
}

The AST of the declaration statement is just a wrapper around a VarDecl.

struct DeclStmt : public Stmt {
  std::unique_ptr<VarDecl> varDecl;

  DeclStmt(SourceLocation location, std::unique_ptr<VarDecl> varDecl)
      : Stmt(location),
        varDecl(std::move(varDecl)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation also contains the wrapped VarDecl only.

void DeclStmt::dump(size_t level) const {
  std::cerr << indent(level) << "DeclStmt:\n";
  varDecl->dump(level + 1);
}

The parsing of the DeclStmt is driven by the parseStmt() method.

std::unique_ptr<Stmt> Parser::parseStmt() {
  ...

  if (nextToken.kind == TokenKind::KwLet || nextToken.kind == TokenKind::KwVar)
    return parseDeclStmt();

  ...
}

First, the current token is stored, because it is needed to figure out if the declared variable is mutable or not.

std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
  Token tok = nextToken;
  ...
}

The leading let or var keyword is eaten and it's checked if the upcoming token is an identifier before the VarDecl is parsed. The parseVarDecl() method needs to be provided with the information if it's inside a let statement.

std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
  ...
  eatNextToken(); // eat 'let' | 'var'

  matchOrReturn(TokenKind::Identifier, "expected identifier");
  varOrReturn(varDecl, parseVarDecl(tok.kind == TokenKind::KwLet));

  ...
}

After parsing the variable declaration, the trailing ; is matched and if everything is successful, a DeclStmt node is returned.

std::unique_ptr<DeclStmt> Parser::parseDeclStmt() {
  ...

  matchOrReturn(TokenKind::Semi, "expected ';' after declaration");
  eatNextToken(); // eat ';'

  return std::make_unique<DeclStmt>(tok.location, std::move(varDecl));
}

The parseVarDecl() method takes a flag as a parameter that shows if the variable is inside a let statement. Parsing the variable declaration is a bit more complex since both the type annotation and the initializer are optional. It is known however that the first token is an identifier.

std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
  SourceLocation location = nextToken.location;

  std::string identifier = *nextToken.value;
  eatNextToken(); // eat identifier

  ...
}

Then the parser checks if the identifier is followed by a :, which indicates a type annotation. If this is the case, the type annotation is parsed.

std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
  ...

  std::optional<Type> type;
  if (nextToken.kind == TokenKind::Colon) {
    eatNextToken(); // eat ':'

    type = parseType();
    if (!type)
      return nullptr;
  }

  ...
}

Once the type is handled, the parser checks if there is an = token, which indicates an initializer. If it doesn't find such a token, it returns the AST node, otherwise, it parses the initializer and returns a node including that too.

std::unique_ptr<VarDecl> Parser::parseVarDecl(bool isLet) {
  ...

  if (nextToken.kind != TokenKind::Equal)
    return std::make_unique<VarDecl>(location, identifier, type, !isLet);
  eatNextToken(); // eat '='

  varOrReturn(initializer, parseExpr());

  return std::make_unique<VarDecl>(location, identifier, type, !isLet,
                                   std::move(initializer));
}

The resolved nodes are identical to their non-resolved counterparts, except for containing other resolved nodes.

The ResolvedVarDecl node no longer stores an optional type, as it must have its type set by the semantic analyzer.

struct ResolvedVarDecl : public ResolvedDecl {
  std::unique_ptr<ResolvedExpr> initializer;
  bool isMutable;

  ResolvedVarDecl(SourceLocation location,
                  std::string identifier,
                  Type type,
                  bool isMutable,
                  std::unique_ptr<ResolvedExpr> initializer = nullptr)
      : ResolvedDecl(location, std::move(identifier), type),
        initializer(std::move(initializer)),
        isMutable(isMutable) {}

  void dump(size_t level = 0) const override;
};

Since it's a declaration, its textual representation contains its address as well as its identifier and optional initializer.

void ResolvedVarDecl::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedVarDecl: @(" << this << ") "
            << identifier << ':' << '\n';
  if (initializer)
    initializer->dump(level + 1);
}

The resolved node of a DeclStmt is also just a wrapper around a VarDecl as its AST node.

struct ResolvedDeclStmt : public ResolvedStmt {
  std::unique_ptr<ResolvedVarDecl> varDecl;

  ResolvedDeclStmt(SourceLocation location,
                   std::unique_ptr<ResolvedVarDecl> varDecl)
      : ResolvedStmt(location),
        varDecl(std::move(varDecl)) {}

  void dump(size_t level = 0) const override;
};

Their textual representations are also identical.

void ResolvedDeclStmt::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedDeclStmt:\n";
  varDecl->dump(level + 1);
}

The resolution of the DeclStmt is driven by resolveStmt().

std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
  ...

  if (auto *declStmt = dynamic_cast<const DeclStmt *>(&stmt))
    return resolveDeclStmt(*declStmt);

  ...
}

First, the VarDecl is resolved, and if the resolution is successful, the variable is inserted into the current scope.

std::unique_ptr<ResolvedDeclStmt>
Sema::resolveDeclStmt(const DeclStmt &declStmt) {
  varOrReturn(resolvedVarDecl, resolveVarDecl(*declStmt.varDecl));

  if (!insertDeclToCurrentScope(*resolvedVarDecl))
    return nullptr;

  return std::make_unique<ResolvedDeclStmt>(declStmt.location,
                                            std::move(resolvedVarDecl));
}

A VarDecl must either have a type annotation or an initializer. If both of them are missing, an error is reported.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  if (!varDecl.type && !varDecl.initializer)
    return report(
        varDecl.location,
        "an uninitialized variable is expected to have a type specifier");

  ...
}

To be able to infer the type, if there is an initializer it has to be resolved first.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  ...

  std::unique_ptr<ResolvedExpr> resolvedInitializer = nullptr;
  if (varDecl.initializer) {
    resolvedInitializer = resolveExpr(*varDecl.initializer);
    if (!resolvedInitializer)
      return nullptr;
  }

  ...
}

The type that needs to be resolved is either the type of the variable or if it's not specified, the type of the initializer.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  ...

  Type resolvableType = varDecl.type.value_or(resolvedInitializer->type);
  std::optional<Type> type = resolveType(resolvableType);

  ...
}

If the type cannot be resolved, or it is void an error is reported to the user.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  ...

  if (!type || type->kind == Type::Kind::Void)
    return report(varDecl.location, "variable '" + varDecl.identifier +
                                        "' has invalid '" +
                                        resolvableType.name + "' type");

  ...
}

Finally, the case is handled when there is both an initializer and a type specifier, but they are of different types. The initializer is also evaluated by the constant expression evaluator.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  ...

  if (resolvedInitializer) {
    if (resolvedInitializer->type.kind != type->kind)
      return report(resolvedInitializer->location, "initializer type mismatch");

    resolvedInitializer->setConstantValue(cee.evaluate(*resolvedInitializer, false));
  }

  ...
}

If everything is successful, the resolved node is returned.

std::unique_ptr<ResolvedVarDecl> Sema::resolveVarDecl(const VarDecl &varDecl) {
  ...

  return std::make_unique<ResolvedVarDecl>(varDecl.location, varDecl.identifier,
                                           *type, varDecl.isMutable,
                                           std::move(resolvedInitializer));
}

With the introduction of variables, the components responsible for code analysis can also be extended. In the tree walk interpreter, it now makes sense to handle DeclRefExpr too. Before the introduction of variables, this expression could only point to parameters and functions, but none of those can be evaluated in compile time.

std::optional<double>
ConstantExpressionEvaluator::evaluate(const ResolvedExpr &expr,
                                      bool allowSideEffects) {
  ...

  if (const auto *declRefExpr =
          dynamic_cast<const ResolvedDeclRefExpr *>(&expr))
    return evaluateDeclRefExpr(*declRefExpr, allowSideEffects);

  ...
}

If a variable is immutable and initialized upon declaration, its value is set by the initializer and doesn't change over time. Because of this, the value of the variable is known in compile time.

std::optional<double> 
ConstantExpressionEvaluator::evaluateDeclRefExpr(const ResolvedDeclRefExpr &dre,
                                                 bool allowSideEffects) {
  const auto *rvd = dynamic_cast<const ResolvedVarDecl *>(dre.decl);
  if (!rvd || rvd->isMutable || !rvd->initializer)
    return std::nullopt;

  return evaluate(*rvd->initializer, allowSideEffects);
}

The control flow graph can also be extended with the new variable declarations, which extension becomes necessary soon.

int CFGBuilder::insertStmt(const ResolvedStmt &stmt, int block) {
  ...

  if (auto *declStmt = dynamic_cast<const ResolvedDeclStmt *>(&stmt))
    return insertDeclStmt(*declStmt, block);

  ...
}

Because the CFG is constructed in reverse order, from bottom to top, first the DeclStmt is inserted, then the initializer if there's one.

int CFGBuilder::insertDeclStmt(const ResolvedDeclStmt &stmt, int block) {
  cfg.insertStmt(&stmt, block);

  if (const auto &init = stmt.varDecl->initializer)
    return insertExpr(*init, block);

  return block;
}

The LLVM IR generation is driven by generateStmt().

llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
  ...

  if (auto *declStmt = dynamic_cast<const ResolvedDeclStmt *>(&stmt))
    return generateDeclStmt(*declStmt);

  ...
}

When a variable is declared, first some space on the stack has to be allocated for it. If it's initialized, the value of the initializer is stored in the variable's stack space. Finally, the stack space of the variable is mapped to its declaration.

llvm::Value *Codegen::generateDeclStmt(const ResolvedDeclStmt &stmt) {
  llvm::Function *function = getCurrentFunction();
  const auto *decl = stmt.varDecl.get();

  llvm::AllocaInst *var = allocateStackVariable(function, decl->identifier);

  if (const auto &init = decl->initializer)
    builder.CreateStore(generateExpr(*init), var);

  declarations[decl] = var;
  return nullptr;
}

Assignment

Now variables can be created and initialized, but their values cannot be changed even if they are mutable. To address this issue, the language must support assignments too.

<assignment>
  ::= <declRefExpr> '=' <expr>

The grammar is constructed such that assignments are easy to parse and analyze later. Since only declared values can be assigned, the LHS is a DeclRefExpr.

Design Note

In more complex languages the LHS of an assignment can be an arbitrary expression as even a.b().c.d = 1; can mean a valid assignment. In this case, it's the responsibility of the semantic analyzer to ensure that the LHS is indeed a variable that can be assigned a new value.

This grammar rule doesn't introduce any new tokens, so the lexer doesn't need to be modified before the assignment can be parsed.

The Assignment node encapsulated the DeclRefExpr on the left-hand side and the arbitrary expression on the right-hand side.

struct Assignment : public Stmt {
  std::unique_ptr<DeclRefExpr> variable;
  std::unique_ptr<Expr> expr;

  Assignment(SourceLocation location,
             std::unique_ptr<DeclRefExpr> variable,
             std::unique_ptr<Expr> expr)
      : Stmt(location),
        variable(std::move(variable)),
        expr(std::move(expr)) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the Assignment includes the assigned variable and the expression whose result is being assigned.

void Assignment::dump(size_t level) const {
  std::cerr << indent(level) << "Assignment:\n";
  variable->dump(level + 1);
  expr->dump(level + 1);
}

Since both an expression and an assignment can start with a DeclRefExpr on the LHS, a dedicated parseAssignmentOrExpr() method is introduced that helps the parser determine which of the two statements to parse. It is called in parseStmt() when the next token doesn't match any of the conditions.

std::unique_ptr<Stmt> Parser::parseStmt() {
  ...

  return parseAssignmentOrExpr();
}

This method first parses a prefix expression and if it's not followed by an = token, it treats this prefix expression as the LHS of a more complex expression and proceeds accordingly. When the complex expression is parsed, the ; at the end is matched and the AST node is returned.

std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
  varOrReturn(lhs, parsePrefixExpr());

  if (nextToken.kind != TokenKind::Equal) {
    varOrReturn(expr, parseExprRHS(std::move(lhs), 0));

    matchOrReturn(TokenKind::Semi, "expected ';' at the end of expression");
    eatNextToken(); // eat ';'

    return expr;
  }

  ...
}

If the next token is =, the parsed expression must be the LHS of an assignment, so it is checked to be a DeclRefExpr.

std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
  ...

  auto *dre = dynamic_cast<DeclRefExpr *>(lhs.get());
  if (!dre)
    return report(lhs->location,
                  "expected variable on the LHS of an assignment");

  ...
}

If the expression is successfully cast to a DeclRefExpr, it's treated as the LHS of an assignment and handed over to parseAssignmentRHS().

std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
  ...
  std::ignore = lhs.release();

  varOrReturn(assignment,
              parseAssignmentRHS(std::unique_ptr<DeclRefExpr>(dre)));

  ...
}

Design Note

The general pattern to dynamic_cast<> a std::unique_ptr without causing a memory leak is to first cast the raw pointer of the managed object, and then transfer the ownership to a new std::unique_ptr of the target type.

The initial casting of the raw pointer ensures that the pointed object is of the correct type. If the cast fails, the std::unique_ptr still keeps its ownership over the pointed object.
auto *dre = dynamic_cast<DeclRefExpr *>(lhs.get());
If the cast succeeds, the ownership of the object must be transferred to a new std::unique_ptr of the target type, which is done by first releasing the object from the old pointer and passing it to the new one.
if (dre) {
  lhs.release();
  std::unique_ptr<DeclRefExpr> ptr(dre);
}
This way the object is always managed by a std::unique_ptr and is not going to cause a memory leak.

The common mistake is to release the ownership of the object before casting.
auto *dre = dynamic_cast<DeclRefExpr *>(lhs.release());
std::unique_ptr<DeclRefExpr> ptr(dre);
This way the object is released from lhs before the cast is performed and if the cast fails, there is no way to get back the pointer to the object, so the object is lost in memory.

After the parsing of the assignment is successful, the trailing ; is matched and the node is returned.

std::unique_ptr<Stmt> Parser::parseAssignmentOrExpr() {
  ...

  matchOrReturn(TokenKind::Semi, "expected ';' at the end of assignment");
  eatNextToken(); // eat ';'

  return assignment;
}

The parseAssignmentRHS() method is similar to parseExpressionRHS() in that it also takes the LHS as a parameter. It then proceeds by eating the = token, parsing the arbitrary expression on the RHS and returning the assignment node.

std::unique_ptr<Assignment>
Parser::parseAssignmentRHS(std::unique_ptr<DeclRefExpr> lhs) {
  SourceLocation location = nextToken.location;
  eatNextToken(); // eat '='

  varOrReturn(rhs, parseExpr());

  return std::make_unique<Assignment>(location, std::move(lhs), std::move(rhs));
}

The resolved assignment node captures the resolved LHS and resolved RHS.

struct ResolvedAssignment : public ResolvedStmt {
  std::unique_ptr<ResolvedDeclRefExpr> variable;
  std::unique_ptr<ResolvedExpr> expr;

  ResolvedAssignment(SourceLocation location,
                     std::unique_ptr<ResolvedDeclRefExpr> variable,
                     std::unique_ptr<ResolvedExpr> expr)
      : ResolvedStmt(location),
        variable(std::move(variable)),
        expr(std::move(expr)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation is identical to its non-resolved counterpart.

void ResolvedAssignment::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedAssignment:\n";
  variable->dump(level + 1);
  expr->dump(level + 1);
}

The resolution of the statement is driven by the resolveStmt() method.

std::unique_ptr<ResolvedStmt> Sema::resolveStmt(const Stmt &stmt) {
  ...

  if (auto *assignment = dynamic_cast<const Assignment *>(&stmt))
    return resolveAssignment(*assignment);

  ...
}

First, the LHS and the RHS of the node is resolved.

std::unique_ptr<ResolvedAssignment>
Sema::resolveAssignment(const Assignment &assignment) {
  varOrReturn(resolvedLHS, resolveDeclRefExpr(*assignment.variable));
  varOrReturn(resolvedRHS, resolveExpr(*assignment.expr));

  ...
}

It's known that the referenced declaration on the LHS cannot be void, as only function declarations are allowed to have a void type, however, if the LHS is a reference to a function declaration, it couldn't have been resolved. Also, function parameters are immutable, so assigning a value to any of them is an error.

std::unique_ptr<ResolvedAssignment>
Sema::resolveAssignment(const Assignment &assignment) {
  ...

  if (dynamic_cast<const ResolvedParamDecl *>(resolvedLHS->decl))
    return report(resolvedLHS->location,
                  "parameters are immutable and cannot be assigned");

  ...
}

If the LHS is not a parameter, it must be a variable. If the type of the variable doesn't match the type of the assigned value, an error is reported. Otherwise, the RHS is tried to be evaluated as a constant expression and the resolved node is returned.

std::unique_ptr<ResolvedAssignment>
  Sema::resolveAssignment(const Assignment &assignment) {
    ...
  
    auto *var = dynamic_cast<const ResolvedVarDecl *>(resolvedLHS->decl);
  
    if (resolvedRHS->type.kind != resolvedLHS->type.kind)
      return report(resolvedRHS->location,
                    "assigned value type doesn't match variable type");
  
    resolvedRHS->setConstantValue(cee.evaluate(*resolvedRHS, false));
  
    return std::make_unique<ResolvedAssignment>(
        assignment.location, std::move(resolvedLHS), std::move(resolvedRHS));
  }

Assignments need to be introduced to the CFG builder for flow-sensitive analysis to work properly.

int CFGBuilder::insertStmt(const ResolvedStmt &stmt, int block) {
  ...

  if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(&stmt))
    return insertAssignment(*assignment, block);

  ...
}

Only the Assignment itself and the assigned expression are inserted into the CFG to avoid ambiguities.

int CFGBuilder::insertAssignment(const ResolvedAssignment &stmt, int block) {
  cfg.insertStmt(&stmt, block);
  return insertExpr(*stmt.expr, block);
}

Design Note

Consider the CFG of x = x; if the DeclRefExpr on the LHS is also inserted.
ResolvedDeclRefExpr: x
ResolvedDeclRefExpr: x
ResolvedAssignment: ...
By traversing the CFG both the LHS and the RHS look the same. This means that in an arbitrary situation, there is no way to know whether a ResolvedDeclRefExpr is reading the value or writing it without checking the next statement. This is also an issue in the Clang CFG, where both sides of an assignment are inserted like this.

As the IR generation of every other statement, the generation of the assignment is also driven by generateStmt().

llvm::Value *Codegen::generateStmt(const ResolvedStmt &stmt) {
  ...

  if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(&stmt))
    return generateAssignment(*assignment);

  ...
}

An assignment is simply a store instruction that places the value of the RHS expression into the stack space of the referenced variable.

llvm::Value *Codegen::generateAssignment(const ResolvedAssignment &stmt) {
  return builder.CreateStore(generateExpr(*stmt.expr),
                             declarations[stmt.variable->decl]);
}

Lazy Initialization

Lazy initialization allows the programmer to declare a variable and initialize it later. This can come in handy if the value of an immutable variable depends on some other variable.

fn calculateAddress(base: number, offset: number, useLegacyMode: number): number {
    let pointerSize: number;

    if useLegacyMode {
        pointerSize = 4;
    } else {
        pointerSize = 8;
    }

    ...
}

This feature can however quickly lead to error-prone code if the compiler doesn't perform proper analysis to prevent it. For example, the value might not be initialized before it is used.

fn calculateAddress(base: number, offset: number, useLegacyMode: number): number {
    let pointerSize: number;

    if useLegacyMode {
        pointerSize = 4;
    }

    // uninitialized if 'useLegacyMode' is false
    println(pointerSize); 

    ...
}

Another problem that can happen is the mutation of an immutable variable in a loop.

fn mutateImmutable(): void {
    let lazy: number;
    
    var i = 2;
    while i > 0 {
        // the first iteration initializes,
        // the second mutates the value
        lazy = i;

        i = i - 1;
    }
}

Intuitively if a variable is not initialized, the first time it appears on the left side of an assignment, it gets initialized, and the second time it gets mutated. Detecting the first and second assignments, however, is not a trivial task as every path of execution needs to be considered.

fn noMutation(): void {
    let lazy: number;
    
    // seemingly infinite loop
    while 1 {
        lazy = 1;

        // exits after the first iteration, no mutation
        return;
    }
}

fn mutation(condition: number): void {
    let lazy: number;
    
    if condition {
        lazy = 1;
    }
    
    // initialization if the condition is false
    // mutation otherwise
    lazy = 2;
}

Performing the analysis with regular AST traversal would prove to be difficult, as there are a lot of variables to keep in mind, like if the loop iterates more than once or branches of nested conditions. There is however a technique called data flow analysis, which can easily reason about this problem.

Data Flow Analysis

Data flow analysis is a technique that collects information about how certain values change over the execution of the program without actually executing it.

Its core is the traversal of the control flow graph of a function until the result of the analysis changes. More formally, data flow analysis is a fixpoint iteration over the CFG of a function.

The most important data structure in data flow analysis is the Lattice. A lattice is an ordered data structure, that stores the information about the program in each basic block. In the case of lazy initialization, it is modelled as a mapping of a declaration to its assignment state.

At a certain point in the program, a variable is either Assigned, Unassigned, both of them, or neither of them. A variable can be both assigned and unassigned after an if statement, where only the true branch assigns a value to it. This state is called Top in the mapping. The neither case happens when a variable is not declared or not found in the lattice, so the analyzer has no information about it. This state is called Bottom.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  enum class State { Bottom, Unassigned, Assigned, Top };

  using Lattice = std::map<const ResolvedVarDecl *, State>;
  ...
}

Notice how the first value in State is Bottom. If a state is not found in the map, one will automatically be default constructed. The default constructor of an enum yields its first value, so Bottom is returned.

At a certain point during the analysis, multiple lattices need to be combined (joined) with each other. In such cases, the states are combined in the following way.

If both State has the same value, their joined value is the same too. E.g.: join(Assigned, Assigned) -> Assigned
If either State is Bottom, the result is the other state. E.g.: join(Bottom, Unassigned) -> Unassigned
If one case is Assigned and the other is Unassigned, the result is Top. E.g.: joint(Assigned, Unassigned) -> Top
If either of the states is Top, the result is Top too. E.g.: join(Unassigned, Top) -> Top

As a result, after performing multiple joins over time, the result either stays the same or reaches the Top state and then keeps staying Top. Formally the result reaches a fixpoint over time.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...

  auto joinStates = [](State s1, State s2) {
    if (s1 == s2)
      return s1;

    if (s1 == State::Bottom)
      return s2;

    if (s2 == State::Bottom)
      return s1;

    return State::Top;
  };

  ...
}

Initially, every CFG block is assigned an empty lattice, which is filled in during the traversal.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
  std::vector<Lattice> curLattices(cfg.basicBlocks.size());
  ...
}

Because the analysis iterates over the CFG multiple times, any reported error can appear more than once. To tackle this problem, instead of directly reporting an error, it is first put into a container and gets processed only after the analysis finishes.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
  std::vector<std::pair<SourceLocation, std::string>> pendingErrors;
  ...
}

As mentioned before the core of the analysis is a fixpoint iteration over the CFG. At the start of each iteration, the changed flag is set to false and the pending errors are cleared.

Then each basic block is visited starting with the entry block and a new temporary lattice is created for the current block. After the statements of the block have been processed, it's checked if the temporary block is different from the previously known block, stored in curLattices.

If the block changed, the changed flag is set to true, so that one more iteration is performed and the temporary block becomes the new known block.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...

  bool changed = true;
  while (changed) {
    changed = false;
    pendingErrors.clear();

    for (int bb = cfg.entry; bb != cfg.exit; --bb) {
      Lattice tmp;
      
      ...

      if (curLattices[bb] != tmp) {
        curLattices[bb] = tmp;
        changed = true;
      }
    }
  }

  ...
}

After the creation of the temporary lattice, the lattices of the predecessor blocks are joined together in it. This way the information at the end of each predecessor block is transferred into the temporary lattice.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
    for (int bb = cfg.entry; bb != cfg.exit; --bb) {
      ...
      
      const auto &[preds, succs, stmts] = cfg.basicBlocks[bb];

      for (auto &&pred : preds)
        for (auto &&[decl, state] : curLattices[pred.first])
          tmp[decl] = joinStates(tmp[decl], state);

      ...
    }
    ...
}

Once the temporary lattice is prepared, the processing of statements in the current block can begin. Since the CFG was built from bottom to top, the statements need to be visited in reverse order.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
    for (int bb = cfg.entry; bb != cfg.exit; --bb) {
      ...

      for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
        const ResolvedStmt *stmt = *it;

        ...
      }

      ...
    }
  ...
}

Upon seeing a DeclStmt, the variable is added to the lattice and the state is set to Assigned or Unassigned depending on whether the variable is initialized or not.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
    for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
      ...

      if (const auto *decl = dynamic_cast<const ResolvedDeclStmt *>(stmt)) {
        tmp[decl->varDecl.get()] =
            decl->varDecl->initializer ? State::Assigned : State::Unassigned;
        continue;
      }

      ...
    }
  ...
}

In the case of an Assignment if the LHS is an immutable variable and the current state is not Unassigned, a pending error is created, as this is the mutation of an immutable value. In every other case, the state of the LHS is set to Assigned.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
    for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
      ...

      if (auto *assignment = dynamic_cast<const ResolvedAssignment *>(stmt)) {
        const auto *var =
            dynamic_cast<const ResolvedVarDecl *>(assignment->variable->decl);

        if (!var->isMutable && tmp[var] != State::Unassigned) {
          std::string msg = '\'' + var->identifier + "' cannot be mutated";
          pendingErrors.emplace_back(assignment->location, std::move(msg));
        }

        tmp[var] = State::Assigned;
        continue;
      }

      ...
    }
  ...
}

The third and last statement the analyzer cares about is the DeclRefExpr. This statement reads the value of the variable, so if the referenced variable is not in the Assigned state, a pending error is created.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...
      for (auto it = stmts.rbegin(); it != stmts.rend(); ++it) {
        ...
        
        if (const auto *dre = dynamic_cast<const ResolvedDeclRefExpr *>(stmt)) {
          const auto *var = dynamic_cast<const ResolvedVarDecl *>(dre->decl);

          if (var && tmp[var] != State::Assigned) {
            std::string msg = '\'' + var->identifier + "' is not initialized";
            pendingErrors.emplace_back(dre->location, std::move(msg));
          }

          continue;
        }
      }
  ...
}

Finally when the lattices no longer change, the pending errors are reported and the function returns if it found any errors.

bool Sema::checkVariableInitialization(const ResolvedFunctionDecl &fn,
                                       const CFG &cfg) {
  ...

  for (auto &&[loc, msg] : pendingErrors)
    report(loc, msg);

  return !pendingErrors.empty();
}

Along with the other flow-sensitive check, this analysis is also invoked in runFlowSensitiveChecks().

bool Sema::runFlowSensitiveChecks(const ResolvedFunctionDecl &fn) {
  ...
  error |= checkVariableInitialization(fn, cfg);

  ...
};