Handling Arithmetics

To be able to handle basic arithmetics, operators like +, -, * and / need to be introduced. These operators also have a precedence level, which the parser needs to be aware of. It matters whether the expression 2 + 3 * 4 is parsed as (2 + 3) * 4 or 2 + (3 * 4). An additional primary expression, the grouping expression ( expression ) is also required to allow the programmer to override the original precedence.

Operators are expressed in the grammar such that a new non-terminal is introduced for every precedence level and each of these non-terminals expands to the next precedence level in an increasing order from lowest to highest.

    Precedence    |       Type       |     Symbols
───────────────────────────────────────────────────────
      Highest     │  Multiplicative  │       *,/   
      Lowest      │     Additive     |       +,-
<expr>
    ::= <additiveExpression>

<additiveExpression>
    ::= <multiplicativeExpression> (('+' | '-') <multiplicativeExpression>)*

<multiplicativeExpression>
    ::= <primaryExpr> (('*' | '/') <primaryExpr>)*

The purpose of putting the lowest precedence expression to the top and the highest to the bottom is that it's always the left-most non-terminal expanded first, so after the first primaryExpression is parsed, the parser will immediately look for the highest precedence operator first and as it backtracks, it checks the lowest one last.

Expression: 2 + 3

1.) go down the stack, parse the primary
|  parseExpr()                                   tok: 2
|    <mul> (('+' | '-') <mul>)*
|  
|    parseMul()
|      <pri> (('*' | '/') <pri>)*    
|    
|      parsePri()                                eat: 2
|        '2'                                     tok: +

2.) go up the stack, match the operator
|      return from parsePri                
|      
|      '2'   (('*' | '/') <pri>)*
|    return from parseMul
|  
|    '2'   (('+' | '-') <mul>)*                  eat: +
|  ...         

Operator Precedence Parsing

There are however a few drawbacks of using a recursive descent parser to parse operators. In production used languages there are more than 2 precedence levels. In Kotlin for example, there are 15. Creating and backtracking 15 stack frames for every operator has a significant impact on performance hence a different approach is used.

Other languages, like Haskell support user-defined operators with custom precedence level and associativity. In a recursive descent parser implementing such a feature is tricky, as it requires adding new functions during runtime and modifying which functions are being called next.

Operator precedence parsing is an algorithm that can address both of the above drawbacks. Also, this parsing technique can easily be combined with recursive descent.

Before implementing this parser, however, the lexer needs to be extended with the new operators.

constexpr char singleCharTokens[] = {..., '+', '-', '*'};

enum class TokenKind : char {
  ...
  Slash,

  ...

  Plus = singleCharTokens[8],
  Minus = singleCharTokens[9],
  Asterisk = singleCharTokens[10]
};

Since comments begin with // the lexer now needs to be able to differentiate between / and //.

Token Lexer::getNextToken() {
  ...

  if (currentChar == '/') {
    if (peekNextChar() != '/')
      return Token{tokenStartLocation, TokenKind::Slash};

    ...
  }

  ...
}

In the AST each of these operators is represented by one single node, which stores their left and right side and the kind of the token that represents the operator.

struct BinaryOperator : public Expr {
  std::unique_ptr<Expr> lhs;
  std::unique_ptr<Expr> rhs;
  TokenKind op;

  BinaryOperator(SourceLocation location,
                 std::unique_ptr<Expr> lhs,
                 std::unique_ptr<Expr> rhs,
                 TokenKind op)
      : Expr(location),
        lhs(std::move(lhs)),
        rhs(std::move(rhs)),
        op(op) {}

  void dump(size_t level = 0) const override;
};

Design Note

A more elegant representation for operators is creating a dedicated enum of the possible kinds instead of reusing the tokens, but for now, the compiler is kept as simple as possible.

The textual representation of BinaryOperator contains its operands and the stored operator.

void BinaryOperator::dump(size_t level) const {
  std::cerr << indent(level) << "BinaryOperator: '" << getOpStr(op) << '\''
            << '\n';

  lhs->dump(level + 1);
  rhs->dump(level + 1);
}

To print the correct operator, a dedicated helper is created that maps the operator tokens to their textual representations.

std::string_view getOpStr(TokenKind op) {
  if (op == TokenKind::Plus)
    return "+";
  if (op == TokenKind::Minus)
    return "-";
  if (op == TokenKind::Asterisk)
    return "*";
  if (op == TokenKind::Slash)
    return "/";

  llvm_unreachable("unexpected operator");
}

The idea is to parse the LHS of an operator first and then parse each of the following operators together with their RHS. In the case of 2 + 3 * 4, it means parsing 2, then + 3 and finally * 4.

When an operator and its RHS are parsed, it needs to be decided whether that RHS is really the RHS of that operator or the LHS of the upcoming one. This is done by checking whether the precedence of the upcoming operator is higher than the precedence of the current one.

For example in 2 + 3 * 4 the RHS of + is 3. However keeping 3 as the RHS here would mean, that the expression is parsed as (2 + 3) * 4. Instead, it needs to be considered as the LHS of the upcoming *, because * has a higher precedence than +, so that the expression is parsed correctly as 2 + (3 * 4).

┌─────────────────────────────────────────────────────┐
|         Operator         |        Precedence        |
├─────────────────────────────────────────────────────┤
|           *, /           |             6            |
|           +, -           |             5            |
└─────────────────────────────────────────────────────┘
Expression: 2 + 3 * 4

1.) parse LHS outside the operator precedence parser

  2 + 3 * 4 ;
  ^

2.) invoke the parser with LHS being 2 and parse the 
    next OP and its RHS

  OP:   (precedence: 0) LHS: 2 RHS:  
  2 + 3 * 4 ;
    ^

  OP: + (precedence: 5) LHS: 2 RHS:  
  2 + 3 * 4 ;
      ^

3.) check the precedence of the upcoming OP 

  OP: + (precedence: 5) LHS: 2 RHS: 3
  2 + 3 * 4 ;
        ^ '*' has a higher precedence (6) than '+' (5)

     4.) halt the parsing of '+' and invoke the parser 
         recursively with LHS set to 3 and the current
         precedence level increased by one

       OP:   (precedence: 6) LHS: 3 RHS:  
       2 + 3 * 4 ;
             ^

     5.) parse the next OP and it's RHS

       OP: * (precedence: 6) LHS: 3 RHS:  
       2 + 3 * 4 ;
               ^
       
       OP: * (precedence: 6) LHS: 3 RHS: 4  
       2 + 3 * 4 ;
                 ^ return BinOp('*', 3, 4)

6.) resume the parsing of '+' and make the returned 
    BinOp it's rhs

  OP: + (precedence: 5) LHS: 2 RHS: 3 * 4
  2 + 3 * 4 ;
            ^ return BinOp('+', 2, BinOp('*', 3, 4))

To implement this logic, first, a helper is defined that returns the token precedence.

int getTokPrecedence(TokenKind tok) {
  switch (tok) {
  case TokenKind::Asterisk:
  case TokenKind::Slash:
    return 6;
  case TokenKind::Plus:
  case TokenKind::Minus:
    return 5;
  default:
    return -1;
  }
}

Design Note

The precedence table is hardcoded right now to make retrieving the precedence as quick as possible.

If a language however supports user-defined operators, the precedence table needs to be modified during the parsing of the file. In such cases, an std::unordered_map or similar data structure might be used.

The parseExpr() function houses the driver of the operator precedence parser. It first parses the initial primary expression and then parses the RHS for it. The parseExprRHS() houses the core logic of the operator precedence parser and as a result, it needs to know about the precedence of the current operator. Since initially there is no operator, it's given 0 as the precedence.

std::unique_ptr<Expr> Parser::parseExpr() {
  varOrReturn(lhs, parsePrimary());
  return parseExprRHS(std::move(lhs), 0);
}

The core of the parser is a loop, that keeps eating the operators and their RHS expressions until it finds an operator with a precedence less than the initial precedence.

std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
                                           int precedence) {
  while (true) {
    Token op = nextToken;
    int curOpPrec = getTokPrecedence(op.kind);

    if (curOpPrec < precedence)
      return lhs;

    ...
  }
}

If the precedence of the current operator is higher than or equal to the initial one, the RHS is parsed and the precedence of the upcoming operator is checked. If the precedence of the upcoming operator is higher, the current primary expression is the LHS of that operator, so it' RHS is parsed recursively. Otherwise, create a BinaryOperator node and make it the current LHS.

std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
                                           int precedence) {
  while (true) {
    ...
    eatNextToken(); // eat operator

    varOrReturn(rhs, parsePrimary());

    if (curOpPrec < getTokPrecedence(nextToken.kind)) {
      rhs = parseExprRHS(std::move(rhs), curOpPrec + 1);
      if (!rhs)
        return nullptr;
    }

    lhs = std::make_unique<BinaryOperator>(op.location, std::move(lhs),
                                           std::move(rhs), op.kind);
  }
}

Notice how parseExprRHS() is called recursively with curOpPrec + 1. The + 1 there controls the associativity of the current operator, with it present, the operator is left associative, without it it's right-associative. So far all of the supported operators are left-associative.

Expression: 2 + 3 * 4 + 5

// Assume left-associativity
1.) parseExpr()
(2 + 3 * 4 + 5
 ^

2.) LHS = parsePrimary() == 2
(2 + 3 * 4 + 5
   ^

    3.) parseExprRHS(LHS: 2, precedence: 0)
    (2 + 3 * 4 + 5
       ^ 'curTokPrec' (5) >= precedence (0)
    
    4.) eat operator, RHS = parsePrimary() == 3
    (2 + 3 * 4 + 5
           ^ 'curTokPrec' (5) < 'nextTokPrec' (6)
    
        5.) RHS = parseExprRHS(LHS: 3, precedence: 'curTokPrec' + 1 == 6)
        (2 + (3 * 4 + 5
                ^ 'curTokPrec' (6) >= precedence (6)
        
        6.) eat operator, RHS = parsePrimary() == 4
        (2 + (3 * 4) + 5
                     ^ 'curTokPrec' (5) < precedence (6), return

    7.) loop
    ((2 + (3 * 4)) + 5
                   ^ 'curTokPrec' (5) >= precedence (0)

    8.) eat operator, RHS = parsePrimary() == 5
    ((2 + (3 * 4)) + 5)
                       ^ out of tokens, return

// Assume right-associativity
1.) parseExpr()
(2 + 3 * 4 + 5
 ^

2.) LHS = parsePrimary() == 2
(2 + 3 * 4 + 5
   ^

    3.) parseExprRHS(LHS: 2, precedence: 0)
    (2 + 3 * 4 + 5
       ^ 'curTokPrec' (5) >= precedence (0)
    
    4.) eat operator, RHS = parsePrimary() == 3
    (2 + 3 * 4 + 5
           ^ 'curTokPrec' (5) < 'nextTokPrec' (6)
    
        5.) RHS = parseExprRHS(LHS: 3, precedence: 'curTokPrec' == 5)
        (2 + (3 * 4 + 5
                ^ 'curTokPrec' (6) >= precedence (5)
        
        6.) eat operator, RHS = parsePrimary() == 4
        (2 + ((3 * 4) + 5
                      ^ 'curTokPrec' (5) >= precedence (5)

        7.) eat operator, RHS = parsePrimary() == 5
        (2 + ((3 * 4) + 5)
                          ^ out of tokens, return
    8.) create binop with the returned RHS
    (2 + ((3 * 4) + 5))
                       ^

Design Note

A language can support both left and right-associative operators at the same time. C++ is one example, where the arithmetic operators are left associative, while assignments are right associative.

((1 + 2) + 3)

(x = (y = 0))

With operator precedence parsing it's simple to switch the associativity even in the middle of parsing an expression. It only needs to be ensured that in the recursive call, the precedence is not incremented and the recursive call is made even if the next operator has the same precedence level as the current one.

while (true) {
  ...
  
  bool isRightAssoc = isRightAssoc(op);
  int nextOpPrec = getTokPrecedence(nextToken.kind);

  if ((curOpPrec < nextOpPrec) || 
      (curOpPrec == nextOpPrec && isRightAssoc)) {
    rhs = parseExprRHS(std::move(rhs), curOpPrec + !isRightAssoc);
    if (!rhs)
      return nullptr;
  }

  ...
}

Adding Unary Operators

Unary operators can have a prefix or postfix form. Currently, only the prefix unary negation is supported.

<prefixExpression>
  ::= '-'* <postfixExpression>

The AST node of a unary operator is similar to a binary one in that its operand is stored and the token kind is used to represent the operator.

struct UnaryOperator : public Expr {
  std::unique_ptr<Expr> operand;
  TokenKind op;

  UnaryOperator(SourceLocation location,
                std::unique_ptr<Expr> operand,
                TokenKind op)
      : Expr(location),
        operand(std::move(operand)),
        op(op) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the node also contains both the operator and its operand.

void UnaryOperator::dump(size_t level) const {
  std::cerr << indent(level) << "UnaryOperator: '" << getOpStr(op) << '\''
            << '\n';

  operand->dump(level + 1);
}

Prefix operators have higher precedence than any of the binary operators and are parsed in their dedicated method.

std::unique_ptr<Expr> Parser::parsePrefixExpr() {
  Token tok = nextToken;

  if (tok.kind != TokenKind::Minus)
    return parsePostfixExpr();
  eatNextToken(); // eat '-'

  varOrReturn(rhs, parsePrefixExpr());

  return std::make_unique<UnaryOperator>(tok.location, std::move(rhs),
                                         tok.kind);
}

With this addition, parseExpr() is modified to call parsePrefixExpr() instead of parsePrimary() when parsing the supposed LHS.

std::unique_ptr<Expr> Parser::parseExpr() {
  varOrReturn(lhs, parsePrefixExpr());
  ...
}

parseExprRHS() is also updated to parse a prefix expression when it expects an RHS.

std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
                                           int precedence) {
  while (true) {
    ...

    varOrReturn(rhs, parsePrefixExpr());

    ...
  }
}

Processing Operators

The resolved node of a BinaryOperator is identical to its non-resolved counterpart.

struct ResolvedBinaryOperator : public ResolvedExpr {
  TokenKind op;
  std::unique_ptr<ResolvedExpr> lhs;
  std::unique_ptr<ResolvedExpr> rhs;

  ResolvedBinaryOperator(SourceLocation location,
                         TokenKind op,
                         std::unique_ptr<ResolvedExpr> lhs,
                         std::unique_ptr<ResolvedExpr> rhs)
      : ResolvedExpr(location, lhs->type),
        op(op),
        lhs(std::move(lhs)),
        rhs(std::move(rhs)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation is also the same.

void ResolvedBinaryOperator::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedBinaryOperator: '" << getOpStr(op)
            << '\'' << '\n';

  lhs->dump(level + 1);
  rhs->dump(level + 1);
}

Similarly, a ResolvedUnaryOperator also stores its operand and kind, like its AST node.

struct ResolvedUnaryOperator : public ResolvedExpr {
  TokenKind op;
  std::unique_ptr<ResolvedExpr> operand;

  ResolvedUnaryOperator(SourceLocation location,
                        TokenKind op,
                        std::unique_ptr<ResolvedExpr> operand)
      : ResolvedExpr(location, operand->type),
        op(op),
        operand(std::move(operand)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation also includes both of these fields.

void ResolvedUnaryOperator::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedUnaryOperator: '" << getOpStr(op)
            << '\'' << '\n';

  if (auto val = getConstantValue())
    std::cerr << indent(level) << "| value: " << *val << '\n';

  operand->dump(level + 1);
}

The resolution of both operators is driven by resolveExpr().

std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
  ...

  if (const auto *binaryOperator = dynamic_cast<const BinaryOperator *>(&expr))
    return resolveBinaryOperator(*binaryOperator);

  if (const auto *unaryOperator = dynamic_cast<const UnaryOperator *>(&expr))
    return resolveUnaryOperator(*unaryOperator);

  ...
}

For the unary operator, its operand is resolved first. If the resolution succeeds and the type of the operand is not void the corresponding resolved node is returned. Otherwise, an error is reported.

std::unique_ptr<ResolvedUnaryOperator>
Sema::resolveUnaryOperator(const UnaryOperator &unary) {
  varOrReturn(resolvedRHS, resolveExpr(*unary.operand));

  if (resolvedRHS->type.kind == Type::Kind::Void)
    return report(
        resolvedRHS->location,
        "void expression cannot be used as an operand to unary operator");

  return std::make_unique<ResolvedUnaryOperator>(unary.location, unary.op,
                                                 std::move(resolvedRHS));
}

For binary operators, both the LHS and the RHS are resolved. If either of them is of void type an error is reported. Otherwise, it should be checked that the two operands have the same type, but in this case, when neither of the operands is void this is guaranteed by the type system.

std::unique_ptr<ResolvedBinaryOperator>
Sema::resolveBinaryOperator(const BinaryOperator &binop) {
  varOrReturn(resolvedLHS, resolveExpr(*binop.lhs));
  varOrReturn(resolvedRHS, resolveExpr(*binop.rhs));

  if (resolvedLHS->type.kind == Type::Kind::Void)
    return report(
        resolvedLHS->location,
        "void expression cannot be used as LHS operand to binary operator");

  if (resolvedRHS->type.kind == Type::Kind::Void)
    return report(
        resolvedRHS->location,
        "void expression cannot be used as RHS operand to binary operator");

  return std::make_unique<ResolvedBinaryOperator>(
      binop.location, std::move(resolvedLHS), std::move(resolvedRHS), binop.op);
}

Code generation is driven by generateExpr().

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  ...

  if (auto *binop = dynamic_cast<const ResolvedBinaryOperator *>(&expr))
    return generateBinaryOperator(*binop);

  if (auto *unop = dynamic_cast<const ResolvedUnaryOperator *>(&expr))
    return generateUnaryOperator(*unop);

  ...
}

The IRBuilder API provides a function for all of the common unary and binary operators, which makes generating code for operators simple and convenient.

For unary operators, the value of their sub-expression is generated first and the corresponding operation is performed on the value.

llvm::Value *Codegen::generateUnaryOperator(const ResolvedUnaryOperator &unop) {
  llvm::Value *rhs = generateExpr(*unop.rhs);

  if (unop.op == TokenKind::Minus)
    return builder.CreateFNeg(rhs);

  llvm_unreachable("unknown unary op");
  return nullptr;
}

Code generation for binary operators is a similar process. Both sides are generated first then the corresponding operation is inserted.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  TokenKind op = binop.op;

  llvm::Value *lhs = generateExpr(*binop.lhs);
  llvm::Value *rhs = generateExpr(*binop.rhs);

  if (op == TokenKind::Plus)
    return builder.CreateFAdd(lhs, rhs);

  if (op == TokenKind::Minus)
    return builder.CreateFSub(lhs, rhs);

  if (op == TokenKind::Asterisk)
    return builder.CreateFMul(lhs, rhs);

  if (op == TokenKind::Slash)
    return builder.CreateFDiv(lhs, rhs);

  llvm_unreachable("unexpected binary operator");
  return nullptr;
}

Grouping Expression

Grouping expression is the last primary expression supported by the language. In the grammar, it is a wrapper around an arbitrary expression.

<primaryExpression>
  ::= ...
  |   '(' <expr> ')'

The AST node representing them is also a wrapper around an Expr node.

struct GroupingExpr : public Expr {
  std::unique_ptr<Expr> expr;

  GroupingExpr(SourceLocation location, std::unique_ptr<Expr> expr)
      : Expr(location),
        expr(std::move(expr)) {}

  void dump(size_t level = 0) const override;
};

Its textual representation is the name of the node and its enclosed expression.

void GroupingExpr::dump(size_t level) const {
  std::cerr << indent(level) << "GroupingExpr:\n";

  expr->dump(level + 1);
}

Grouping expressions are parsed directly in parsePrimary().

std::unique_ptr<Expr> Parser::parsePrimary() {
  ...

  if (nextToken.kind == TokenKind::Lpar) {
    eatNextToken(); // eat '('

    varOrReturn(expr, parseExpr());

    matchOrReturn(TokenKind::Rpar, "expected ')'");
    eatNextToken(); // eat ')'

    return std::make_unique<GroupingExpr>(location, std::move(expr));
  }

  ...
}

The resolved node of the expression is also a wrapper around a ResolvedExpr, similar to the non-resolved one.

struct ResolvedGroupingExpr : public ResolvedExpr {
  std::unique_ptr<ResolvedExpr> expr;

  ResolvedGroupingExpr(SourceLocation location,
                       std::unique_ptr<ResolvedExpr> expr)
      : ResolvedExpr(location, expr->type),
        expr(std::move(expr)) {}

  void dump(size_t level = 0) const override;
};

The textual representation of the resolved node is also identical to its non-resolved counterpart.

void ResolvedGroupingExpr::dump(size_t level) const {
  std::cerr << indent(level) << "ResolvedGroupingExpr:\n";

  expr->dump(level + 1);
}

As for every other expression, the resolution of the grouping expressions is also driven by resolveExpr().

std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
  ...

  if (const auto *groupingExpr = dynamic_cast<const GroupingExpr *>(&expr))
    return resolveGroupingExpr(*groupingExpr);

  ...
}

To resolve a GroupingExpr, only its wrapped expression needs to be resolved.

std::unique_ptr<ResolvedGroupingExpr>
Sema::resolveGroupingExpr(const GroupingExpr &grouping) {
  varOrReturn(resolvedExpr, resolveExpr(*grouping.expr));
  return std::make_unique<ResolvedGroupingExpr>(grouping.location,
                                                std::move(resolvedExpr));
}

Code generation is identical to the resolution process, the IR is generated for the wrapped expression.

llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
  ...

  if (auto *grouping = dynamic_cast<const ResolvedGroupingExpr *>(&expr))
    return generateExpr(*grouping->expr);

  ...
}

Adding More Operators

Arithmetic operators are handled, but a group of logical operators is still missing. These missing operators include ==, &&, ||, <, > and !. Each of these operators is also a new token in the lexer.

constexpr char singleCharTokens[] = {..., '<', '>', '!'};

enum class TokenKind : char {
  ...

  EqualEqual,
  AmpAmp,
  PipePipe,

  ...
  Lt = singleCharTokens[11],
  Gt = singleCharTokens[12],
  Excl = singleCharTokens[13],
};

The single-character tokens are lexed automatically, but for the two-character ones, the characters must be matched manually.

Token Lexer::getNextToken() {
  ...

  if (currentChar == '=' && peekNextChar() == '=') {
    eatNextChar();
    return Token{tokenStartLocation, TokenKind::EqualEqual};
  }

  if (currentChar == '&' && peekNextChar() == '&') {
    eatNextChar();
    return Token{tokenStartLocation, TokenKind::AmpAmp};
  }

  if (currentChar == '|' && peekNextChar() == '|') {
    eatNextChar();
    return Token{tokenStartLocation, TokenKind::PipePipe};
  }

  ...
}

The parsing of the binary operators is also automatically handled, once the precedence table is filled in.

int getTokPrecedence(TokenKind tok) {
  switch (tok) {
  ...
  case TokenKind::Lt:
  case TokenKind::Gt:
    return 4;
  case TokenKind::EqualEqual:
    return 3;
  case TokenKind::AmpAmp:
    return 2;
  case TokenKind::PipePipe:
    return 1;
  ...
  }
}

For the unary !, the corresponding parser method needs to be taught about the new token.

std::unique_ptr<Expr> Parser::parsePrefixExpr() {
  ...

  if (tok.kind != TokenKind::Excl && tok.kind != TokenKind::Minus)
    return parsePostfixExpr();
  eatNextToken(); // eat '!' or '-'

  ...
}

The semantic analyzer is not affected by the change, however, the code generator needs to be taught what instructions it should generate for each of the new operators.

Because these operators are supposed to return logical values, but there is no dedicated logical type in the language, they behave the same way as they do in C. They take and return numeric values and if the value is 0, it is considered logically false, otherwise the value is logically true.

LLVM IR however has a dedicated 1-bit integer value, which most of the conditional instructions take as a parameter. To be able to work with these instructions, two helper methods are introduced to convert between numeric and logical values.

For a conversion to a logical value, it's checked if the given numeric value is not equal to 0 and the result of the comparison is returned.

llvm::Value *Codegen::doubleToBool(llvm::Value *v) {
  return builder.CreateFCmpONE(
      v, llvm::ConstantFP::get(builder.getDoubleTy(), 0.0), "to.bool");
}

For the conversion back to double a simple cast instruction is inserted.

llvm::Value *Codegen::boolToDouble(llvm::Value *v) {
  return builder.CreateUIToFP(v, builder.getDoubleTy(), "to.double");
}

The ! operator negates the logical value of its operand, so the operand is converted to bool before the negation is performed. Once the value is negated, it is promoted back to a numeric value, because the language can only work with numbers.

llvm::Value *Codegen::generateUnaryOperator(const ResolvedUnaryOperator &unop) {
  ...

  if (unop.op == TokenKind::Excl)
    return boolToDouble(builder.CreateNot(doubleToBool(rhs)));

  ...
}

The ==, <, > binary operators also have a corresponding instruction in the IRBuilder which can be used similarly, between converting the operand to bool and promoting it back to double.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  ...

  if (op == TokenKind::Lt)
    return boolToDouble(builder.CreateFCmpOLT(lhs, rhs));

  if (op == TokenKind::Gt)
    return boolToDouble(builder.CreateFCmpOGT(lhs, rhs));

  if (op == TokenKind::EqualEqual)
    return boolToDouble(builder.CreateFCmpOEQ(lhs, rhs));

  ...
}

All of the previously seen comparison functions end with three characters starting with O. This O means that an ordered comparison is performed. For example, OLT means Ordered Less Than.

LLVM also supports an unordered counterpart for each of the instructions like FCmpULT. The difference between the two is that the unordered version also takes NaN (Not a Number) values into account during the comparison, while the ordered version automatically returns false if either of the values is NaN.

One last change to make is making these operators appear correctly in the AST and resolved tree dumps, so the getOpStr() function is extended to handle them.

std::string_view getOpStr(TokenKind op) {
  ...
  if (op == TokenKind::EqualEqual)
    return "==";
  if (op == TokenKind::AmpAmp)
    return "&&";
  if (op == TokenKind::PipePipe)
    return "||";
  if (op == TokenKind::Lt)
    return "<";
  if (op == TokenKind::Gt)
    return ">";
  if (op == TokenKind::Excl)
    return "!";
  ...
}

Conditional Binary Operators

The conditional binary operators like && and || on the other hand need special handling because these operators are evaluated lazily.

For && first the left operand is evaluated and the right operand is only evaluated if the result is true. If the left operand is false, it's already known that the result of the operator is false, so there is no need to evaluate the right operand.

For || it's the opposite. If the first operand is true, the result of the operator is true, so the right operand doesn't have to be evaluated.

This is important to keep in mind since the operands can have side effects too. In the following program for example nothing should be printed to the standard output.

fn sideEffect(x: number): number {
    println(x);
    return x;
}

fn true(): number {
    return 1;
}

fn false(): number {
    return 0;
}

fn main(): void {
    true() || sideEffect(0);
    false() && sideEffect(1);
}

A different way to look at true() || sideEffect(0) is through the following C++ snippet.

auto lhs = true();

if(!lhs) {
  auto rhs = sideEffect(0);
  return rhs;
}

return true;

In LLVM IR this is modelled by introducing a different basic block for the right operand.

entry:
┌────────────────────────────────────────────────┐
│ %0 = call double @true()                       │
│ %to.bool = fcmp one double %0, 0.000000e+00    │
│ br i1 %to.bool, label %or.merge, label %or.rhs │
└────────────────────────────────────────────────┘
              |                         |
              |                         lhs false
              |             or.rhs      |
              |             ┌───────────────────────────────────────────────────┐
              |             │ %1 = call double @sideEffect(double 0.000000e+00) │
              lhs true      | %to.bool1 = fcmp one double %1, 0.000000e+00      │
              |             | br label %or.merge                                │
              |             └───────────────────────────────────────────────────┘
              |                         |
              |                         |
or.merge:     |                         |
┌──────────────────────────────────────────────────────┐
│ %2 = phi i1 [ %to.bool1, %or.rhs ], [ true, %entry ] │
│ ...                                                  │
└──────────────────────────────────────────────────────┘

So far implementing this seems simple, but it turns into a not-so-trivial problem once multiple operators are chained together. Consider the following example and how it's block structure should look like.

x || y && z || w
┌───┐
│ x │──F──┐
└───┘     │
  │     ┌───┐
  │     │ y │──T──┐
  │     └───┘     │
  │       │     ┌───┐
  T       │     │ z │──F──┐
  │       F     └───┘     │
  │       │       │     ┌───┐
  │       │       T     │ w │
  │       │       │     └───┘
  │      ┌─────────┐      │
  └──────│  merge  │──────┘
         └─────────┘

Before deciding how to generate something like this, it is worth taking a quick look at some examples and where each operand should lead.

x || y
     ┌────┐               ┌────┐     
  ┌──│ || │──┐            │ || │     
  │  └────┘  │            └────┘     
┌───┐      ┌───┐     ┌───┐      ┌───┐
│ x │      │ y │     │ x │─────>│ y │
└───┘      └───┘     └───┘      └───┘
                       │          │
x || y || z
           ┌────┐                     ┌────┐    
        ┌──│ || │──┐                  │ || │
        │  └────┘  │                  └────┘ 
     ┌────┐      ┌───┐          ┌────┐      ┌───┐
  ┌──│ || │──┐   │ z │          │ || │      │ z │
  │  └────┘  │   └───┘          └────┘      └───┘
┌───┐      ┌───┐           ┌───┐      ┌───┐   ^  
│ x │      │ y │           │ x │─────>│ y │───┘
└───┘      └───┘           └───┘      └───┘      
                             │          │
x || y && z || w
           ┌────┐                    ┌────┐
        ┌──│ || │──┐                 │ || │   
        │  └────┘  │                 └────┘   
     ┌────┐      ┌───┐         ┌────┐      ┌───┐
  ┌──│ || │──┐   │ w │         │ || │      │ w │
  │  └────┘  │   └───┘         └────┘      └───┘
┌───┐      ┌────┐         ┌───┐      ┌────┐  ^
│ x │   ┌──│ && │──┐      │ x │───┐  │ && │  │
└───┘   │  └────┘  │      └───┘   V  └────┘  │
      ┌───┐      ┌───┐      │   ┌───┐      ┌───┐
      │ y │      │ z │      │   │ y │─────>│ z │
      └───┘      └───┘      │   └───┘      └───┘
                            │     │          │
    

The observation is that from the basic block of each operand, one edge should lead to the merge block and the other one should lead to the block containing the next operand on the right in the AST.

The idea is to create a function that traverses the AST of a conditional binary operator and receives the two blocks the execution should jump to when the current operand evaluates to true and false.

void Codegen::generateConditionalOperator(const ResolvedExpr &op,
                                          llvm::BasicBlock *trueBB,
                                          llvm::BasicBlock *falseBB) {
  const auto *binop = dynamic_cast<const ResolvedBinaryOperator *>(&op);
  
  if (binop)
    return;
  
  llvm::Value *val = doubleToBool(generateExpr(op));
  builder.CreateCondBr(val, trueBB, falseBB);
};

When a conditional binary operator is visited, a new block is created for the RHS, but this new block needs to be treated differently for || and &&.

In the case of ||, this block is visited if the left operand is false, so it's called or.lhs.false and for && the block is visited if the LHS evaluates to true, so its name is and.lhs.true. Each of these blocks is then inserted into the current function.

void Codegen::generateConditionalOperator(const ResolvedExpr &op,
                                          llvm::BasicBlock *trueBB,
                                          llvm::BasicBlock *falseBB) {
  llvm::Function *function = getCurrentFunction();
  ...
  
  if (binop && binop->op == TokenKind::PipePipe) {
    llvm::BasicBlock *nextBB =
        llvm::BasicBlock::Create(context, "or.lhs.false", function);
    
    ...
    
    return;
  }

  if (binop && binop->op == TokenKind::AmpAmp) {
    llvm::BasicBlock *nextBB =
        llvm::BasicBlock::Create(context, "and.lhs.true", function);

    ...

    return;
  }

  ...

};

The getCurrentFunction() helper extracts the current function out of the builder.

llvm::Function *Codegen::getCurrentFunction() {
  return builder.GetInsertBlock()->getParent();
};

With the new block created, the LHS of the operator can be visited by making this new block the next false block for || and the next true block for &&.

void Codegen::generateConditionalOperator(const ResolvedExpr &op,
                                          llvm::BasicBlock *trueBB,
                                          llvm::BasicBlock *falseBB) {
  ...
  
  if (binop && binop->op == TokenKind::PipePipe) {
    ...
    generateConditionalOperator(*binop->lhs, trueBB, nextBB);
    ...
  }

  if (binop && binop->op == TokenKind::AmpAmp) {
    ...
    generateConditionalOperator(*binop->lhs, nextBB, falseBB);
    ...
  }

  ...

};

After the instructions for the LHS are generated, the insertion point is set to nextBB, so the next operand on the right is generated into this new block when it is visited.

void Codegen::generateConditionalOperator(const ResolvedExpr &op,
                                          llvm::BasicBlock *trueBB,
                                          llvm::BasicBlock *falseBB) {
  ...
  
  if (binop && binop->op == TokenKind::PipePipe) {
    ...

    builder.SetInsertPoint(nextBB);
    generateConditionalOperator(*binop->rhs, trueBB, falseBB);
    return;
  }

  if (binop && binop->op == TokenKind::AmpAmp) {
    ...

    builder.SetInsertPoint(nextBB);
    generateConditionalOperator(*binop->rhs, trueBB, falseBB);
    return;
  }

  ...

};

This function is initially called from generateBinaryOperator(), which first checks if the operator is an || or an && and creates the initial true and false blocks.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  TokenKind op = binop.op;

  if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
    llvm::Function *function = getCurrentFunction();
    bool isOr = op == TokenKind::PipePipe;

    auto *rhsTag = isOr ? "or.rhs" : "and.rhs";
    auto *mergeTag = isOr ? "or.merge" : "and.merge";

    auto *rhsBB = llvm::BasicBlock::Create(context, rhsTag, function);
    auto *mergeBB = llvm::BasicBlock::Create(context, mergeTag, function);

    ...
  }

  ...
}

If the current operator is an ||, the true block is the merge block, while the false block is the RHS block. Remember, when the left operand is true, the right one is not evaluated. In the case of &&, the order is the opposite.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  ...

  if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
    ...

    llvm::BasicBlock *trueBB = isOr ? mergeBB : rhsBB;
    llvm::BasicBlock *falseBB = isOr ? rhsBB : mergeBB;
    generateConditionalOperator(*binop.lhs, trueBB, falseBB);

    ...
  }

  ...
}

Then the RHS is visited and a jump to the merge block is inserted into the current block.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  ...

  if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
    ...

    builder.SetInsertPoint(rhsBB);
    llvm::Value *rhs = doubleToBool(generateExpr(*binop.rhs));
    builder.CreateBr(mergeBB);

    ...
  }

  ...
}

Since generateExpr() can change the current block, rhsBB needs to be retrieved once again (consider x || (y && z) where the insertion point after the function is run is the and.merge block). Then the insertion point is set to the merge block and a phi node is created.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  ...

  if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
    ...

    rhsBB = builder.GetInsertBlock();
    builder.SetInsertPoint(mergeBB);
    llvm::PHINode *phi = builder.CreatePHI(builder.getInt1Ty(), 2);

    ...
  }

  ...
}

From rhsBB the incoming value of the result is rhs, while from every other block, it's true in case of || and false for &&.

llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
  ...

  if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
    ...

    for (auto it = pred_begin(mergeBB); it != pred_end(mergeBB); ++it) {
      if (*it == rhsBB)
        phi->addIncoming(rhs, rhsBB);
      else
        phi->addIncoming(builder.getInt1(isOr), *it);
    }

    return boolToDouble(phi);
  }

  ...
}

For a better understanding consider r = x || y || z; and its simplified LLVM IR representation.

┌────────────────┐
│ bx:            │
│  $x            │─────────┐
│  br $x, bm, by │         │
└────────────────┘         v
         │        ┌────────────────┐
         │        │ by:            │
         │        │  $y            │─────────┐
         │        │  br $y, bm, bz │         │
         │        └────────────────┘         v
         │                 │        ┌────────────────┐
         │                 │        │ bz:            │
         │                 │        │  $z            │
         │                 │        │  br bm         │
         │                 │        └────────────────┘
         │                 │                 │
         v                 v                 v
┌────────────────────────────────────────────────────┐
│ bm:                                                │
│  $r = phi [$x, bx], [$y, by], [$z, bz]             │
└────────────────────────────────────────────────────┘

From the bx block, the execution only jumps to the bm block, if $x is true. Similarly, from the by block to jump to bm, $y must be true.

When bz is reached, $x and $y are guaranteed to be false. Since the results are ||-d together and false || false || z == z, the result only depends on the value of $z.

These observations are what generateBinaryOperator() uses during the creation of the phi node, which yields the following result.

$r = phi [true, bx], [true, by], [$z, bz]