To be able to handle basic arithmetic, operators like +, -, * and / need to be introduced. These operators also have a precedence level, which the parser needs to be aware of. It matters whether the expression 2 + 3 * 4 is parsed as (2 + 3) * 4 or 2 + (3 * 4). An additional primary expression, the grouping expression ( expression ), is also required to allow the programmer to override the default precedence.
Operators are expressed in the grammar by introducing a new non-terminal for every precedence level, with each non-terminal expanding to the non-terminal of the next higher precedence level, going from the lowest level at the top to the highest at the bottom.
┌────────────┬────────────────┬─────────┐
│ Precedence │ Type           │ Symbols │
├────────────┼────────────────┼─────────┤
│ Highest    │ Multiplicative │ *, /    │
│ Lowest     │ Additive       │ +, -    │
└────────────┴────────────────┴─────────┘
<expr>
::= <additiveExpression>
<additiveExpression>
::= <multiplicativeExpression> (('+' | '-') <multiplicativeExpression>)*
<multiplicativeExpression>
::= <primaryExpression> (('*' | '/') <primaryExpression>)*
The purpose of putting the lowest precedence expression at the top and the highest at the bottom is that it's always the left-most non-terminal that gets expanded first. So after the first primary expression is parsed, the parser immediately looks for the highest precedence operator, and as it returns back up the call stack, it checks the lowest precedence operator last.
Expression: 2 + 3
1.) go down the stack, parse the primary
| parseExpr() tok: 2
| <mul> (('+' | '-') <mul>)*
|
| parseMul()
| <pri> (('*' | '/') <pri>)*
|
| parsePri() eat: 2
| '2' tok: +
2.) go up the stack, match the operator
| return from parsePri
|
| '2' (('*' | '/') <pri>)*
| return from parseMul
|
| '2' (('+' | '-') <mul>)* eat: +
| ...
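Put into code, the grammar rules above could map to one recursive descent method per precedence level. The following is a minimal sketch of the additive level (the method names are illustrative and not part of the compiler, and the BinaryOperator node and operator tokens it relies on are only introduced later in this section):

// Illustrative only: one method per precedence level, mirroring the grammar.
// The multiplicative level would look the same, just with '*' and '/'.
std::unique_ptr<Expr> Parser::parseAdditiveExpr() {
  varOrReturn(lhs, parseMultiplicativeExpr());

  while (nextToken.kind == TokenKind::Plus ||
         nextToken.kind == TokenKind::Minus) {
    Token op = nextToken;
    eatNextToken(); // eat '+' or '-'

    varOrReturn(rhs, parseMultiplicativeExpr());

    lhs = std::make_unique<BinaryOperator>(op.location, std::move(lhs),
                                           std::move(rhs), op.kind);
  }

  return lhs;
}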
There are, however, a few drawbacks to using a recursive descent parser for operators. Languages used in production have more than 2 precedence levels; in Kotlin, for example, there are 15. Creating and unwinding 15 stack frames for every operator has a significant impact on performance, hence a different approach is used.
Other languages, like Haskell, support user-defined operators with custom precedence levels and associativity. Implementing such a feature in a recursive descent parser is tricky, as it requires adding new functions at runtime and modifying which functions are called next.
Operator precedence parsing is an algorithm that can address both of the above drawbacks. Also, this parsing technique can easily be combined with recursive descent.
Before implementing this parser, however, the lexer needs to be extended with the new operators.
constexpr char singleCharTokens[] = {..., '+', '-', '*'};
enum class TokenKind : char {
...
Slash,
...
Plus = singleCharTokens[8],
Minus = singleCharTokens[9],
Asterisk = singleCharTokens[10]
};
Since comments begin with //, the lexer now needs to be able to differentiate between / and //.
Token Lexer::getNextToken() {
...
if (currentChar == '/') {
if (peekNextChar() != '/')
return Token{tokenStartLocation, TokenKind::Slash};
...
}
...
}
In the AST each of these operators is represented by a single node, which stores its left and right operands and the kind of the token that represents the operator.
struct BinaryOperator : public Expr {
std::unique_ptr<Expr> lhs;
std::unique_ptr<Expr> rhs;
TokenKind op;
BinaryOperator(SourceLocation location,
std::unique_ptr<Expr> lhs,
std::unique_ptr<Expr> rhs,
TokenKind op)
: Expr(location),
lhs(std::move(lhs)),
rhs(std::move(rhs)),
op(op) {}
void dump(size_t level = 0) const override;
};
Design Note
A more elegant representation for operators is creating a dedicated enum of the possible kinds instead of reusing the tokens, but for now, the compiler is kept as simple as possible.
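For illustration, such a dedicated operator enum and the mapping from tokens could look like the following sketch (the names are hypothetical; the compiler keeps reusing TokenKind):

// Hypothetical alternative representation; not used by the compiler.
enum class BinaryOp { Add, Sub, Mul, Div };

BinaryOp binaryOpFromToken(TokenKind kind) {
  switch (kind) {
  case TokenKind::Plus:
    return BinaryOp::Add;
  case TokenKind::Minus:
    return BinaryOp::Sub;
  case TokenKind::Asterisk:
    return BinaryOp::Mul;
  case TokenKind::Slash:
    return BinaryOp::Div;
  default:
    llvm_unreachable("unexpected operator token");
  }
}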
The textual representation of BinaryOperator contains its operands and the stored operator.
void BinaryOperator::dump(size_t level) const {
std::cerr << indent(level) << "BinaryOperator: '" << getOpStr(op) << '\''
<< '\n';
lhs->dump(level + 1);
rhs->dump(level + 1);
}
To print the correct operator, a dedicated helper is created that maps the operator tokens to their textual representations.
std::string_view getOpStr(TokenKind op) {
if (op == TokenKind::Plus)
return "+";
if (op == TokenKind::Minus)
return "-";
if (op == TokenKind::Asterisk)
return "*";
if (op == TokenKind::Slash)
return "/";
llvm_unreachable("unexpected operator");
}
The idea is to parse the LHS of an operator first and then parse each of the following operators together with their RHS. In the case of 2 + 3 * 4, it means parsing 2, then + 3 and finally * 4.
When an operator and its RHS are parsed, it needs to be decided whether that RHS is really the RHS of that operator or the LHS of the upcoming one. This is done by checking whether the precedence of the upcoming operator is higher than the precedence of the current one.
For example, in 2 + 3 * 4 the RHS of + is 3. However, keeping 3 as the RHS here would mean that the expression is parsed as (2 + 3) * 4. Instead, it needs to be considered as the LHS of the upcoming *, because * has a higher precedence than +, so that the expression is parsed correctly as 2 + (3 * 4).
┌──────────┬────────────┐
│ Operator │ Precedence │
├──────────┼────────────┤
│ *, /     │ 6          │
│ +, -     │ 5          │
└──────────┴────────────┘
Expression: 2 + 3 * 4
1.) parse LHS outside the operator precedence parser
2 + 3 * 4 ;
^
2.) invoke the parser with LHS being 2 and parse the
next OP and its RHS
OP: (precedence: 0) LHS: 2 RHS:
2 + 3 * 4 ;
^
OP: + (precedence: 5) LHS: 2 RHS:
2 + 3 * 4 ;
^
3.) check the precedence of the upcoming OP
OP: + (precedence: 5) LHS: 2 RHS: 3
2 + 3 * 4 ;
^ '*' has a higher precedence (6) than '+' (5)
4.) halt the parsing of '+' and invoke the parser
recursively with LHS set to 3 and the current
precedence level increased by one
OP: (precedence: 6) LHS: 3 RHS:
2 + 3 * 4 ;
^
5.) parse the next OP and its RHS
OP: * (precedence: 6) LHS: 3 RHS:
2 + 3 * 4 ;
^
OP: * (precedence: 6) LHS: 3 RHS: 4
2 + 3 * 4 ;
^ return BinOp('*', 3, 4)
6.) resume the parsing of '+' and make the returned
BinOp its RHS
OP: + (precedence: 5) LHS: 2 RHS: 3 * 4
2 + 3 * 4 ;
^ return BinOp('+', 2, BinOp('*', 3, 4))
To implement this logic, first, a helper is defined that returns the token precedence.
int getTokPrecedence(TokenKind tok) {
switch (tok) {
case TokenKind::Asterisk:
case TokenKind::Slash:
return 6;
case TokenKind::Plus:
case TokenKind::Minus:
return 5;
default:
return -1;
}
}
Design Note
The precedence table is hardcoded right now to make retrieving the precedence as quick as possible.
If a language however supports user-defined operators, the precedence table needs to be modified during the parsing of the file. In such cases, an std::unordered_map or a similar data structure might be used.
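A minimal sketch of such a dynamic table, assuming a hypothetical registerOperator() hook that is invoked when a fixity declaration is parsed:

// Sketch of a runtime-extensible precedence table. The initial entries and
// 'registerOperator' are illustrative; the compiler keeps the hardcoded switch.
std::unordered_map<TokenKind, int> precedenceTable = {
    {TokenKind::Asterisk, 6},
    {TokenKind::Slash, 6},
    {TokenKind::Plus, 5},
    {TokenKind::Minus, 5},
};

void registerOperator(TokenKind tok, int precedence) {
  precedenceTable[tok] = precedence;
}

int getTokPrecedence(TokenKind tok) {
  auto it = precedenceTable.find(tok);
  return it == precedenceTable.end() ? -1 : it->second;
}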
The parseExpr() function houses the driver of the operator precedence parser. It first parses the initial primary expression and then parses the RHS for it. parseExprRHS() houses the core logic of the operator precedence parser and, as a result, it needs to know the precedence of the current operator. Since initially there is no operator, it's given 0 as the precedence.
std::unique_ptr<Expr> Parser::parseExpr() {
varOrReturn(lhs, parsePrimary());
return parseExprRHS(std::move(lhs), 0);
}
The core of the parser is a loop that keeps eating the operators and their RHS expressions until it finds an operator with a precedence lower than the given one.
std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
int precedence) {
while (true) {
Token op = nextToken;
int curOpPrec = getTokPrecedence(op.kind);
if (curOpPrec < precedence)
return lhs;
...
}
}
If the precedence of the current operator is higher than or equal to the given one, the RHS is parsed and the precedence of the upcoming operator is checked. If the precedence of the upcoming operator is higher, the just-parsed primary expression is really the LHS of that upcoming operator, so its RHS is parsed recursively. Otherwise, a BinaryOperator node is created and becomes the current LHS.
std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
int precedence) {
while (true) {
...
eatNextToken(); // eat operator
varOrReturn(rhs, parsePrimary());
if (curOpPrec < getTokPrecedence(nextToken.kind)) {
rhs = parseExprRHS(std::move(rhs), curOpPrec + 1);
if (!rhs)
return nullptr;
}
lhs = std::make_unique<BinaryOperator>(op.location, std::move(lhs),
std::move(rhs), op.kind);
}
}
Notice how parseExprRHS() is called recursively with curOpPrec + 1. The + 1 controls the associativity of the current operator: with it present, the operator is left-associative; without it, the operator is right-associative. So far all of the supported operators are left-associative.
Expression: 2 + 3 * 4 + 5
// Assume left-associativity
1.) parseExpr()
(2 + 3 * 4 + 5
^
2.) LHS = parsePrimary() == 2
(2 + 3 * 4 + 5
^
3.) parseExprRHS(LHS: 2, precedence: 0)
(2 + 3 * 4 + 5
^ 'curTokPrec' (5) >= precedence (0)
4.) eat operator, RHS = parsePrimary() == 3
(2 + 3 * 4 + 5
^ 'curTokPrec' (5) < 'nextTokPrec' (6)
5.) RHS = parseExprRHS(LHS: 3, precedence: 'curTokPrec' + 1 == 6)
(2 + (3 * 4 + 5
^ 'curTokPrec' (6) >= precedence (6)
6.) eat operator, RHS = parsePrimary() == 4
(2 + (3 * 4) + 5
^ 'curTokPrec' (5) < precedence (6), return
7.) loop
((2 + (3 * 4)) + 5
^ 'curTokPrec' (5) >= precedence (0)
8.) eat operator, RHS = parsePrimary() == 5
((2 + (3 * 4)) + 5)
^ out of tokens, return
// Assume right-associativity
1.) parseExpr()
(2 + 3 * 4 + 5
^
2.) LHS = parsePrimary() == 2
(2 + 3 * 4 + 5
^
3.) parseExprRHS(LHS: 2, precedence: 0)
(2 + 3 * 4 + 5
^ 'curTokPrec' (5) >= precedence (0)
4.) eat operator, RHS = parsePrimary() == 3
(2 + 3 * 4 + 5
^ 'curTokPrec' (5) < 'nextTokPrec' (6)
5.) RHS = parseExprRHS(LHS: 3, precedence: 'curTokPrec' == 5)
(2 + (3 * 4 + 5
^ 'curTokPrec' (6) >= precedence (5)
6.) eat operator, RHS = parsePrimary() == 4
(2 + ((3 * 4) + 5
^ 'curTokPrec' (5) >= precedence (5)
7.) eat operator, RHS = parsePrimary() == 5
(2 + ((3 * 4) + 5)
^ out of tokens, return
8.) create binop with the returned RHS
(2 + ((3 * 4) + 5))
^
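As a quick sanity check, dumping the tree built for 2 + 3 * 4 with the left-associative parser should produce something along these lines (the literal lines are schematic, since the dump format of number literals is defined elsewhere in the compiler):

BinaryOperator: '+'
  NumberLiteral: '2'
  BinaryOperator: '*'
    NumberLiteral: '3'
    NumberLiteral: '4'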
Design Note
A language can support both left- and right-associative operators at the same time. C++ is one example, where the arithmetic operators are left-associative, while assignments are right-associative.
((1 + 2) + 3) (x = (y = 0))
With operator precedence parsing it's simple to switch the associativity even in the middle of parsing an expression. It only needs to be ensured that in the recursive call the precedence is not incremented, and that the recursive call is also made when the next operator has the same precedence level as the current one.
while (true) {
  ...
  bool rightAssoc = isRightAssoc(op);
  int nextOpPrec = getTokPrecedence(nextToken.kind);

  if ((curOpPrec < nextOpPrec) ||
      (curOpPrec == nextOpPrec && rightAssoc)) {
    rhs = parseExprRHS(std::move(rhs), curOpPrec + !rightAssoc);
    if (!rhs)
      return nullptr;
  }
  ...
}
Unary operators can have a prefix or postfix form. Currently, only the prefix unary negation is supported.
<prefixExpression>
::= '-'* <postfixExpression>
The AST node of a unary operator is similar to a binary one in that its operand is stored and the token kind is used to represent the operator.
struct UnaryOperator : public Expr {
std::unique_ptr<Expr> operand;
TokenKind op;
UnaryOperator(SourceLocation location,
std::unique_ptr<Expr> operand,
TokenKind op)
: Expr(location),
operand(std::move(operand)),
op(op) {}
void dump(size_t level = 0) const override;
};
The textual representation of the node also contains both the operator and its operand.
void UnaryOperator::dump(size_t level) const {
std::cerr << indent(level) << "UnaryOperator: '" << getOpStr(op) << '\''
<< '\n';
operand->dump(level + 1);
}
Prefix operators have higher precedence than any of the binary operators and are parsed in their dedicated method.
std::unique_ptr<Expr> Parser::parsePrefixExpr() {
Token tok = nextToken;
if (tok.kind != TokenKind::Minus)
return parsePostfixExpr();
eatNextToken(); // eat '-'
varOrReturn(rhs, parsePrefixExpr());
return std::make_unique<UnaryOperator>(tok.location, std::move(rhs),
tok.kind);
}
With this addition, parseExpr() is modified to call parsePrefixExpr() instead of parsePrimary() when parsing the supposed LHS.
std::unique_ptr<Expr> Parser::parseExpr() {
varOrReturn(lhs, parsePrefixExpr());
...
}
parseExprRHS() is also updated to parse a prefix expression when it expects an RHS.
std::unique_ptr<Expr> Parser::parseExprRHS(std::unique_ptr<Expr> lhs,
int precedence) {
while (true) {
...
varOrReturn(rhs, parsePrefixExpr());
...
}
}
The resolved node of a BinaryOperator is identical to its non-resolved counterpart.
struct ResolvedBinaryOperator : public ResolvedExpr {
TokenKind op;
std::unique_ptr<ResolvedExpr> lhs;
std::unique_ptr<ResolvedExpr> rhs;
ResolvedBinaryOperator(SourceLocation location,
TokenKind op,
std::unique_ptr<ResolvedExpr> lhs,
std::unique_ptr<ResolvedExpr> rhs)
: ResolvedExpr(location, lhs->type),
op(op),
lhs(std::move(lhs)),
rhs(std::move(rhs)) {}
void dump(size_t level = 0) const override;
};
Its textual representation is also the same.
void ResolvedBinaryOperator::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedBinaryOperator: '" << getOpStr(op)
<< '\'' << '\n';
lhs->dump(level + 1);
rhs->dump(level + 1);
}
Similarly, a ResolvedUnaryOperator also stores its operand and kind, like its AST node.
struct ResolvedUnaryOperator : public ResolvedExpr {
TokenKind op;
std::unique_ptr<ResolvedExpr> operand;
ResolvedUnaryOperator(SourceLocation location,
TokenKind op,
std::unique_ptr<ResolvedExpr> operand)
: ResolvedExpr(location, operand->type),
op(op),
operand(std::move(operand)) {}
void dump(size_t level = 0) const override;
};
Its textual representation also includes both of these fields.
void ResolvedUnaryOperator::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedUnaryOperator: '" << getOpStr(op)
<< '\'' << '\n';
if (auto val = getConstantValue())
std::cerr << indent(level) << "| value: " << *val << '\n';
operand->dump(level + 1);
}
The resolution of both operators is driven by resolveExpr().
std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
...
if (const auto *binaryOperator = dynamic_cast<const BinaryOperator *>(&expr))
return resolveBinaryOperator(*binaryOperator);
if (const auto *unaryOperator = dynamic_cast<const UnaryOperator *>(&expr))
return resolveUnaryOperator(*unaryOperator);
...
}
For the unary operator, its operand is resolved first. If the resolution succeeds and the type of the operand is not void, the corresponding resolved node is returned. Otherwise, an error is reported.
std::unique_ptr<ResolvedUnaryOperator>
Sema::resolveUnaryOperator(const UnaryOperator &unary) {
varOrReturn(resolvedRHS, resolveExpr(*unary.operand));
if (resolvedRHS->type.kind == Type::Kind::Void)
return report(
resolvedRHS->location,
"void expression cannot be used as an operand to unary operator");
return std::make_unique<ResolvedUnaryOperator>(unary.location, unary.op,
std::move(resolvedRHS));
}
For binary operators, both the LHS and the RHS are resolved. If either of them is of void type, an error is reported. Otherwise, it should be checked that the two operands have the same type, but in this case, when neither of the operands is void, this is guaranteed by the type system.
std::unique_ptr<ResolvedBinaryOperator>
Sema::resolveBinaryOperator(const BinaryOperator &binop) {
varOrReturn(resolvedLHS, resolveExpr(*binop.lhs));
varOrReturn(resolvedRHS, resolveExpr(*binop.rhs));
if (resolvedLHS->type.kind == Type::Kind::Void)
return report(
resolvedLHS->location,
"void expression cannot be used as LHS operand to binary operator");
if (resolvedRHS->type.kind == Type::Kind::Void)
return report(
resolvedRHS->location,
"void expression cannot be used as RHS operand to binary operator");
return std::make_unique<ResolvedBinaryOperator>(
binop.location, binop.op, std::move(resolvedLHS), std::move(resolvedRHS));
}
Code generation is driven by generateExpr().
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
...
if (auto *binop = dynamic_cast<const ResolvedBinaryOperator *>(&expr))
return generateBinaryOperator(*binop);
if (auto *unop = dynamic_cast<const ResolvedUnaryOperator *>(&expr))
return generateUnaryOperator(*unop);
...
}
The IRBuilder API provides a function for all of the common unary and binary operators, which makes generating code for operators simple and convenient.
For unary operators, the value of their sub-expression is generated first and the corresponding operation is performed on the value.
llvm::Value *Codegen::generateUnaryOperator(const ResolvedUnaryOperator &unop) {
llvm::Value *rhs = generateExpr(*unop.operand);
if (unop.op == TokenKind::Minus)
return builder.CreateFNeg(rhs);
llvm_unreachable("unknown unary op");
return nullptr;
}
Code generation for binary operators is a similar process. Both sides are generated first then the corresponding operation is inserted.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
TokenKind op = binop.op;
llvm::Value *lhs = generateExpr(*binop.lhs);
llvm::Value *rhs = generateExpr(*binop.rhs);
if (op == TokenKind::Plus)
return builder.CreateFAdd(lhs, rhs);
if (op == TokenKind::Minus)
return builder.CreateFSub(lhs, rhs);
if (op == TokenKind::Asterisk)
return builder.CreateFMul(lhs, rhs);
if (op == TokenKind::Slash)
return builder.CreateFDiv(lhs, rhs);
llvm_unreachable("unexpected binary operator");
return nullptr;
}
The grouping expression is the last primary expression supported by the language. In the grammar, it is a wrapper around an arbitrary expression.
<primaryExpression>
::= ...
| '(' <expr> ')'
The AST node representing it is also a wrapper around an Expr node.
struct GroupingExpr : public Expr {
std::unique_ptr<Expr> expr;
GroupingExpr(SourceLocation location, std::unique_ptr<Expr> expr)
: Expr(location),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
Its textual representation is the name of the node and its enclosed expression.
void GroupingExpr::dump(size_t level) const {
std::cerr << indent(level) << "GroupingExpr:\n";
expr->dump(level + 1);
}
Grouping expressions are parsed directly in parsePrimary().
std::unique_ptr<Expr> Parser::parsePrimary() {
...
if (nextToken.kind == TokenKind::Lpar) {
eatNextToken(); // eat '('
varOrReturn(expr, parseExpr());
matchOrReturn(TokenKind::Rpar, "expected ')'");
eatNextToken(); // eat ')'
return std::make_unique<GroupingExpr>(location, std::move(expr));
}
...
}
The resolved node of the expression is also a wrapper around a ResolvedExpr, similar to the non-resolved one.
struct ResolvedGroupingExpr : public ResolvedExpr {
std::unique_ptr<ResolvedExpr> expr;
ResolvedGroupingExpr(SourceLocation location,
std::unique_ptr<ResolvedExpr> expr)
: ResolvedExpr(location, expr->type),
expr(std::move(expr)) {}
void dump(size_t level = 0) const override;
};
The textual representation of the resolved node is also identical to its non-resolved counterpart.
void ResolvedGroupingExpr::dump(size_t level) const {
std::cerr << indent(level) << "ResolvedGroupingExpr:\n";
expr->dump(level + 1);
}
As for every other expression, the resolution of grouping expressions is also driven by resolveExpr().
std::unique_ptr<ResolvedExpr> Sema::resolveExpr(const Expr &expr) {
...
if (const auto *groupingExpr = dynamic_cast<const GroupingExpr *>(&expr))
return resolveGroupingExpr(*groupingExpr);
...
}
To resolve a GroupingExpr, only its wrapped expression needs to be resolved.
std::unique_ptr<ResolvedGroupingExpr>
Sema::resolveGroupingExpr(const GroupingExpr &grouping) {
varOrReturn(resolvedExpr, resolveExpr(*grouping.expr));
return std::make_unique<ResolvedGroupingExpr>(grouping.location,
std::move(resolvedExpr));
}
Code generation mirrors the resolution process: the IR is generated for the wrapped expression.
llvm::Value *Codegen::generateExpr(const ResolvedExpr &expr) {
...
if (auto *grouping = dynamic_cast<const ResolvedGroupingExpr *>(&expr))
return generateExpr(*grouping->expr);
...
}
Arithmetic operators are handled, but a group of logical operators is still missing. These missing operators include ==, &&, ||, <, > and !. Each of these operators is also a new token in the lexer.
constexpr char singleCharTokens[] = {..., '<', '>', '!'};
enum class TokenKind : char {
...
EqualEqual,
AmpAmp,
PipePipe,
...
Lt = singleCharTokens[11],
Gt = singleCharTokens[12],
Excl = singleCharTokens[13],
};
The single-character tokens are lexed automatically, but for the two-character ones, the characters must be matched manually.
Token Lexer::getNextToken() {
...
if (currentChar == '=' && peekNextChar() == '=') {
eatNextChar();
return Token{tokenStartLocation, TokenKind::EqualEqual};
}
if (currentChar == '&' && peekNextChar() == '&') {
eatNextChar();
return Token{tokenStartLocation, TokenKind::AmpAmp};
}
if (currentChar == '|' && peekNextChar() == '|') {
eatNextChar();
return Token{tokenStartLocation, TokenKind::PipePipe};
}
...
}
The parsing of the new binary operators is handled automatically once the precedence table is extended with them.
int getTokPrecedence(TokenKind tok) {
switch (tok) {
...
case TokenKind::Lt:
case TokenKind::Gt:
return 4;
case TokenKind::EqualEqual:
return 3;
case TokenKind::AmpAmp:
return 2;
case TokenKind::PipePipe:
return 1;
...
}
}
For the unary !, the corresponding parser method needs to be taught about the new token.
std::unique_ptr<Expr> Parser::parsePrefixExpr() {
...
if (tok.kind != TokenKind::Excl && tok.kind != TokenKind::Minus)
return parsePostfixExpr();
eatNextToken(); // eat '!' or '-'
...
}
The semantic analyzer is not affected by the change; the code generator, however, needs to be taught which instructions to generate for each of the new operators.
Because these operators are supposed to return logical values, but there is no dedicated logical type in the language, they behave the same way as they do in C. They take and return numeric values: if a value is 0, it is considered logically false, otherwise it is considered logically true. For example, 3 < 5 evaluates to 1, while !2 evaluates to 0.
LLVM IR, however, has a dedicated 1-bit integer type, which most of the conditional instructions take as a parameter. To be able to work with these instructions, two helper methods are introduced to convert between numeric and logical values.
For the conversion to a logical value, it is checked whether the given numeric value is not equal to 0, and the result of the comparison is returned.
llvm::Value *Codegen::doubleToBool(llvm::Value *v) {
return builder.CreateFCmpONE(
v, llvm::ConstantFP::get(builder.getDoubleTy(), 0.0), "to.bool");
}
For the conversion back to double a simple cast instruction is inserted.
llvm::Value *Codegen::boolToDouble(llvm::Value *v) {
return builder.CreateUIToFP(v, builder.getDoubleTy(), "to.double");
}
The ! operator negates the logical value of its operand, so the operand is converted to bool before the negation is performed. Once the value is negated, it is promoted back to a numeric value, because the language can only work with numbers.
llvm::Value *Codegen::generateUnaryOperator(const ResolvedUnaryOperator &unop) {
...
if (unop.op == TokenKind::Excl)
return boolToDouble(builder.CreateNot(doubleToBool(rhs)));
...
}
The ==, < and > binary operators also have a corresponding instruction in the IRBuilder, which can be used in a similar fashion. These instructions compare the numeric operands directly, so only the i1 result of the comparison needs to be promoted back to double.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
...
if (op == TokenKind::Lt)
return boolToDouble(builder.CreateFCmpOLT(lhs, rhs));
if (op == TokenKind::Gt)
return boolToDouble(builder.CreateFCmpOGT(lhs, rhs));
if (op == TokenKind::EqualEqual)
return boolToDouble(builder.CreateFCmpOEQ(lhs, rhs));
...
}
All of the previously seen comparison functions end with three characters starting with O. This O means that an ordered comparison is performed. For example, OLT means Ordered Less Than.
LLVM also supports an unordered counterpart for each of these instructions, like FCmpULT. The difference between the two is that the unordered version yields true if either of the values is NaN (Not a Number), while the ordered version yields false in that case.
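For illustration, the same less-than comparison could be emitted with either predicate through a small hypothetical helper; the two only differ when one of the inputs is NaN:

// Hypothetical helper, not part of the compiler: the ordered form yields false
// for NaN operands, while the unordered form yields true for them.
llvm::Value *Codegen::createLessThan(llvm::Value *lhs, llvm::Value *rhs,
                                     bool unordered) {
  return unordered ? builder.CreateFCmpULT(lhs, rhs)
                   : builder.CreateFCmpOLT(lhs, rhs);
}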
One last change to make is to have these operators appear correctly in the AST and resolved tree dumps, so the getOpStr() function is extended to handle them.
std::string_view getOpStr(TokenKind op) {
...
if (op == TokenKind::EqualEqual)
return "==";
if (op == TokenKind::AmpAmp)
return "&&";
if (op == TokenKind::PipePipe)
return "||";
if (op == TokenKind::Lt)
return "<";
if (op == TokenKind::Gt)
return ">";
if (op == TokenKind::Excl)
return "!";
...
}
The conditional binary operators && and ||, on the other hand, need special handling because these operators are evaluated lazily.
For &&, the left operand is evaluated first, and the right operand is only evaluated if the result is true. If the left operand is false, it's already known that the result of the operator is false, so there is no need to evaluate the right operand.
For || it's the opposite. If the first operand is true, the result of the operator is true, so the right operand doesn't have to be evaluated.
This is important to keep in mind since the operands can have side effects too. In the following program for example nothing should be printed to the standard output.
fn sideEffect(x: number): number {
println(x);
return x;
}
fn true(): number {
return 1;
}
fn false(): number {
return 0;
}
fn main(): void {
true() || sideEffect(0);
false() && sideEffect(1);
}
A different way to look at true() || sideEffect(0) is through the following C++-like snippet.
auto lhs = true();
if(!lhs) {
auto rhs = sideEffect(0);
return rhs;
}
return true;
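The && case is the mirror image: false() && sideEffect(1) roughly corresponds to the following sketch, where the right operand is only reached when the left one is truthy.

auto lhs = false();
if (lhs) {
  auto rhs = sideEffect(1);
  return rhs;
}
return false;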
In LLVM IR this is modelled by introducing a different basic block for the right operand.
entry:
┌────────────────────────────────────────────────┐
│ %0 = call double @true() │
│ %to.bool = fcmp one double %0, 0.000000e+00 │
│ br i1 %to.bool, label %or.merge, label %or.rhs │
└────────────────────────────────────────────────┘
| |
| lhs false
| or.rhs |
| ┌───────────────────────────────────────────────────┐
| │ %1 = call double @sideEffect(double 0.000000e+00) │
lhs true | %to.bool1 = fcmp one double %1, 0.000000e+00 │
| | br label %or.merge │
| └───────────────────────────────────────────────────┘
| |
| |
or.merge: | |
┌──────────────────────────────────────────────────────┐
│ %2 = phi i1 [ %to.bool1, %or.rhs ], [ true, %entry ] │
│ ... │
└──────────────────────────────────────────────────────┘
So far implementing this seems simple, but it turns into a not-so-trivial problem once multiple operators are chained together. Consider the following example and how its block structure should look.
x || y && z || w
┌───┐
│ x │──F──┐
└───┘ │
│ ┌───┐
│ │ y │──T──┐
│ └───┘ │
│ │ ┌───┐
T │ │ z │──F──┐
│ F └───┘ │
│ │ │ ┌───┐
│ │ T │ w │
│ │ │ └───┘
│ ┌─────────┐ │
└──────│ merge │──────┘
└─────────┘
Before deciding how to generate something like this, it is worth taking a quick look at some examples and where each operand should lead.
x || y
┌────┐ ┌────┐
┌──│ || │──┐ │ || │
│ └────┘ │ └────┘
┌───┐ ┌───┐ ┌───┐ ┌───┐
│ x │ │ y │ │ x │─────>│ y │
└───┘ └───┘ └───┘ └───┘
│ │
x || y || z
┌────┐ ┌────┐
┌──│ || │──┐ │ || │
│ └────┘ │ └────┘
┌────┐ ┌───┐ ┌────┐ ┌───┐
┌──│ || │──┐ │ z │ │ || │ │ z │
│ └────┘ │ └───┘ └────┘ └───┘
┌───┐ ┌───┐ ┌───┐ ┌───┐ ^
│ x │ │ y │ │ x │─────>│ y │───┘
└───┘ └───┘ └───┘ └───┘
│ │
x || y && z || w
┌────┐ ┌────┐
┌──│ || │──┐ │ || │
│ └────┘ │ └────┘
┌────┐ ┌───┐ ┌────┐ ┌───┐
┌──│ || │──┐ │ w │ │ || │ │ w │
│ └────┘ │ └───┘ └────┘ └───┘
┌───┐ ┌────┐ ┌───┐ ┌────┐ ^
│ x │ ┌──│ && │──┐ │ x │───┐ │ && │ │
└───┘ │ └────┘ │ └───┘ V └────┘ │
┌───┐ ┌───┐ │ ┌───┐ ┌───┐
│ y │ │ z │ │ │ y │─────>│ z │
└───┘ └───┘ │ └───┘ └───┘
│ │ │
The observation is that from the basic block of each operand, one edge should lead to the merge block and the other one should lead to the block containing the next operand on the right in the AST.
The idea is to create a function that traverses the AST of a conditional binary operator and receives the two blocks the execution should jump to when the current operand evaluates to true and false.
void Codegen::generateConditionalOperator(const ResolvedExpr &op,
llvm::BasicBlock *trueBB,
llvm::BasicBlock *falseBB) {
const auto *binop = dynamic_cast<const ResolvedBinaryOperator *>(&op);
if (binop)
return;
llvm::Value *val = doubleToBool(generateExpr(op));
builder.CreateCondBr(val, trueBB, falseBB);
};
When a conditional binary operator is visited, a new block is created for the RHS, but this new block needs to be treated differently for || and &&. In the case of ||, this block is visited if the left operand is false, so it's called or.lhs.false, while for && the block is visited if the LHS evaluates to true, so its name is and.lhs.true. Each of these blocks is then inserted into the current function.
void Codegen::generateConditionalOperator(const ResolvedExpr &op,
llvm::BasicBlock *trueBB,
llvm::BasicBlock *falseBB) {
llvm::Function *function = getCurrentFunction();
...
if (binop && binop->op == TokenKind::PipePipe) {
llvm::BasicBlock *nextBB =
llvm::BasicBlock::Create(context, "or.lhs.false", function);
...
return;
}
if (binop && binop->op == TokenKind::AmpAmp) {
llvm::BasicBlock *nextBB =
llvm::BasicBlock::Create(context, "and.lhs.true", function);
...
return;
}
...
};
The getCurrentFunction() helper extracts the current function out of the builder.
llvm::Function *Codegen::getCurrentFunction() {
return builder.GetInsertBlock()->getParent();
};
With the new block created, the LHS of the operator can be visited by making this new block the next false block for || and the next true block for &&.
void Codegen::generateConditionalOperator(const ResolvedExpr &op,
llvm::BasicBlock *trueBB,
llvm::BasicBlock *falseBB) {
...
if (binop && binop->op == TokenKind::PipePipe) {
...
generateConditionalOperator(*binop->lhs, trueBB, nextBB);
...
}
if (binop && binop->op == TokenKind::AmpAmp) {
...
generateConditionalOperator(*binop->lhs, nextBB, falseBB);
...
}
...
};
After the instructions for the LHS are generated, the insertion point is set to nextBB, so the next operand on the right is generated into this new block when it is visited.
void Codegen::generateConditionalOperator(const ResolvedExpr &op,
llvm::BasicBlock *trueBB,
llvm::BasicBlock *falseBB) {
...
if (binop && binop->op == TokenKind::PipePipe) {
...
builder.SetInsertPoint(nextBB);
generateConditionalOperator(*binop->rhs, trueBB, falseBB);
return;
}
if (binop && binop->op == TokenKind::AmpAmp) {
...
builder.SetInsertPoint(nextBB);
generateConditionalOperator(*binop->rhs, trueBB, falseBB);
return;
}
...
};
This function is initially called from generateBinaryOperator(), which first checks if the operator is an || or an && and creates the initial true and false blocks.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
TokenKind op = binop.op;
if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
llvm::Function *function = getCurrentFunction();
bool isOr = op == TokenKind::PipePipe;
auto *rhsTag = isOr ? "or.rhs" : "and.rhs";
auto *mergeTag = isOr ? "or.merge" : "and.merge";
auto *rhsBB = llvm::BasicBlock::Create(context, rhsTag, function);
auto *mergeBB = llvm::BasicBlock::Create(context, mergeTag, function);
...
}
...
}
If the current operator is an ||, the true block is the merge block, while the false block is the RHS block. Remember, when the left operand is true, the right one is not evaluated. In the case of &&, the order is the opposite.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
...
if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
...
llvm::BasicBlock *trueBB = isOr ? mergeBB : rhsBB;
llvm::BasicBlock *falseBB = isOr ? rhsBB : mergeBB;
generateConditionalOperator(*binop.lhs, trueBB, falseBB);
...
}
...
}
Then the RHS is visited and a jump to the merge block is inserted into the current block.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
...
if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
...
builder.SetInsertPoint(rhsBB);
llvm::Value *rhs = doubleToBool(generateExpr(*binop.rhs));
builder.CreateBr(mergeBB);
...
}
...
}
Since generateExpr() can change the current block, rhsBB needs to be retrieved once again (consider x || (y && z), where the insertion point after the call is the and.merge block). Then the insertion point is set to the merge block and a phi node is created.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
...
if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
...
rhsBB = builder.GetInsertBlock();
builder.SetInsertPoint(mergeBB);
llvm::PHINode *phi = builder.CreatePHI(builder.getInt1Ty(), 2);
...
}
...
}
From rhsBB the incoming value of the result is rhs, while from every other predecessor block it is true in the case of || and false in the case of &&.
llvm::Value *
Codegen::generateBinaryOperator(const ResolvedBinaryOperator &binop) {
...
if (op == TokenKind::AmpAmp || op == TokenKind::PipePipe) {
...
for (auto it = pred_begin(mergeBB); it != pred_end(mergeBB); ++it) {
if (*it == rhsBB)
phi->addIncoming(rhs, rhsBB);
else
phi->addIncoming(builder.getInt1(isOr), *it);
}
return boolToDouble(phi);
}
...
}
For a better understanding, consider r = x || y || z; and its simplified LLVM IR representation.
┌────────────────┐
│ bx: │
│ $x │─────────┐
│ br $x, bm, by │ │
└────────────────┘ v
│ ┌────────────────┐
│ │ by: │
│ │ $y │─────────┐
│ │ br $y, bm, bz │ │
│ └────────────────┘ v
│ │ ┌────────────────┐
│ │ │ bz: │
│ │ │ $z │
│ │ │ br bm │
│ │ └────────────────┘
│ │ │
v v v
┌────────────────────────────────────────────────────┐
│ bm: │
│ $r = phi [$x, bx], [$y, by], [$z, bz] │
└────────────────────────────────────────────────────┘
From the bx block, the execution only jumps to the bm block if $x is true. Similarly, to jump from the by block to bm, $y must be true. When bz is reached, $x and $y are guaranteed to be false. Since the results are ||-d together and false || false || z == z, the result only depends on the value of $z.
These observations are what generateBinaryOperator() uses during the creation of the phi node, which yields the following result.
$r = phi [true, bx], [true, by], [$z, bz]