The first step of the compilation process is to take the textual representation of the program and break it down into a list of tokens. Just as sentences in spoken languages are composed of nouns, verbs, adjectives, and so on, the source code of a program is composed of a set of tokens.
┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐ ┌─────┐
│ fn │ │ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ │ EOF │
└────┘ └──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘ └───┘ └─────┘
The above function is called main, but it could be named anything else, like foo or bar. One thing these names have in common is that each of them uniquely identifies the given function, so the token that represents such a piece of source code is called the Identifier token.
enum class TokenKind : char {
Identifier
};
fn and void are reserved keywords, which means they are only allowed to have one meaning each. fn can only mean the beginning of a function definition, while void can only mean the void type. Programmers are not allowed to create variables or functions called fn or void.
Each keyword gets its unique token so that it's easy to differentiate between them.
enum class TokenKind : char {
...
KwFn,
KwVoid,
};
The textual representation of each keyword is also mapped to the TokenKind that represents it.
const std::unordered_map<std::string_view, TokenKind> keywords = {
{"fn", TokenKind::KwFn}, {"void", TokenKind::KwVoid}};
The rest of the tokens, including EOF, are composed of a single character. To make creating them easier, each of these characters is placed into an array, and the enumerator values of the respective tokens are the ASCII codes of their corresponding characters.
constexpr char singleCharTokens[] = {'\0', '(', ')', '{', '}', ':'};
enum class TokenKind : char {
...
Eof = singleCharTokens[0],
Lpar = singleCharTokens[1],
Rpar = singleCharTokens[2],
Lbrace = singleCharTokens[3],
Rbrace = singleCharTokens[4],
Colon = singleCharTokens[5],
};
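Since the enumerators share their values with the ASCII codes of the characters, the correspondence can even be verified at compile time. A couple of optional sanity checks:
// Optional compile-time checks: each enumerator's value is the ASCII code
// of the character the token represents.
static_assert(static_cast<char>(TokenKind::Lpar) == '(');
static_assert(static_cast<char>(TokenKind::Eof) == '\0');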
A developer might write something in the source code that cannot be represented by any of the known tokens. In such cases an Unk token is used, which represents every unknown piece of source code. Its value is -128, which falls outside the 0-127 range of the ASCII codes, so it can never clash with one of the single-character tokens.
enum class TokenKind : char {
Unk = -128,
...
};
To be able to work with tokens more effectively, besides their kind, their source location and in some cases their textual value are also stored. There is no need to store the textual value of tokens like KwVoid or KwFn, as they are already known, but for Identifier the compiler might want to use that information later.
struct Token {
SourceLocation location;
TokenKind kind;
std::optional<std::string> value = std::nullopt;
};
The SourceLocation of the token is described by the path of the source file along with the line and column index of the token in that file.
struct SourceLocation {
std::string_view filepath;
int line;
int col;
};
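Putting the two structs together, tokens can be built like this; the file name and positions below are made up for illustration:
// Illustrative values only; "main.yl" and the positions are assumptions.
Token keyword{SourceLocation{"main.yl", 1, 1}, TokenKind::KwFn};
Token identifier{SourceLocation{"main.yl", 1, 4}, TokenKind::Identifier, "main"};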
The lexer is the part of the compiler that is responsible for producing the tokens. It iterates over a source file character by character and does its best to select the correct token for each piece of code.
Within the compiler, a source file is represented by its path and a buffer filled with its content.
struct SourceFile {
std::string_view path;
std::string buffer;
};
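How the buffer gets filled is not important for the lexer itself; a minimal sketch, assuming the whole file is read into memory up front (readSourceFile() is a made-up helper and error handling is omitted):
#include <fstream>
#include <sstream>

// Minimal sketch: read the entire file into the buffer. Note that the lexer
// relies on std::string's guaranteed null terminator (buffer[buffer.size()]
// is '\0') to detect the end of the file, and that SourceFile only stores a
// view of the path, so the caller must keep that storage alive.
SourceFile readSourceFile(std::string_view path) {
  std::ifstream file{std::string{path}};
  std::stringstream contents;
  contents << file.rdbuf();
  return SourceFile{path, contents.str()};
}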
Within the buffer, the lexer always points to the character that is to be processed next, while maintaining the correct line and column indices as it traverses the buffer. Because initially none of the characters in the source file is processed, the lexer points to the first character of the buffer and starts at the position of line 1, column 0, or in other words, before the first character of the first line. The next Token is returned on demand by the getNextToken() method.
class Lexer {
const SourceFile *source;
size_t idx = 0;
int line = 1;
int column = 0;
public:
explicit Lexer(const SourceFile &source) : source(&source) {}
Token getNextToken();
};
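As a quick sketch of the intended usage (the file name and contents are made up), a consumer such as the parser constructs the lexer over a source file and pulls tokens one at a time:
// Illustrative usage: tokens are produced lazily, one per call.
SourceFile file{"example.yl", "fn main(): void {}"};
Lexer lexer{file};
Token first = lexer.getNextToken(); // the token for 'fn'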
To make the processing of the file more convenient, the peekNextChar() and eatNextChar() helper methods are introduced. The former returns the character that is to be processed next, while the latter returns that character and advances the lexer to the next one, keeping the line and column position in the source file up to date.
class Lexer {
...
char peekNextChar() const { return source->buffer[idx]; }
char eatNextChar() {
++column;
if (source->buffer[idx] == '\n') {
++line;
column = 0;
}
return source->buffer[idx++];
}
...
};
The following figure showcases how the lexer processes a file. Notice that the current source location is always one step behind the character the lexer points to.
'^' marks the character to be processed next
'v' marks the source location the lexer is standing on
v {line: 1, column: 0}
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ f │ │ n │ │ │ │ m │ │ a │ │ i │ │ n │ ...
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
^
eatNextChar()
v {line: 1, column: 1}
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ f │ │ n │ │ │ │ m │ │ a │ │ i │ │ n │ │ ( │ ...
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
^
eatNextChar()
v {line: 1, column: 2}
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ n │ │ │ │ m │ │ a │ │ i │ │ n │ │ ( │ │ ) │ ...
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
^
The logic for creating the tokens lives in the getNextToken() method. When invoked, it skips all whitespace and stores the location of the first non-whitespace character, which is the beginning of the token.
bool isSpace(char c) {
return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\t' ||
c == '\v';
}
Token Lexer::getNextToken() {
char currentChar = eatNextChar();
while (isSpace(currentChar))
currentChar = eatNextChar();
SourceLocation tokenStartLocation{source->path, line, column};
...
}
A for loop is used to iterate over the single-character tokens array, and if the current character matches one of them, the corresponding token is returned. This is the benefit of storing the characters in an array and making the corresponding TokenKind of each token have the value of the ASCII code of the character it represents: the TokenKind can immediately be returned with a simple cast.
Token Lexer::getNextToken() {
...
for (auto &&c : singleCharTokens)
if (c == currentChar)
return Token{tokenStartLocation, static_cast<TokenKind>(c)};
...
}
Design Note
In production-grade compilers, single-character tokens are usually handled with hardcoded branches, as that generally leads to the fastest running code.
if (currentChar == '(') return Token{tokenStartLocation, TokenKind::Lpar};
if (currentChar == ')') return Token{tokenStartLocation, TokenKind::Rpar};
if (currentChar == '{') return Token{tokenStartLocation, TokenKind::Lbrace};
if (currentChar == '}') return Token{tokenStartLocation, TokenKind::Rbrace};
if (currentChar == ':') return Token{tokenStartLocation, TokenKind::Colon};
if (currentChar == '\0') return Token{tokenStartLocation, TokenKind::Eof};
In this compiler, the goal is to use a representation that takes as little boilerplate code as possible to implement and extend.
Comments don't carry any information that is important to the compiler, so they are skipped by the lexer.
// this is a comment
A comment starts with a '/' followed by another '/' and lasts until the end of the line. This is where peekNextChar() comes in handy to look one character ahead.
If the lexer is at the beginning of a comment, it eats all the characters before the end of the current line or the end of the file and calls getNextToken() recursively to return the first token after the comment.
Token Lexer::getNextToken() {
...
if (currentChar == '/' && peekNextChar() == '/') {
while (peekNextChar() != '\n' && peekNextChar() != '\0')
eatNextChar();
return getNextToken();
}
...
}
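For example, when the lexer encounters the source below, the whole comment line is consumed by the loop and the recursive call returns the token for fn, as if the comment was never there:
// this comment is skipped entirely
fn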
Design Note
While comments are not important for this compiler, other compilers that convert one language to another (e.g. Java to Kotlin) or formatting tools do need to know about them. In such cases, the lexer might return a dedicated Comment token with the contents of the comment.
Identifiers are sequences of characters of the form (a-z|A-Z)(a-z|A-Z|0-9)*. Initially, keywords are also lexed as identifiers, but later their corresponding TokenKind is looked up from the map and the correct token representing them is returned.
bool isAlpha(char c) {
return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
}
bool isNum(char c) { return '0' <= c && c <= '9'; }
bool isAlnum(char c) { return isAlpha(c) || isNum(c); }
Token Lexer::getNextToken() {
...
if (isAlpha(currentChar)) {
std::string value{currentChar};
while (isAlnum(peekNextChar()))
value += eatNextChar();
if (keywords.count(value))
return Token{tokenStartLocation, keywords.at(value), std::move(value)};
return Token{tokenStartLocation, TokenKind::Identifier, std::move(value)};
}
...
}
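As an illustration, when the lexer reaches the text fn main, this branch runs twice: first "fn" is accumulated and found in the keyword map, then "main" is accumulated and falls through to the Identifier case. With made-up locations, the resulting tokens look roughly like this:
Token{{"main.yl", 1, 1}, TokenKind::KwFn, "fn"}         // keyword map hit
Token{{"main.yl", 1, 4}, TokenKind::Identifier, "main"} // fallback case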
Notice how isSpace, isAlpha, etc. are all custom functions, even though the C++ standard library also provides std::isspace, std::isalpha, etc. The standard library functions are dependent on the current locale, so if, for example, 'a' is not considered alphabetic in the current locale, the lexer will no longer work as expected.
If none of the above conditions matches the current character and the end of the function is reached, the lexer can't figure out which token represents the piece of code starting at the current character, so an Unk token is returned.
Token Lexer::getNextToken() {
...
return Token{tokenStartLocation, TokenKind::Unk};
}
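Putting everything together, here is a hedged sketch of how the lexer might be exercised end to end; the main() driver, the hardcoded buffer and the printing are illustrative additions, not part of the compiler:
#include <iostream>

// Illustrative driver: lex a hardcoded buffer and print each token's
// location, kind and textual value (if any) until the EOF token is reached.
int main() {
  SourceFile file{"example.yl", "fn main(): void {}"};
  Lexer lexer{file};

  Token token = lexer.getNextToken();
  while (token.kind != TokenKind::Eof) {
    std::cout << token.location.line << ':' << token.location.col << "  kind "
              << static_cast<int>(token.kind);
    if (token.value)
      std::cout << "  '" << *token.value << '\'';
    std::cout << '\n';
    token = lexer.getNextToken();
  }
}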