Tokenization

The first step of the compilation process is to take the textual representation of the program and break it down into a list of tokens. Just as sentences in a spoken language are composed of nouns, verbs, adjectives, and so on, a program in a programming language is composed of a set of tokens.
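
For example, the figure below shows the token stream produced for this one-line function:

fn main(): void {}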

┌────┐ ┌──────┐ ┌───┐ ┌───┐ ┌───┐ ┌──────┐ ┌───┐ ┌───┐ ┌─────┐
│ fn │ │ main │ │ ( │ │ ) │ │ : │ │ void │ │ { │ │ } │ │ EOF │
└────┘ └──────┘ └───┘ └───┘ └───┘ └──────┘ └───┘ └───┘ └─────┘

The above function is called main, but it could have any other name, like foo or bar. What these names have in common is that each of them uniquely identifies the given function, so the token that represents such a name is called the Identifier token.

enum class TokenKind : char {
  Identifier
};

fn and void are reserved keywords, which means they are only allowed to have one meaning each. fn can only mean the beginning of a function definition, while void can only mean the void type. Programmers are not allowed to create variables or functions called fn or void.

Each keyword gets its own token so that it's easy to differentiate between them.

enum class TokenKind : char {
  ...

  KwFn,
  KwVoid,
};

The textual representation of each keyword is also mapped to the TokenKind that represents it.

const std::unordered_map<std::string_view, TokenKind> keywords = {
    {"fn", TokenKind::KwFn}, {"void", TokenKind::KwVoid}};
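
For illustration, a lookup in this map could be exercised with the following hypothetical test function, which is not part of the compiler and assumes <cassert> is included:

// Illustrative only: a keyword's text maps to its dedicated TokenKind.
void testKeywordLookup() {
  assert(keywords.at("fn") == TokenKind::KwFn);
  assert(keywords.at("void") == TokenKind::KwVoid);
}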

The rest of the tokens, including EOF, are each composed of a single character. To make creating them easier, these characters are placed into an array, and the enumerator value of each corresponding token is the ASCII code of its character.

constexpr char singleCharTokens[] = {'\0', '(', ')', '{', '}', ':'};
                  
enum class TokenKind : char {
  ...

  Eof = singleCharTokens[0],
  Lpar = singleCharTokens[1],
  Rpar = singleCharTokens[2],
  Lbrace = singleCharTokens[3],
  Rbrace = singleCharTokens[4],
  Colon = singleCharTokens[5],
};
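
Because each enumerator's value is the ASCII code of its character, converting between the two is a simple cast. The following compile-time checks are purely illustrative and not part of the compiler:

// Illustrative only: a character and its TokenKind are interchangeable.
static_assert(static_cast<char>(TokenKind::Lpar) == '(');
static_assert(static_cast<TokenKind>('{') == TokenKind::Lbrace);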

A developer might write something in the source code that cannot be represented by any of the known tokens. In such cases an Unk token is used, which represents every unknown piece of source code.

enum class TokenKind : char {
  Unk = -128,
  ...
};

To work with tokens more effectively, the compiler stores not only their kind but also their source location and, in some cases, their textual value. There is no need to store the textual value of tokens like KwVoid or KwFn, as their text is already known, but for an Identifier the compiler might want to use that information later.

struct Token {
  SourceLocation location;
  TokenKind kind;
  std::optional<std::string> value = std::nullopt;
};

The SourceLocation of the token is described by the path of the source file along with the line and column index of the token in that file.

struct SourceLocation {
  std::string_view filepath;
  int line;
  int col;
};
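
Putting the two structs together, the main identifier from the earlier figure could be represented roughly like this; the file name and indices are made up for illustration:

// Hypothetical token instance: "main" starting at line 1, column 4 of the
// made-up file main.program.
Token identifier{SourceLocation{"main.program", 1, 4}, TokenKind::Identifier,
                 "main"};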

The Lexer

The lexer is the part of the compiler that is responsible for producing the tokens. It iterates over a source file character by character and does its best to select the correct token for each piece of code.

Within the compiler, a source file is represented by its path and a buffer filled with its content.

struct SourceFile {
  std::string_view path;
  std::string buffer;
};

Within the buffer, the lexer always points to the character that is to be processed next, while maintaining the correct line and column indices as it traverses the buffer. Because initially none of the characters has been processed, the lexer points to the first character of the buffer and starts at line 1, column 0, or in other words, before the first character of the first line. The next Token is returned on demand by the getNextToken() method.

class Lexer {
  const SourceFile *source;
  size_t idx = 0;

  int line = 1;
  int column = 0;

  public:
  explicit Lexer(const SourceFile &source) : source(&source) {}
  Token getNextToken();
};

To make the processing of the file more convenient, the peekNextChar() and eatNextChar() helper methods are introduced. The former returns the character that is to be processed next, while the latter returns that character and advances the lexer past it, keeping the line and column positions up to date along the way.

class Lexer {
  ...

  // std::string guarantees that buffer[buffer.size()] is '\0', so peeking
  // at the end of the buffer yields the character for the Eof token.
  char peekNextChar() const { return source->buffer[idx]; }
  char eatNextChar() {
    ++column;

    // If the character being consumed is a line feed, the location moves to
    // the next line, before its first character.
    if (source->buffer[idx] == '\n') {
      ++line;
      column = 0;
    }

    return source->buffer[idx++];
  }

  ...
};

The following figure showcases how the lexer processes a file. Notice that the current source location is always 1 step behind the character the lexer points to.

'^' marks the character to be processed next
'v' marks the source location the lexer is currently standing on

  v {line: 1, column: 0}
      ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
      │ f │ │ n │ │   │ │ m │ │ a │ │ i │ │ n │ ...
      └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
        ^

eatNextChar()

  v {line: 1, column: 1}
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ f │ │ n │ │   │ │ m │ │ a │ │ i │ │ n │ │ ( │ ...
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
        ^
    
eatNextChar()

  v {line: 1, column: 2}
┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐ ┌───┐
│ n │ │   │ │ m │ │ a │ │ i │ │ n │ │ ( │ │ ) │ ...
└───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘ └───┘
        ^

The logic for creating the tokens lives in the getNextToken() method. When invoked, it skips every whitespace character and stores the location of the first non-whitespace character, which is the beginning of the token.

bool isSpace(char c) {
  return c == ' ' || c == '\f' || c == '\n' || c == '\r' || c == '\t' ||
         c == '\v';
}

Token Lexer::getNextToken() {
  char currentChar = eatNextChar();

  while (isSpace(currentChar))
    currentChar = eatNextChar();

  SourceLocation tokenStartLocation{source->path, line, column};

  ...
}

A for loop iterates over the single-character tokens array, and if the current character matches one of them, the corresponding token is returned. This is the benefit of storing the characters in an array and giving each TokenKind the ASCII code of its character as its value: the kind of the token can immediately be produced with a simple cast.

Token Lexer::getNextToken() {
  ...

  for (auto &&c : singleCharTokens)
    if (c == currentChar)
      return Token{tokenStartLocation, static_cast<TokenKind>(c)};

  ...
}

Design Note

In production-grade compilers, single-character tokens are usually handled using hardcoded branches, as that will lead to the fastest running code in general.

if (currentChar == '(')
  return Token{tokenStartLocation, TokenKind::Lpar};

if (currentChar == ')')
  return Token{tokenStartLocation, TokenKind::Rpar};

if (currentChar == '{')
  return Token{tokenStartLocation, TokenKind::Lbrace};

if (currentChar == '}')
  return Token{tokenStartLocation, TokenKind::Rbrace};

if (currentChar == ':')
  return Token{tokenStartLocation, TokenKind::Colon};

if (currentChar == '\0')
  return Token{tokenStartLocation, TokenKind::Eof};

In this compiler, however, the goal is to use a representation that requires as little boilerplate code as possible to implement and extend.

Handling Comments

To the compiler, comments don't carry any important information, so the lexer skips them.

// this is a comment

A comment starts with a '/' followed by another '/' and lasts until the end of the line. This is a case where peekNextChar() comes in handy, as the lexer needs to look one character ahead.

If the lexer is at the beginning of a comment, it eats all the characters before the end of the current line or the end of the file and calls getNextToken() recursively to return the first token after the comment.

Token Lexer::getNextToken() {
  ...

  if (currentChar == '/' && peekNextChar() == '/') {
    while (peekNextChar() != '\n' && peekNextChar() != '\0')
      eatNextChar();

    return getNextToken();
  }

  ...
}

Design Note

While comments are not important for this compiler, other compilers that convert one language to another (e.g.: Java to Kotlin) or formatting tools do need to know about them. In such cases, the lexer might return a dedicated Comment token with the contents of the comment.
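
Such a lexer might handle the branch shown earlier along these lines; this is only a sketch, assuming a hypothetical TokenKind::Comment enumerator that this compiler does not define:

// Hypothetical variant: instead of skipping the comment, collect its text
// and return it as a dedicated token.
if (currentChar == '/' && peekNextChar() == '/') {
  std::string text{currentChar};

  // The second '/' and the rest of the line become part of the token text.
  while (peekNextChar() != '\n' && peekNextChar() != '\0')
    text += eatNextChar();

  return Token{tokenStartLocation, TokenKind::Comment, std::move(text)};
}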

Identifiers and Keywords

Identifiers consist of multiple characters and match the pattern (a-z|A-Z)(a-z|A-Z|0-9)*. Initially, keywords are also lexed as identifiers, but their corresponding TokenKind is then looked up from the map, and the correct token representing them is returned.

bool isAlpha(char c) {
  return ('a' <= c && c <= 'z') || ('A' <= c && c <= 'Z');
}
bool isNum(char c) { return '0' <= c && c <= '9'; }
bool isAlnum(char c) { return isAlpha(c) || isNum(c); }

Token Lexer::getNextToken() {
  ...

  if (isAlpha(currentChar)) {
    std::string value{currentChar};

    while (isAlnum(peekNextChar()))
      value += eatNextChar();

    if (keywords.count(value))
      return Token{tokenStartLocation, keywords.at(value), std::move(value)};

    return Token{tokenStartLocation, TokenKind::Identifier, std::move(value)};
  }

  ...
}

Notice how isSpace, isAlpha, etc. are custom functions, even though the C++ standard library also provides std::isspace, std::isalpha, etc.

The standard functions depend on the current locale, so if, for example, 'a' is not considered alphabetic in the current locale, the lexer would no longer work as expected.
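
They are also easy to misuse: passing them a plain char holding a negative value is undefined behavior, so the argument must be cast through unsigned char first. A hypothetical standard-library-based variant (assuming <cctype> is included) would look like this:

// Hypothetical variant built on the standard library; the cast avoids
// undefined behavior for char values that are negative on this platform.
bool isAlphaStd(char c) {
  return std::isalpha(static_cast<unsigned char>(c)) != 0;
}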

If none of the above conditions matches the current character and the end of the function is reached, the lexer can't figure out which token represents the piece of code starting at the current character, so an Unk token is returned.

Token Lexer::getNextToken() {
  ...

  return Token{tokenStartLocation, TokenKind::Unk};
}
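
Finally, to see the lexer in action, here is a minimal sketch of a driver that tokenizes a hardcoded buffer and prints each token's kind. It assumes all the pieces above live in one translation unit with <iostream> included; the file name main.program is made up for the example.

// Hypothetical driver: lex a hardcoded buffer until the Eof token appears.
int main() {
  SourceFile file{"main.program", "fn main(): void {}"};
  Lexer lexer(file);

  Token token = lexer.getNextToken();
  while (token.kind != TokenKind::Eof) {
    // Print the kind's numeric value and the textual value, if any.
    std::cout << static_cast<int>(token.kind) << ' '
              << token.value.value_or("") << '\n';
    token = lexer.getNextToken();
  }
}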