The Compiler Driver

The driver is an often overlooked part of the compiler that many developers are unaware of. It is the component that prepares the environment in which the compiler runs. Its responsibilities include validating the input files, providing the compiler with the settings required for compilation, connecting the different parts of the pipeline, and cleaning up artifacts such as the temporary files needed during compilation.

Many developers believe that when they compile a source file with a well-known compiler such as Clang, they invoke the compiler, but in reality, they invoke the driver, which invokes the compiler for them at a later point.

$ clang++ main.cpp

The above command invokes the Clang driver and not the compiler. To see the actual invocations, the -### option can be passed to the driver, which then prints the commands it would run without executing them.

$ clang++ -### main.cpp
... clang version 14.0.0 ...
/usr/lib/llvm-14/bin/clang ... -o /tmp/main-7cd060.o -x c++ main.cpp
/usr/bin/ld ... -o a.out /tmp/main-7cd060.o -lstdc++

The output shows that two separate invocations are performed in the background: one for the actual compiler, which produces an object file, and one for the linker, which links this object file against the necessary libraries and produces the executable.

Command Line Interface

Since the compiler is invoked from the command line, it is important to have a user-friendly command line interface. Upon request, the user can be provided with a description of how it works.

void displayHelp() {
  std::cout << "Usage:\n"
            << "  compiler [options] <source_file>\n\n"
            << "Options:\n"
            << "  -h           display this message\n"
            << "  -o <file>    write executable to <file>\n"
            << "  -ast-dump    print the abstract syntax tree\n"
            << "  -res-dump    print the resolved syntax tree\n"
            << "  -llvm-dump   print the llvm module\n";
}

Since cross-source file communication is not supported, the compiler takes only one source file as a parameter. Its options display the help message, change the name of the output executable, or print one of the intermediate representations.

Argument Parsing

The different compiler options are provided through command line arguments, so to handle them, the arguments need to be parsed first.

The parsed arguments are stored inside the CompilerOptions record, to make passing them around easier.

struct CompilerOptions {
  std::filesystem::path source;
  std::filesystem::path output;
  bool displayHelp = false;
  bool astDump = false;
  bool resDump = false;
  bool llvmDump = false;
};

The parseArguments() function iterates through the command line arguments and populates an instance of CompilerOptions. By convention, argv[0] is the command used to invoke the program, so the argument parser only has to check the arguments starting with argv[1].

CompilerOptions parseArguments(int argc, const char **argv) {
  CompilerOptions options;

  int idx = 1;
  while (idx < argc) {
    std::string_view arg = argv[idx];

    ...

    ++idx;
  }

  return options;
}

The first argument without a leading - symbol is assumed to be the source file. If a source file has already been recorded when such an argument is encountered, an error is reported.

CompilerOptions parseArguments(int argc, const char **argv) {
  ...

  while (idx < argc) {
    ...

    // An empty argument has no first character to inspect.
    if (arg.empty() || arg[0] != '-') {
      if (!options.source.empty())
        error("unexpected argument '" + std::string(arg) + '\'');

      options.source = arg;
    }

    ...
  }

  ...
}

Arguments beginning with a - symbol are assumed to be options. Any option that is not recognized is reported as an error.

CompilerOptions parseArguments(int argc, const char **argv) {
  ...

  while (idx < argc) {
    ...

    else {
      if (arg == "-h")
        options.displayHelp = true;
      else if (arg == "-o")
        options.output = ++idx >= argc ? "" : argv[idx];
      else if (arg == "-ast-dump")
        options.astDump = true;
      else if (arg == "-res-dump")
        options.resDump = true;
      else if (arg == "-llvm-dump")
        options.llvmDump = true;
      else
        error("unexpected option '" + std::string(arg) + '\'');
    }

    ...
  }

  ...
}

The only special option is -o because it is expected to be followed by another argument that specifies the name of the output executable. It might happen, however, that the user forgets to pass this argument after the option.

To avoid reading past the end of the argument list, the parser checks whether one more argument follows the option. If none does, the name of the output executable is set to the empty string, which is also its default value. Otherwise, the following argument is treated as the output name.

else if (arg == "-o")
  options.output = ++idx >= argc ? "" : argv[idx];
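
The fallback can be isolated into a small helper to make the edge case easier to see. This is only a sketch: takeOptionValue is a hypothetical name that does not appear in the driver, and it returns a std::string_view instead of assigning to options.output directly.

```cpp
#include <string_view>

// Hypothetical helper: returns the argument that follows argv[idx], or an
// empty view when the option is the last argument. idx is advanced either way,
// mirroring the `++idx >= argc ? "" : argv[idx]` expression.
std::string_view takeOptionValue(int &idx, int argc, const char **argv) {
  return ++idx >= argc ? std::string_view{} : std::string_view{argv[idx]};
}
```

An empty output name is not an error in this driver. Later on, the -o flag is simply left out of the Clang invocation when the name is empty.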

If any error is encountered within the driver, it displays a message and exits immediately.

[[noreturn]] void error(std::string_view msg) {
  std::cerr << "error: " << msg << '\n';
  std::exit(1);
}
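
Put together, the fragments above might form a parser along the following lines. This is a sketch assembled from the snippets, not the definitive implementation. Two details are assumptions: a for loop replaces the while loop, and an empty argument is treated as a positional one so that arg[0] is never read on an empty view.

```cpp
#include <cstdlib>
#include <filesystem>
#include <iostream>
#include <string>
#include <string_view>

struct CompilerOptions {
  std::filesystem::path source;
  std::filesystem::path output;
  bool displayHelp = false;
  bool astDump = false;
  bool resDump = false;
  bool llvmDump = false;
};

[[noreturn]] void error(std::string_view msg) {
  std::cerr << "error: " << msg << '\n';
  std::exit(1);
}

CompilerOptions parseArguments(int argc, const char **argv) {
  CompilerOptions options;

  // argv[0] is the command itself, so parsing starts at argv[1].
  for (int idx = 1; idx < argc; ++idx) {
    std::string_view arg = argv[idx];

    if (arg.empty() || arg[0] != '-') {
      // Positional argument: only one source file is accepted.
      if (!options.source.empty())
        error("unexpected argument '" + std::string(arg) + '\'');
      options.source = arg;
    } else if (arg == "-h") {
      options.displayHelp = true;
    } else if (arg == "-o") {
      // Consume the next argument as the output name if there is one.
      options.output = ++idx >= argc ? "" : argv[idx];
    } else if (arg == "-ast-dump") {
      options.astDump = true;
    } else if (arg == "-res-dump") {
      options.resDump = true;
    } else if (arg == "-llvm-dump") {
      options.llvmDump = true;
    } else {
      error("unexpected option '" + std::string(arg) + '\'');
    }
  }

  return options;
}
```

With this sketch, the arguments main.yl -o app -ast-dump would produce an options record whose source is main.yl, whose output is app and whose astDump flag is set.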

Setting Up Compilation

After the options are successfully parsed, they have to be validated. If the user asks for the help message, it is displayed and the driver exits.

int main(int argc, const char **argv) {
  CompilerOptions options = parseArguments(argc, argv);

  if (options.displayHelp) {
    displayHelp();
    return 0;
  }

  ...
}

If a source file is not specified, or it cannot be opened, the driver exits with an error. Since this language is your language, the source files are expected to have the .yl extension.

int main(int argc, const char **argv) {
  ...

  if (options.source.empty())
    error("no source file specified");

  if (options.source.extension() != ".yl")
    error("unexpected source file extension");

  std::ifstream file(options.source);
  if (!file)
    error("failed to open '" + options.source.string() + '\'');

  ...
}
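
The two path checks can also be expressed as a function that returns the error message instead of exiting, which makes them easy to try out in isolation. This is a sketch only: checkSourcePath is a hypothetical helper and is not part of the driver.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical helper: returns the error message the driver would report,
// or std::nullopt if the path passes both checks.
std::optional<std::string>
checkSourcePath(const std::filesystem::path &source) {
  if (source.empty())
    return "no source file specified";

  if (source.extension() != ".yl")
    return "unexpected source file extension";

  return std::nullopt;
}
```

Note that std::filesystem::path::extension() does not treat a leading period as the start of an extension, so a file named just .yl is rejected by the second check.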

After successfully opening the file, the driver reads its content and starts passing it through the compilation pipeline.

int main(int argc, const char **argv) {
  ...
  std::stringstream buffer;
  buffer << file.rdbuf();
  SourceFile sourceFile = {options.source.c_str(), buffer.str()};

  Lexer lexer(sourceFile);
  Parser parser(lexer);
  ...
}
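
Opening the file and reading its content can be combined into one step. The following sketch uses the same rdbuf() technique as the driver, wrapped in a hypothetical readFile helper that returns std::nullopt when the file cannot be opened.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: reads the whole file into a string, or returns
// std::nullopt if the file cannot be opened.
std::optional<std::string> readFile(const std::filesystem::path &path) {
  std::ifstream file(path);
  if (!file)
    return std::nullopt;

  // Streaming the underlying buffer copies the entire content at once.
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}
```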

The parser returns the AST and an indicator of whether the AST is complete or not. If the -ast-dump option was specified, the AST is printed and the driver exits. Otherwise, if the AST is incomplete, the compilation cannot continue.

int main(int argc, const char **argv) {
  ...
  auto [ast, success] = parser.parseSourceFile();

  if (options.astDump) {
    for (auto &&fn : ast)
      fn->dump();
    return 0;
  }

  if (!success)
    return 1;
  ...
}

If the AST is valid, Sema can be instantiated and the AST can be resolved. If the -res-dump flag was specified, the resolved tree is printed and the driver exits. Otherwise, if the resolution fails, which is indicated by an empty resolved tree, the driver exits with an error code.

int main(int argc, const char **argv) {
  ...
  Sema sema(std::move(ast));
  auto resolvedTree = sema.resolveAST();

  if (options.resDump) {
    for (auto &&fn : resolvedTree)
      fn->dump();
    return 0;
  }

  if (resolvedTree.empty())
    return 1;
  ...
}

If AST resolution succeeds, the LLVM IR can be generated from the resolved tree. If the -llvm-dump flag is specified, the module is dumped and the driver exits.

int main(int argc, const char **argv) {
  ...
  Codegen codegen(std::move(resolvedTree), options.source.c_str());
  llvm::Module *llvmIR = codegen.generateIR();

  if (options.llvmDump) {
    llvmIR->dump();
    return 0;
  }
  ...
}

To generate the executable, the module first has to be stored in a temporary file. The name of this temporary file is derived from the hash of the source file path. By convention, an LLVM IR file has the .ll extension.

int main(int argc, const char **argv) {
  ...
  std::stringstream path;
  path << "tmp-" << std::filesystem::hash_value(options.source) << ".ll";
  const std::string &llvmIRPath = path.str();

  std::error_code errorCode;
  llvm::raw_fd_ostream f(llvmIRPath, errorCode);
  llvmIR->print(f, nullptr);
  ...
}

The reason for using the hash of the file path as the temporary file name instead of a shorter name like tmp.ll is that if, for example, a build system compiles multiple source files in the same folder at the same time, these tmp.ll files would overwrite each other.

Theoretically, these files could overwrite each other too if the same source file is being compiled in the same folder at the same time, though in that case the content of the temporaries still stays the same.
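
The naming scheme can be sketched as a standalone function. tempIRPath is a hypothetical name introduced for illustration. The body mirrors the driver's use of std::filesystem::hash_value.

```cpp
#include <filesystem>
#include <sstream>
#include <string>

// Hypothetical helper: builds the temporary IR file name for a source path.
// hash_value hashes the path itself, not the content of the file.
std::string tempIRPath(const std::filesystem::path &source) {
  std::stringstream path;
  path << "tmp-" << std::filesystem::hash_value(source) << ".ll";
  return path.str();
}
```

Equal paths are guaranteed to produce equal hashes, so compiling the same path twice yields the same temporary name. Two spellings of the same file, such as main.yl and ./main.yl, are different paths and are not guaranteed to hash to the same value.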

After the IR is written to a file, the file is passed to Clang, which turns it into a native executable.

int main(int argc, const char **argv) {
  ...
  std::stringstream command;
  command << "clang " << llvmIRPath;
  if (!options.output.empty())
    command << " -o " << options.output;

  int ret = std::system(command.str().c_str());
  ...
}
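
To see what string actually reaches the shell, the command construction can be pulled into a function. This is a sketch with a hypothetical name, buildClangCommand, that otherwise follows the snippet above.

```cpp
#include <filesystem>
#include <sstream>
#include <string>

// Hypothetical helper: builds the shell command that hands the IR to Clang.
std::string buildClangCommand(const std::string &llvmIRPath,
                              const std::filesystem::path &output) {
  std::stringstream command;
  command << "clang " << llvmIRPath;

  // Streaming a std::filesystem::path writes it in double quotes.
  if (!output.empty())
    command << " -o " << output;

  return command.str();
}
```

Because the output name is a std::filesystem::path, operator<< writes it quoted, so an output name of app results in the command clang tmp-1.ll -o "app" for an IR file called tmp-1.ll. The IR path is a plain std::string and is inserted as is.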

Finally, the temporary IR file is cleaned up and the driver returns the status that std::system reported for the Clang invocation.

int main(int argc, const char **argv) {
  ...
  std::filesystem::remove(llvmIRPath);

  return ret;
}
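
One caveat is that the value returned by std::system is implementation-defined. On POSIX systems it is a wait status rather than the plain exit code, so returning it from main unchanged may not propagate Clang's exit code as intended. The following sketch assumes a POSIX platform and uses a hypothetical helper, exitCodeFromSystem, to decode the status with the macros from <sys/wait.h>.

```cpp
#include <cstdlib>
#include <sys/wait.h>

// Assumption: POSIX platform. Converts the value returned by std::system
// into an exit code that is suitable for returning from main.
int exitCodeFromSystem(int status) {
  if (status == -1)
    return 1; // the child process could not be created

  if (WIFEXITED(status))
    return WEXITSTATUS(status); // normal termination: the real exit code

  return 1; // terminated by a signal or otherwise abnormally
}
```

Under this assumption, the driver could end with return exitCodeFromSystem(ret); instead of return ret;.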