The driver is an often-forgotten part of the compiler that many developers are unaware of. It is the component that prepares the environment in which the compiler runs. Its responsibilities include validating the input files, providing the compiler with the settings required for compilation, connecting the different parts of the pipeline, and cleaning up artifacts such as temporary files created during compilation.
Many developers believe that when they compile a source file with a well-known compiler such as Clang, they invoke the compiler itself. In reality, they invoke the driver, which in turn invokes the compiler for them.
$ clang++ main.cpp
The above command invokes the Clang driver, not the compiler. To see the actual invocations the driver would perform, the -### option can be passed to it, which prints the commands without running them.
$ clang++ -### main.cpp
... clang version 14.0.0 ...
/usr/lib/llvm-14/bin/clang ... -o /tmp/main-7cd060.o -x c++ main.cpp
/usr/bin/ld ... -o a.out /tmp/main-7cd060.o -lstdc++
The output shows that two separate invocations are performed in the background: one for the actual compiler, which produces an object file, and one for the linker, which links this object file against the necessary libraries and produces the executable.
Since the compiler is invoked from the command line, it is important to have a user-friendly command line interface. Upon request, the driver prints a description of how to use it.
void displayHelp() {
  std::cout << "Usage:\n"
            << "  compiler [options] <source_file>\n\n"
            << "Options:\n"
            << "  -h           display this message\n"
            << "  -o <file>    write executable to <file>\n"
            << "  -ast-dump    print the abstract syntax tree\n"
            << "  -res-dump    print the resolved syntax tree\n"
            << "  -llvm-dump   print the llvm module\n";
}
Since the language does not support communication across source files, the compiler takes exactly one source file as a parameter. The options make it possible to display the help message, change the name of the output executable, or print the various intermediate representations.
The different compiler options are provided through command line arguments, so the arguments need to be parsed first.
The parsed arguments are stored inside the CompilerOptions record to make passing them around easier.
struct CompilerOptions {
  std::filesystem::path source;
  std::filesystem::path output;
  bool displayHelp = false;
  bool astDump = false;
  bool resDump = false;
  bool llvmDump = false;
};
The parseArguments() function iterates through the command line arguments and populates an instance of CompilerOptions. By convention, argv[0] is the command that is used to invoke the program, so the argument parser only has to check the arguments starting with argv[1].
CompilerOptions parseArguments(int argc, const char **argv) {
  CompilerOptions options;
  int idx = 1;
  while (idx < argc) {
    std::string_view arg = argv[idx];
    ...
    ++idx;
  }
  return options;
}
The first argument without a leading - symbol is assumed to be the source file. If a source file has already been recorded when such an argument is encountered, an error is reported.
CompilerOptions parseArguments(int argc, const char **argv) {
  ...
  while (idx < argc) {
    ...
    if (arg[0] != '-') {
      if (!options.source.empty())
        error("unexpected argument '" + std::string(arg) + '\'');
      options.source = arg;
    }
    ...
  }
  ...
}
Arguments beginning with a - symbol are assumed to be options. Any option that the driver does not recognize is reported as an error.
CompilerOptions parseArguments(int argc, const char **argv) {
  ...
  while (idx < argc) {
    ...
    else {
      if (arg == "-h")
        options.displayHelp = true;
      else if (arg == "-o")
        options.output = ++idx >= argc ? "" : argv[idx];
      else if (arg == "-ast-dump")
        options.astDump = true;
      else if (arg == "-res-dump")
        options.resDump = true;
      else if (arg == "-llvm-dump")
        options.llvmDump = true;
      else
        error("unexpected option '" + std::string(arg) + '\'');
    }
    ...
  }
  ...
}
The only special option is -o, because it is expected to be followed by another argument that specifies the name of the output executable. The user might, however, forget to pass this argument after the option. To avoid reading past the end of the argument list, the argument parser checks whether there is one more argument after the option. If there is none, the name of the output executable is left as the empty string, which is also its default value. Otherwise, the following argument is treated as the output name.
else if (arg == "-o")
  options.output = ++idx >= argc ? "" : argv[idx];
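The fragments above can be put together into one self-contained function. The following is a minimal sketch under two assumptions that are not part of the text: errors are reported by throwing std::invalid_argument rather than by calling error(), so the parser can be exercised without terminating the process, and an empty argument is treated as a positional argument so that arg[0] is never read from an empty string.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>
#include <string_view>

// Mirrors the CompilerOptions record from the text.
struct CompilerOptions {
  std::filesystem::path source;
  std::filesystem::path output;
  bool displayHelp = false;
  bool astDump = false;
  bool resDump = false;
  bool llvmDump = false;
};

// Assumption: failures throw std::invalid_argument instead of calling the
// driver's error() function, which would exit the process.
CompilerOptions parseArguments(int argc, const char **argv) {
  CompilerOptions options;

  // argv[0] is the program name, so parsing starts at argv[1].
  for (int idx = 1; idx < argc; ++idx) {
    std::string_view arg = argv[idx];

    if (arg.empty() || arg[0] != '-') {
      // Positional argument: only one source file is accepted.
      if (!options.source.empty())
        throw std::invalid_argument("unexpected argument '" +
                                    std::string(arg) + '\'');
      options.source = arg;
    } else if (arg == "-h") {
      options.displayHelp = true;
    } else if (arg == "-o") {
      // Consume the next argument as the output name, or fall back to the
      // empty string when -o is the last argument.
      options.output = ++idx >= argc ? "" : argv[idx];
    } else if (arg == "-ast-dump") {
      options.astDump = true;
    } else if (arg == "-res-dump") {
      options.resDump = true;
    } else if (arg == "-llvm-dump") {
      options.llvmDump = true;
    } else {
      throw std::invalid_argument("unexpected option '" + std::string(arg) +
                                  '\'');
    }
  }

  return options;
}
```

For the arguments compiler main.yl -o out -ast-dump, this sketch records main.yl as the source, out as the output, and sets only the astDump flag.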
If any error is encountered within the driver, it displays a message and exits immediately.
[[noreturn]] void error(std::string_view msg) {
  std::cerr << "error: " << msg << '\n';
  std::exit(1);
}
Once the options have been parsed successfully, they have to be validated. If the user asks for the help message, it is displayed and the driver exits.
int main(int argc, const char **argv) {
  CompilerOptions options = parseArguments(argc, argv);
  if (options.displayHelp) {
    displayHelp();
    return 0;
  }
  ...
}
If a source file is not specified, has an unexpected extension, or cannot be opened, the driver exits with an error. Since this language is your language, the source files are expected to have the .yl extension.
int main(int argc, const char **argv) {
  ...
  if (options.source.empty())
    error("no source file specified");
  if (options.source.extension() != ".yl")
    error("unexpected source file extension");
  std::ifstream file(options.source);
  if (!file)
    error("failed to open '" + options.source.string() + '\'');
  ...
}
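The three checks above can also be expressed as a single function that returns a message instead of exiting, which makes them easier to exercise in isolation. This is only a sketch: validateSource is a hypothetical helper that does not appear in the driver itself, and returning std::optional is an assumption made for illustration.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>

// Hypothetical helper: returns the error message the driver would report for
// the given source path, or std::nullopt if the path passes all checks.
std::optional<std::string>
validateSource(const std::filesystem::path &source) {
  if (source.empty())
    return "no source file specified";

  if (source.extension() != ".yl")
    return "unexpected source file extension";

  // Opening the file is the only check that touches the file system.
  std::ifstream file(source);
  if (!file)
    return "failed to open '" + source.string() + '\'';

  return std::nullopt;
}
```

The checks run in the same order as in the driver, so a missing file with the wrong extension is reported as an extension error rather than as a failure to open it.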
After successfully opening the file, the driver reads its content and starts passing it through the compilation pipeline.
int main(int argc, const char **argv) {
  ...
  std::stringstream buffer;
  buffer << file.rdbuf();
  SourceFile sourceFile = {options.source.c_str(), buffer.str()};
  Lexer lexer(sourceFile);
  Parser parser(lexer);
  ...
}
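The way the file content is read can be tried out separately from the lexer and the parser. The sketch below isolates that step in a hypothetical readFile helper, which is an illustration rather than a function of the driver, and signals a file that cannot be opened with std::nullopt.

```cpp
#include <filesystem>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical helper: reads the whole file into a string, or returns
// std::nullopt if the file cannot be opened.
std::optional<std::string> readFile(const std::filesystem::path &path) {
  std::ifstream file(path);
  if (!file)
    return std::nullopt;

  // Streaming the file's buffer copies the entire content in one step.
  std::stringstream buffer;
  buffer << file.rdbuf();
  return buffer.str();
}
```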
The parser returns the AST and an indicator of whether the AST is complete. If the -ast-dump option was specified, the AST is printed and the driver exits. Otherwise, if the AST is incomplete, the compilation cannot continue.
int main(int argc, const char **argv) {
  ...
  auto [ast, success] = parser.parseSourceFile();
  if (options.astDump) {
    for (auto &&fn : ast)
      fn->dump();
    return 0;
  }
  if (!success)
    return 1;
  ...
}
If the AST is valid, Sema can be instantiated and the AST can be resolved. If the -res-dump flag was specified, the resolved tree is printed and the driver exits. Otherwise, if the resolution fails, which is indicated by an empty resolved tree, the driver exits with an error code.
int main(int argc, const char **argv) {
  ...
  Sema sema(std::move(ast));
  auto resolvedTree = sema.resolveAST();
  if (options.resDump) {
    for (auto &&fn : resolvedTree)
      fn->dump();
    return 0;
  }
  if (resolvedTree.empty())
    return 1;
  ...
}
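Parsing and resolution follow the same pattern: a requested dump takes priority over the success check, and only a successful stage lets the pipeline continue. The sketch below captures just that decision. The afterStage function is a hypothetical helper introduced for illustration, and the actual printing is assumed to be done by the caller.

```cpp
#include <optional>

// Hypothetical helper: returns the exit code if the driver should stop after
// a stage, or std::nullopt if it should move on to the next stage.
std::optional<int> afterStage(bool dumpRequested, bool stageSucceeded) {
  // A dump ends the run with exit code 0, even if the stage itself failed.
  if (dumpRequested)
    return 0;

  // Without a dump, a failed stage stops the pipeline with exit code 1.
  if (!stageSucceeded)
    return 1;

  return std::nullopt;
}
```

One consequence of this ordering is that a run with -ast-dump or -res-dump exits with 0 even when the source contains errors.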
If AST resolution succeeds, the LLVM IR can be generated from the resolved tree. If the -llvm-dump flag is specified, the module is dumped and the driver exits.
int main(int argc, const char **argv) {
  ...
  Codegen codegen(std::move(resolvedTree), options.source.c_str());
  llvm::Module *llvmIR = codegen.generateIR();
  if (options.llvmDump) {
    llvmIR->dump();
    return 0;
  }
  ...
}
To generate the executable, the module first has to be stored in a temporary file. The name of this temporary file is derived from the hash of the source file path. By convention, an LLVM IR file has the .ll extension.
int main(int argc, const char **argv) {
  ...
  std::stringstream path;
  path << "tmp-" << std::filesystem::hash_value(options.source) << ".ll";
  const std::string &llvmIRPath = path.str();
  std::error_code errorCode;
  llvm::raw_fd_ostream f(llvmIRPath, errorCode);
  llvmIR->print(f, nullptr);
  ...
}
The reason for choosing the hash of the file path as the temporary file name instead of a shorter name like tmp.ll is that if, for example, a build system wants to compile multiple source files in the same folder at the same time, these tmp.ll files would overwrite each other.
Theoretically, the temporaries could still overwrite each other if the same source file is compiled in the same folder at the same time, though in that case their content is identical.
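The naming scheme can be tried out on its own. The following sketch wraps it in a hypothetical tempIRPath function, which is not part of the driver, to show that the name depends only on the path that was passed in.

```cpp
#include <filesystem>
#include <sstream>
#include <string>

// Hypothetical helper: derives the temporary IR file name from the source
// path in the same way as the driver does.
std::string tempIRPath(const std::filesystem::path &source) {
  std::stringstream path;
  path << "tmp-" << std::filesystem::hash_value(source) << ".ll";
  return path.str();
}
```

Equal paths always produce the same name. Note that the hash is computed from the path as it was written on the command line, so two spellings of the same file, such as a.yl and ./a.yl, are different paths and may map to different temporary names.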
After writing the IR to a file, it gets passed to Clang to turn it into a native executable.
int main(int argc, const char **argv) {
  ...
  std::stringstream command;
  command << "clang " << llvmIRPath;
  if (!options.output.empty())
    command << " -o " << options.output;
  int ret = std::system(command.str().c_str());
  ...
}
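To see what the command string looks like before it is handed to std::system(), the construction can be separated from the execution. The sketch below does this in a hypothetical buildClangCommand function, which is an illustration and not part of the driver.

```cpp
#include <filesystem>
#include <sstream>
#include <string>

// Hypothetical helper: builds the command line the driver would run,
// without executing it.
std::string buildClangCommand(const std::string &llvmIRPath,
                              const std::filesystem::path &output) {
  std::stringstream command;
  command << "clang " << llvmIRPath;

  // Streaming a std::filesystem::path writes it wrapped in double quotes,
  // so an output name containing spaces stays a single shell argument.
  if (!output.empty())
    command << " -o " << output;

  return command.str();
}
```

With an empty output name, the -o flag is omitted entirely and Clang picks its default output name. The IR path is streamed as a plain string and is therefore not quoted, which is fine for the generated tmp-<hash>.ll names.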
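Finally, the temporary IR file is cleaned up and the driver exits with a status that reflects whether Clang succeeded. The value returned by std::system() is implementation-defined (on POSIX systems it is a wait status rather than the plain exit code), so it is mapped to 0 or 1 instead of being returned from main() directly.
int main(int argc, const char **argv) {
  ...
  std::filesystem::remove(llvmIRPath);
  // std::system() does not necessarily return Clang's exit code itself.
  return ret == 0 ? 0 : 1;
}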