Parsing as an interface to LLVM
August 30, 2014  

LLVM contains a useful collection of tools for compiler writers. The tools operate on an SSA-based intermediate representation (IR), and they implement all sorts of compiler analyses and optimizations (e.g., constant propagation, function inlining, strength reduction). Furthermore, LLVM includes back ends for many machine architectures (x86, ARM, etc.).

The only problem is that LLVM is hard to work with for anyone outside of the core development team. For example, you will have a difficult time if you want to use LLVM from any language other than C++, LLVM’s native language. LLVM does contain foreign function interfaces for C, Python, and Ocaml, but they are incomplete and poorly maintained.

You will have difficulties even if you program in C++. LLVM is on a rapid development cycle, so, different OS distributions often have different versions of LLVM, and the LLVM API changes between versions. Any separately distributed project must therefore detect and cope with different LLVM versions. (See here for a discussion of this problem with regards to using LLVM as the back end for GPU shading languages.)

The best solution that I’ve found is to parse the intermediate representation. LLVM was designed from the start to be modular—the various optimizations are transformations that take the IR as input, and produce IR as output. Compilation is just the process of stringing together sequences of transformations.

Crucially, LLVM can dump the IR to a file, in either a machine-readable bitcode, or a human-readable assembly language. This file format is supposed to be self-contained, so it is complete (no missing features), and, as the IR is central to the whole conception of LLVM, it is well-maintained.

The problem, then, is to develope a parser for LLVM IR. LLVM itself has a parser, but it is a hand-written parser and it produces an abstract syntax tree via the internal API, which is what we were trying to avoid in the first place. It would be much better to have a separate language specification in terms of a grammar.

Ironically, LLVM was originally developed using standard lexer and parser generators (bison and flex), but this approach was abandoned in LLVM 2.5, mainly to make it more portable (for Microsoft Windows).

Fortunately, the language was once parseable with standard tools, and, it turns out, it still is. I’ve built a lexer and parser for LLVM IR for one of my projects. They are written in ocamllex and ocamlyacc, which are variants of the standard lex and yacc tools, so it should be easy to port them to whatever language you like. Furthermore, the grammar has no shift/reduce or reduce/reduce conflicts. The actual LLVM parser will reject some programs accepted by the grammar, but that is because it performs quite a few semantic checks. These can be incorporated into the grammar-based parser through semantic actions.

My approach, then, is to write compiler passes that consume LLVM IR via parsing, and produce LLVM IR via pretty-printing. This nicely accommodates languages other than C++, and it improves on using the LLVM API from C++ because internal changes to LLVM data structures and methods often have no effect on the printed LLVM IR, or they do not affect the IR for the particular programs you are interested in. Furthermore, you don’t have to link your program with LLVM, or compile against its header files to use my method.

Finally, the approach is easily testable. Once you have a parser and a pretty printer, you can test your system by checking that parsing and printing produces an output identical to the input. And inputs are easy to come by, as they are produced by the ordinary operation of LLVM.