Pagini
Workshops
Parteneri
A comprehensive description of the LLVM IR can be found here. We will focus on the human-readable form of the IR, which has many things in common with assembly languages (notice for instance that the control flow is achieved by means of branch instructions) but still maintains a lot of high-level information (notice that all values are typed).
The code is organized as a series of basic blocks, i.e. regions of code that are not interrupted by branches. This format is widely known as a Control Flow Graph (CFG). You can visualize the CFG using xdot - for this you will have to use:
opt -dot-cfg-only file.ll # or opt -dot-cfg file.ll
This will produce a .dot file for every function in the input, which you can then plot by running
xdot cfg.<function>.dot
The -dot-cfg-only flag omits the contents of the basic blocks, showing only the general structure of the function. Each basic block is a node, whereas the edges represent potential execution paths.
If you look at the contents of the basic blocks, you will see a lot of instructions that should be familiar from any assembly language (load, store, add, icmp, br etc). The exact syntax for each of these instructions is explained in the Language Reference. We will explain some of the less intuitive instructions later on.
Before that, a few words about the type system are in order.
The LLVM IR is heavily annotated with type information, and the instructions have very precise requirements with regard to the types of their operands. The types accepted by LLVM include:
A possibly surprising thing about the type system is that the integers are neither signed nor unsigned - instead, the instructions that manipulate them can be signed or unsigned. For instance, the icmp instruction can be used with the slt (signed less than) or the ult (unsigned less than) flag. Not all the instructions that work with integers need to make this distinction - for instance, addition is the same regardless of the sign of the operands thanks to the fact that LLVM uses a two’s complement representation for integers.
If you have compiled the code at -O0, you have probably noticed a lot of alloca instructions. This instruction allocates space on the stack. Clang generates one alloca instruction for each function parameter and local variable - however these tend to disappear at -O3.
Another instruction that may seem confusing is getelementptr - this instruction is so often misunderstood that there exists a special FAQ page dedicated to it . The instruction takes a base pointer and a series of indices and computes an address based on them - without accessing the memory. It can index through arrays and structures (in this case the index would correspond to the structure field), but not through any nested pointers, because those would have to get loaded from memory. So, for instance,
%element.addr = getelementptr {[10 x i64], i32 }* %MyVar, i64 0, i32 0, i64 3
computes the address of the 4th member of the array field of the structure pointed to by %MyVar (quite a mouthful). The first index, 0, indicates that we want the first structure pointed to by %MyVar (think of it as a MyVar[0] in C, which is equivalent to *MyVar). The second index shows the structure field that we wish to access (the array, i.e. the first field). The last index indicates the array member that we want.
If, on the other hand, %MyVar pointed to a structure that contained a pointer instead of an array, we could not compute the address of its fourth member, because that would require a memory access:
%p.addr = getelementptr {i64*, i32}* %MyVar, i64 0, i32 0 %p = load i64** %p.addr %element.addr = getelementptr i64* %p, i64 3
With all this said, we now reach the most unusual instruction in the LLVM IR: the phi instruction, which you may have noticed at the beginning of certain basic blocks. This instruction does not resemble anything found in traditional processor architecture - instead, it is a consequence of a very fundamental property of the LLVM IR: the fact that it is in Static Single Assignment (SSA) form.
The main property of the SSA representation is that each variable is assigned exactly once. This simplifies data flow analysis a great deal because each different value has a different name. For instance, consider the following code snippet:
x = some_value; a = 2 * x; x = x + k; b = 2 * x;
An optimization that would like to compute 2 * x only once and assign it to both a and b would have to first check that the value of x does not change between the two assignments. Let’s see the same snippet in SSA form:
x_1 = some_value; a = 2 * x_1; x_2 = x_1 + k; b = 2 * x_2;
In SSA, x cannot be assigned twice, so a new variable is created when the second assignment is encountered. Now we can tell that a and b take different values without performing any kind of data flow analysis - this information is embedded in the structure of the IR.
The fact that variables can only be assigned once has several implications. Consider what happens when we assign to the same variable on different control flow paths:
if (condition) x = v * 2; else x = u - 1; a = x + 4;
The rule that we cannot assign twice to x still holds, so we will have two variables:
if (condition) x_1 = v * 2; else x_2 = u - 1; a = x_?! + 4;
Obviously the value of a will depend on x_1 if condition holds and on x_2 otherwise. In the SSA form, this is expressed by means of the phi instruction:
if (condition) x_1 = v * 2; else x_2 = u - 1; x_3 = phi (x_1, x_2); a = x_3 + 4;
The phi instruction practically selects a value based on the execution path that was taken at runtime. Note that this does not correspond to any instruction that is usually found in hardware architectures.
In LLVM, phi instructions are easy to follow because their arguments are expressed as pairs containing the value that should be assigned as well as the label identifying the corresponding basic block. Phi instructions can only appear at the beginning of a basic block. You should check out a few simple examples to familiarize yourself with this instruction.
Another implication of the SSA form is related to global variables. Non-constant global variables are usually modified during the execution of the program, potentially across different functions, and obviously creating a new variable for each assignment would not be very lucrative in this case. Therefore, global variables are all represented as pointers, and their values are always manipulated through load/store instructions.
There is a lot of theory out there regarding the SSA form, its varieties and the algorithms used for generating it, maintaining it and getting out of it. In fact, there is even an upcoming book dedicated exclusively to it.
Plot the CFGs for several examples and try to understand where each basic block comes from (you may try -O0 vs -O3 to see if there are any changes). In order to obtain an IR file, you can use
clang -S -emit-llvm file.c -o file.ll
Note that Clang tends to generate a lot of unnecessary alloca instructions, which don’t really help with readability. You can get rid of these by running
opt -S -mem2reg file.ll -o file.ll
You can use xdot for plotting the CFGs:
opt -dot-cfg file.ll xdot cfg.<function>.dot
You can start with the examples that we used for the AST. You can also use these.