The LLVM optimization engine

Now that we are somewhat familiar with the LLVM IR, we can dig into some of the optimizations that LLVM can perform on it.

The LLVM optimization engine contains a series of independent passes, which can be run with the opt tool in any order and any number of times. This makes it very easy to experiment with different optimizations or sequences of optimizations and see how they interact.

A comprehensive description of how passes work is provided in the official LLVM documentation. Although the passes can be run independently, they do influence one another (we will see some examples of this later on). Furthermore, some passes don't perform any transformations, but instead analyze the input code. The results of these analyses are used by other passes in order to make better decisions. Each pass can specify the analyses that it requires, and even other transformation passes that might create more opportunities for it; these will be run before it. This means that when you run opt with a single pass, several passes are actually scheduled. You can see them by using the -debug-pass=Structure option.

We will look at what some of these optimization passes can do, starting with the LLVM vectorizer.

Vectorization

Many modern architectures provide some sort of SIMD (Single Instruction Multiple Data) instructions, such as SSE on x86 or NEON on ARM. These instructions manipulate large registers that contain several items of the same type (for instance, 4 or 8 integer values). The data types that fit into these registers are known as vector types.

In LLVM IR, a vector is represented by the number of elements it contains and the type of those elements, enclosed in angle brackets; for instance, <4 x float> represents a vector consisting of 4 float elements.

Most traditional programming languages don’t offer support for vectors - they are usually handled by means of intrinsics or compiler extensions, which damage application portability. Therefore, it usually falls upon the compiler to identify potential locations where vectors can be used. This is what LLVM’s vectorization passes do. There are three distinct passes: one that does loop vectorization, which merges several consecutive iterations of a loop into one vector iteration, and two that perform SLP (Super-word Level Parallelism) vectorization, by merging similar independent scalar instructions.
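
To get a feel for what the loop vectorizer looks for, here is a made-up example (the function name and the width of 4 are purely illustrative):

/* The iterations are independent, so the loop vectorizer can
 * typically rewrite this loop to process 4 floats per iteration,
 * using IR values of type <4 x float> instead of plain float
 * (possibly guarded by runtime checks that the arrays don't
 * overlap). */
void add_arrays(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}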

Dead code elimination

Dead code elimination is in charge of removing code that is either unreachable (meaning that it is never executed on any possible execution path) or useless (meaning that it does not influence the results of the program in any way).

Getting rid of dead code is important not only for reducing the size of the program, but also because it may change the results of different analyses, thus allowing further optimizations to take place.
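
As a made-up illustration, the following function contains both kinds of dead code:

int example(int x) {
    int unused = x * 2;   /* useless: the result never influences the output */
    if (0)
        x = x + 1;        /* unreachable: the branch is never taken */
    return x;
}

/* Note: at -O0 local variables live in memory, so running -mem2reg
 * first (as in Task 2 below) exposes more dead instructions to -dce. */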

Loop unrolling

Loop unrolling is an optimization that replicates the body of a loop a number of times, leading to fewer iterations (and thus fewer branches, which is a good thing on many architectures). Furthermore, since the loop body now contains more instructions, the backend will have more freedom when scheduling them.

Loop unrolling should be handled with care though, because it usually leads to an increase in the code size. Even if you have plenty of memory around, this can still be a bad thing, because the body of the loop may no longer fit into the cache, thus leading to substantial slowdowns. In addition, the body of the loop may now require more registers than are available, leading to registers being spilled into memory.
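
As a sketch, here is a simple made-up loop and what unrolling it by a factor of 4 roughly produces:

/* Original: 100 iterations, one backward branch per iteration. */
int sum100(const int *a) {
    int sum = 0;
    for (int i = 0; i < 100; i++)
        sum += a[i];
    return sum;
}

/* Unrolled by 4: a quarter of the iterations and branches, and a
 * larger body that gives the instruction scheduler more freedom. */
int sum100_unrolled(const int *a) {
    int sum = 0;
    for (int i = 0; i < 100; i += 4)
        sum += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    return sum;
}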

A related optimization is loop rerolling, which has the opposite effect: it rewrites an unrolled loop back into its compact form, reducing code size.

Loop invariant code motion

The purpose of the loop invariant code motion (LICM) optimization is to move as much code as possible outside of loops. Depending on the case, the code may be moved before the loop (into what is called the preheader block) or after the loop. The former usually happens for code that computes values which are used within the loop but do not change between iterations. The latter is usually seen when only the value computed by the last iteration is visible outside the loop: even if the computations still have to be performed for each iteration, the stores to memory do not.
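
A hypothetical sketch of hoisting (not necessarily what normalize.c from Task 4 contains):

/* Before LICM: hi - lo is invariant but recomputed every iteration. */
void scale(int *a, int n, int lo, int hi) {
    for (int i = 0; i < n; i++)
        a[i] = (a[i] - lo) / (hi - lo);
}

/* After LICM: the invariant computation is hoisted into the
 * preheader, before the loop. */
void scale_licm(int *a, int n, int lo, int hi) {
    int range = hi - lo;   /* hoisted out of the loop */
    for (int i = 0; i < n; i++)
        a[i] = (a[i] - lo) / range;
}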

Loop unswitching

Loop unswitching is a transformation which moves branches that have loop-invariant conditions outside of the loop. Two versions of the loop are then produced - one for when the branch is taken, and one for the other case. The effect is that the branch is taken only once, before the loop, as opposed to once every iteration as in the original case. Furthermore, the control flow of the loop body is simplified, allowing other optimizations to understand and manipulate it better.
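
A minimal made-up example; flag never changes inside the loop, so the branch on it is loop-invariant:

/* Before unswitching: the branch runs on every iteration. */
void acc(int *a, const int *b, int n, int flag) {
    for (int i = 0; i < n; i++) {
        if (flag)
            a[i] += b[i];
        else
            a[i] -= b[i];
    }
}

/* After unswitching: the test runs once, and each copy of the loop
 * has straight-line control flow. */
void acc_unswitched(int *a, const int *b, int n, int flag) {
    if (flag)
        for (int i = 0; i < n; i++)
            a[i] += b[i];
    else
        for (int i = 0; i < n; i++)
            a[i] -= b[i];
}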

Inlining

Function calls incur some overhead (mostly due to stack and register management). However, modern software practices demand a level of abstraction that usually leads to a large number of very small functions - and thus a potentially large overhead from function calls. This overhead can be reduced by inlining some of the functions, i.e. replacing each call with the actual body of the callee. This also lays the groundwork for a wealth of other optimizations, which would otherwise not look beyond function boundaries.

Note, however, that too much inlining often leads to significant increases in code size, which brings its own disadvantages (such as increased pressure on the instruction cache). Inlining is therefore usually performed only for small functions, or for functions that are called from only a few places.
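
A made-up sketch of the transformation:

/* Before inlining: two calls to a small helper. */
static int square(int x) { return x * x; }
int sum_of_squares(int a, int b) { return square(a) + square(b); }

/* After inlining square: the call overhead is gone, and the whole
 * expression becomes visible to other optimizations. */
int sum_of_squares_inlined(int a, int b) { return a * a + b * b; }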

Tail recursion elimination

Tail recursion elimination is a special case of tail call elimination. This transformation targets calls that happen just before the return from a function. Since the caller does nothing after the call finishes, its stack frame can be reused. In the case of tail recursion, this means that the call can be replaced with a branch towards the beginning of the function. In effect, this replaces recursion with iteration.
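
As a sketch (the factorial used in Task 7 may differ), here is a tail-recursive function and the loop it effectively becomes:

/* The recursive call is the last action, so the stack frame can be
 * reused: the call turns into a branch back to the top. */
int fact(int n, int acc) {
    if (n <= 1)
        return acc;
    return fact(n - 1, n * acc);   /* tail call */
}

/* What tail recursion elimination effectively produces: */
int fact_loop(int n, int acc) {
    while (n > 1) {
        acc = n * acc;
        n = n - 1;
    }
    return acc;
}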

Although in most languages this is considered to be an optional optimization, for some functional languages (such as Scheme) it is mandated by the standard.

Reminders

To obtain an LLVM IR file you can use

clang -S -emit-llvm file.c -o file.ll

To invoke the optimizer

opt -S -O3 file.ll -o file_o3.ll
opt -S -loop-vectorize file.ll -o file_v.ll

To plot the CFG

opt -dot-cfg file.ll # Will generate a .dot file for each function
xdot cfg.<function>.dot

If you want to compare different versions of the CFG it is recommended to rename your .dot files so they don't get overwritten.

Other interesting opt flags:

  • -debug-pass=Structure shows all the passes that were run
  • -print-after-all prints the IR after each pass

Download the snippets used in the tasks below.

Task 1

Compare the -O0 and -O3 versions of vectorization/prod.c. Try running with different vector widths by using -force-vector-width=<vector_width>.

Task 2

Generate the LLVM IR for dead-code/sum.c. Run dead code elimination by specifying -dce to opt. Use the -mem2reg option to promote memory references to register references.
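
For instance, the two passes can be combined in one invocation (file names are just examples):

opt -S -mem2reg -dce sum.ll -o sum_dce.ll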

Task 3

Generate the LLVM IR for loop-unrolling/week.c. Run loop unrolling by specifying -loop-unroll to opt. Enable loop rotate by adding -loop-rotate to the previous command. Try with different unroll factors by using -unroll-count=<unroll_factor>. Remove redundant branches by running CFG simplification (-simplifycfg).
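
As an example, the passes can be chained in a single invocation (file names and the unroll factor are just examples):

opt -S -loop-rotate -loop-unroll -unroll-count=4 -simplifycfg week.ll -o week_unrolled.ll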

Task 4

Generate the LLVM IR for loop-invariant/normalize.c. Run loop invariant code motion by specifying -licm to opt.

Task 5

Generate the LLVM IR for loop-unswitching/accumulate.c. Run loop unswitching by specifying -loop-unswitch to opt. Run again with CFG simplification enabled.

Task 6

Generate the LLVM IR for inlining/sort.c. Enable inlining by specifying -inline to opt.

Task 7

Generate the LLVM IR for tail-elim/factorial.c. Enable tail call elimination by specifying -tailcallelim to opt.
