Register Allocation

  mov t1, [t2 + 16]

The assembly code generation phase produced abstract assembly code in which the temporaries in the intermediate representation were simply mapped directly to pseudo-registers of the same name. For example, code generation might produce an abstract assembly instruction like the one above.

  mov rax, [rbp - 48]
  mov rcx, [rax + 16]
  mov [rbp - 40], rcx

We've seen that we can generate runnable assembly code by allocating temporaries in stack locations, and using real registers to move data off the stack, run each instruction, and then put the temporary values back onto the stack. This is the way that many compilers operate when optimization is turned off. For example, if temporary t1 were allocated at stack location [rbp - 40] and t2 were allocated at [rbp - 48], then the code above could be generated.

The goal of register allocation is to store temporaries directly in registers rather than on the stack. This is typically the most valuable optimization a compiler can perform, reducing the number of instructions executed, the amount of memory traffic, and the size of the code. For example, if temporary t1 were allocated to register rdx and t2 to register rcx, then the resulting code would be much more efficient:

  mov rdx, [rcx + 16]

Using live variable information

Register allocation is easier and more effective on modern machines with large register sets, but the supply of registers is quite finite. The x86-64 instruction set has about 15 general-purpose registers; modern RISC architectures typically have about twice as many. It is crucial to use registers only to store the values of variables or temporaries that really need to be stored. Therefore, the first step in any reasonably good register allocation optimization is live variable analysis, which we have discussed earlier.

For example, consider the following sequence of abstract assembly instructions, where the names a–d stand for temporaries. The right column shows the set of live variables, including both temporaries and real machine registers, after each instruction (and before the next one).

Instruction      Live-out
                 a
mov b, a         b, a
add b, 2         b, a
imul b, b        b, a
mov c, b         c, a
lea b, [c+1]     b, a
mov d, a         b, d 
imul d, b        d
mov rax, d       rax
ret
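The Live-out column can be reproduced with a backward liveness pass over the straight-line code. The sketch below assumes a hypothetical encoding in which each instruction is a (defs, uses) pair of sets; the function name is illustrative:

```python
def live_out_sets(instructions):
    """Backward pass for straight-line code:
    live_in(i) = uses(i) ∪ (live_out(i) − defs(i)), and
    live_out(i) = live_in(i+1)."""
    n = len(instructions)
    live = [set() for _ in range(n + 1)]   # live[i] = live before instruction i
    for i in range(n - 1, -1, -1):
        defs, uses = instructions[i]
        live[i] = uses | (live[i + 1] - defs)
    return live[1:]                        # live-out of each instruction

# The example sequence, encoded as (defs, uses) pairs:
code = [
    ({"b"}, {"a"}),        # mov b, a
    ({"b"}, {"b"}),        # add b, 2
    ({"b"}, {"b"}),        # imul b, b
    ({"c"}, {"b"}),        # mov c, b
    ({"b"}, {"c"}),        # lea b, [c+1]
    ({"d"}, {"a"}),        # mov d, a
    ({"d"}, {"d", "b"}),   # imul d, b
    ({"rax"}, {"d"}),      # mov rax, d
    (set(), {"rax"}),      # ret
]
print(live_out_sets(code))  # matches the Live-out column above
```

General code would iterate this computation to a fixed point over a control-flow graph; for straight-line code a single backward pass suffices.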

Note that register rax is live because it holds the return result for the ret instruction. At most two variables are live at any one time, suggesting that we might be able to convert this to code that uses only two registers. The following mapping shows that this conversion is possible: a ↦ rax, b ↦ rdx, c ↦ rdx, d ↦ rax, resulting in the code below. Notice that because variables a and d are both allocated to the register rax, and variables b and c are both allocated to rdx, three mov instructions become superfluous and can be removed, as indicated. This optimization is called move coalescing. It is important because it lets the code generation phase of the compiler stay simple: code generation may freely introduce extra variables and moves, relying on move coalescing to remove them later.

mov rdx, rax
add rdx, 2
imul rdx, rdx
mov rdx, rdx    ; superfluous (removed)
lea rdx, [rdx + 1]
mov rax, rax    ; superfluous (removed)
imul rax, rdx
mov rax, rax    ; superfluous (removed)
ret

In general, it will not be possible to allocate registers to all variables, since the number of live variables at some program points may be too large. Code with a large number of live variables is said to have high register pressure.

Linear-Scan Register Allocation

High-quality register allocation tends to be expensive, but a simple register allocation algorithm can achieve most of the performance of the more complex algorithms: linear-scan register allocation, developed by Poletto and Sarkar [1]. The algorithm is simple and fast, which makes it particularly suitable for just-in-time compilation of the sort needed by many modern languages. In that setting, compilation is done while the program is running, so a more sophisticated register allocation algorithm may be too expensive to use.

As with the more sophisticated algorithms, linear-scan allocation starts with a live variable analysis. Register allocation is then done on the straight-line code of each function. Each variable (that is, temporary) is live at a subset of locations in the code—its live range; however, to ensure that each variable is assigned to a single register, the variable is treated as live everywhere from the first place it is live to the last. This set of locations is the variable's live interval.
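Collapsing a live range into a single live interval can be sketched as follows, assuming a hypothetical representation in which live_sets[i] is the set of variables live at program point i:

```python
def live_intervals(live_sets):
    """Map each variable to the interval from the first point where it is
    live to the last, even if it is dead somewhere in between."""
    intervals = {}
    for point, live in enumerate(live_sets):
        for var in live:
            start, _ = intervals.get(var, (point, point))
            intervals[var] = (start, point)
    return intervals

# b is dead at point 3, but its live interval still spans points 0-5:
print(live_intervals([{"b"}, {"b"}, {"b"}, set(), {"b"}, {"b"}]))
# {'b': (0, 5)}
```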

We start the algorithm by sorting the live intervals in order of their first instruction. We keep a second, initially empty, list of the “active” live intervals, which is sorted in order of the intervals' last instruction. Active live intervals correspond to variables to which a register is currently allocated. We loop through the first list, effectively sweeping forward through the instruction sequence. As each interval is considered, any other, active interval whose last instruction has been passed is removed from the active list and its assigned register is made available. The considered interval is then allocated a free register and is added to the active list.

When a live interval begins with a mov instruction, we look for the opportunity to do move coalescing: if the source is a variable whose live interval ends at this instruction, the destination is assigned the same register as the source.

If no register is available to be allocated to a variable, one of the variables is chosen to be spilled. Poletto and Sarkar spill the active interval that ends furthest in the future, as a heuristic to reduce register pressure. They report that there is not much difference between this method and a more complex one in which the active variable with the least usage is spilled.
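The whole scan, including the spill heuristic, can be sketched as follows. This is a minimal illustration, assuming intervals are given as a {var: (start, end)} dict and ignoring precolored registers and move coalescing:

```python
import heapq

def linear_scan(intervals, registers):
    """Allocate registers to live intervals; return (allocation, spilled)."""
    free = list(registers)
    active = []                      # heap of (end, var) for allocated intervals
    alloc, spilled = {}, set()
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1]):
        # Expire active intervals whose last instruction has been passed.
        while active and active[0][0] < start:
            _, old = heapq.heappop(active)
            free.append(alloc[old])
        if free:
            alloc[var] = free.pop()
            heapq.heappush(active, (end, var))
        else:
            # Poletto-Sarkar heuristic: spill the interval that ends last.
            last_end, victim = max(active)
            if last_end > end:
                active.remove((last_end, victim))
                heapq.heapify(active)
                alloc[var] = alloc.pop(victim)   # steal the victim's register
                spilled.add(victim)
                heapq.heappush(active, (end, var))
            else:
                spilled.add(var)
    return alloc, spilled

# With one register, the short-lived b evicts the long-lived a:
print(linear_scan({"a": (0, 9), "b": (1, 2)}, ["r1"]))
```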

Some instructions are bound to specific architectural registers, which adds complexity to the algorithm. For example, the ret instruction in the example effectively uses the rax register. A simple way to deal with this issue is to preallocate any such architectural registers so they are not allocated to variables.

Instruction      a b c d   Action
                 ┬         a ↦ rdx
mov b, a         │ ┬       b ↦ rcx
add b, 2         │ │
imul b, b        │ │
mov c, b         │ │ ┬     c ↦ rdi
lea b, [c+1]     │ │ ┴     free rdi
mov d, a         ┴ │   ┬   d ↦ rdi, free rdx
imul d, b          ┴   │   free rcx
mov rax, d             ┴   free rdi
ret

To see the algorithm in action, consider the code above, whose live intervals are shown in the diagram. Note that the live interval for b is expanded to include an instruction that is not in the live range for b. The actions that the algorithm takes as it scans downward through the live intervals are shown in the right column. The algorithm manages to register-allocate this code using three registers, plus the preallocated rax register.

Poletto and Sarkar report that this fast, simple algorithm usually achieves performance within 12% of optimal register allocation on a machine with 31 general-purpose registers, which means that most of the benefit of register allocation has been achieved. However, the benefit of a more complex register allocation algorithm may be larger on machines with fewer registers.

Interference Graphs

We can do a more optimal job of allocating registers if we think about the problem more abstractly. If two variables are present in the same live-out set, they cannot be allocated to the same register. We say that two such variables interfere with each other. Additionally, if a variable is defined (written) by an instruction, it interferes with any other variable in the live-out set.

For example, the code above generates an interference graph in which a interferes with b and c, and b additionally interferes with d.

The interference relationships between variables constitute an interference graph, an undirected graph in which the nodes are variables and the edges represent interference between the connected variables. A valid register allocation is one in which each node of the graph is assigned a register in such a way that no two connected nodes are assigned to the same register. But this is just the problem of graph coloring! The question of whether it is possible to register-allocate code using \(k\) registers is just the problem of whether its interference graph can be colored using \(k\) colors. Chaitin et al. [2] first showed that graph coloring can be the basis of an effective register allocation algorithm.
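Constructing the graph from liveness information can be sketched as follows; the encoding of each instruction as a (defs, live-out) pair and the function name are illustrative:

```python
def interference_edges(def_live_pairs):
    """Each definition interferes with every other variable that is live
    after the defining instruction."""
    edges = set()
    for defs, live_out in def_live_pairs:
        for d in defs:
            edges |= {frozenset((d, v)) for v in live_out if v != d}
    return edges

# (defs, live-out) for each instruction of the example code:
pairs = [({"b"}, {"a", "b"}),    # mov b, a
         ({"b"}, {"a", "b"}),    # add b, 2
         ({"b"}, {"a", "b"}),    # imul b, b
         ({"c"}, {"a", "c"}),    # mov c, b
         ({"b"}, {"a", "b"}),    # lea b, [c+1]
         ({"d"}, {"b", "d"}),    # mov d, a
         ({"d"}, {"d"}),         # imul d, b
         ({"rax"}, {"rax"})]     # mov rax, d
print(sorted(sorted(e) for e in interference_edges(pairs)))
# [['a', 'b'], ['a', 'c'], ['b', 'd']]
```

A real allocator refines this for mov instructions: the destination of a move is not made to interfere with its source, which is what leaves move-related pairs such as d and a free to be coalesced.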

Of course, graph coloring is an NP-complete problem, but in this setting, solving it is practical; because we can spill registers, the problem does not need to be decided precisely, and further, some simple heuristics are effective at achieving a near-optimal coloring.

Graph Coloring and Kempe's Heuristic

Kempe developed (in 1879) a useful heuristic for graph coloring. Suppose that there are \(k\) colors. We call any graph node whose degree is less than \(k\) a low-degree node; conversely, a node with at least \(k\) neighbors is high-degree. If a graph has a low-degree node, then we can reduce the problem of coloring it to the problem of coloring a simpler graph: the same graph, but with that node and all its edges removed. Once the simpler graph is colored, a color must remain available for the removed node, since its fewer than \(k\) neighbors cannot use all \(k\) colors. Since removing the node reduces the degree of the nodes it is connected to, the removal can create new low-degree nodes in the simpler graph. The process can be iterated, pushing removed nodes onto a stack. If we manage to reduce the graph to one with no more than \(k\) nodes, it is immediately colorable by assigning all the nodes different colors; then, the nodes on the stack are guaranteed to be colorable in turn as they are popped off.

The diagram below illustrates the use of Kempe's heuristic on a simple graph with \(k=3\) colors. Two nodes are removed and pushed onto a stack, reducing the graph to one of three nodes that can be colored immediately. Then the removed nodes are popped from the stack and colored.

Optimistic coloring

In general, of course, Kempe's heuristic fails, because a graph may have no low-degree nodes. This does not mean that the graph cannot be colored, however. It may be possible, or alternatively impossible, as the following diagrams show.

Two graphs with no low-degree nodes (\(k = 3\)). The left one can still be colored, but the right one cannot.

Optimistic coloring sometimes salvages the situation when there are no low-degree nodes. The idea is that when there is no low-degree node, we simply choose a high-degree node optimistically, remove it from the graph, and push it onto the stack. After coloring the rest of the graph, we hope that enough of the node's neighbors share colors that there is still a legal color choice for this node. Trying to reuse colors as much as possible in the rest of the graph obviously increases the chance of success.

Of course, sometimes optimism is not justified. In that case, we spill the optimistically removed node: its value is kept in a stack location in memory rather than in a register.
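The combination of Kempe's heuristic and optimistic coloring can be sketched as follows. This is an illustrative version, assuming the graph is an adjacency dict; it always removes a minimum-degree node, which picks a high-degree node (optimistically) only when no low-degree node exists:

```python
def color_graph(graph, k):
    """Kempe simplification with optimistic coloring.
    Returns (coloring, spilled); colors are integers 0..k-1."""
    graph = {n: set(nbrs) for n, nbrs in graph.items()}   # work on a copy
    stack = []
    while graph:
        # Remove a minimum-degree node; if it is high-degree, this is an
        # optimistic removal that may fail to color later.
        node = min(graph, key=lambda n: len(graph[n]))
        stack.append((node, graph.pop(node)))
        for nbrs in graph.values():
            nbrs.discard(node)
    coloring, spilled = {}, set()
    while stack:
        node, nbrs = stack.pop()
        used = {coloring[n] for n in nbrs if n in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[node] = free[0]    # reuse the lowest available color
        else:
            spilled.add(node)           # optimism failed: spill this node
    return coloring, spilled
```

On a triangle with \(k = 3\) this colors every node; on the complete graph \(K_4\) (no low-degree nodes for \(k = 3\)) it colors three nodes and spills the optimistically removed one.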

Spilling to Stack

A simple way to handle spilled variables, whether using linear scan allocation or using graph coloring, is to reserve three registers for accessing stack locations. The code is rewritten in the usual way to use these registers, as necessary, to move variable data between the stack and any instruction that needs it. However, this approach can use up valuable registers unnecessarily.

A more precise approach is to rewrite the code around instructions that access spilled variables. The code is rewritten to move data between the stack and the instruction, but instead of using a reserved register, a fresh temporary variable is generated to do the move. A different fresh temporary is generated for each instruction that uses a spilled variable. Because these temporaries have very short live ranges, they tend to interfere less with other variables.

For example, consider the abstract assembly instruction add t2, t1. Suppose that temporary t2 is to be spilled to stack location [rbp-16]. We generate a fresh temporary—say, t99—and replace the use of t2 with t99. The following code results:

  mov t99, [rbp-16]
  add t99, t1
  mov [rbp-16], t99
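The rewriting step can be sketched as follows. Instructions are plain strings here, and the fresh-name generator and function name are hypothetical; a real implementation would match operands precisely rather than using string replacement, and would omit the load for a write-only use or the store for a read-only use:

```python
import itertools

fresh_names = (f"t{n}" for n in itertools.count(99))

def rewrite_spill(instr, var, slot):
    """Rewrite one instruction that reads and writes spilled `var`, stored
    at stack location `slot`, using a fresh short-lived temporary."""
    t = next(fresh_names)
    return [f"mov {t}, {slot}",       # load the spilled value
            instr.replace(var, t),    # run the instruction on the fresh temp
            f"mov {slot}, {t}"]       # store the updated value back

print(rewrite_spill("add t2, t1", "t2", "[rbp-16]"))
# ['mov t99, [rbp-16]', 'add t99, t1', 'mov [rbp-16], t99']
```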

Having rewritten the code in this way, we have to recompute liveness. Since the new variables are used in such a simple way, liveness can be recomputed incrementally from the previous solution.

In the hypothetical limit where all variables must be spilled, at most three temporaries are live at any instruction. In this case, register allocation devolves to the naive algorithm in which all variables are spilled and three registers are used to access the stack.

Notice that there is no point in spilling the freshly introduced temporaries, since they were introduced to handle a spill in the first place. So they should never be selected for spilling, and should be marked as such when they are created.

Architectural considerations

Precolored nodes

Some instructions are tied to particular registers. For example, the single-argument mul instruction always puts its result into registers rax (low 64 bits) and rdx (high 64 bits). The jrcxz instruction reads the register rcx, as does the loop instruction. To handle these instructions during graph coloring, we add precolored nodes to the interference graph. These nodes correspond exactly to the machine registers. Since they already have colors, they are never removed by the algorithm above.

Function Calls

Functions are permitted to destroy caller-save registers, so variables that are live across function calls are treated as interfering with caller-save registers. Therefore, it makes sense to use callee-save registers for such variables, but prefer caller-save registers for variables that are not live across a function call.

Function calls at the assembly level also factor into live variable analysis. Registers used to pass arguments are live-in to call instructions, and registers used to return results are live-in to ret. Since the number of arguments and results depends on the function, the precise set of live-in registers does too.

Move Coalescing

The price of keeping code generation simple is that many extra temporaries are generated, along with often superfluous mov instructions to move data between them. Obviously, we would like to eliminate such instructions if possible, which can be done through move coalescing. The core intuition is simple: given an instruction like mov t1, t2, we allocate temporaries t1 and t2 to the same machine register. The instruction then becomes unnecessary and can be removed.

Two variables are move-related if they are connected by a move but do not interfere with each other; that is, one is the source and the other the destination of the same mov instruction. In this case they can potentially be merged into a single variable and allocated to the same location.

Effect of merging move-related nodes on the interference graph

Conservative Coalescing

Merging nodes changes the interference graph as the figure above suggests. The danger of merging nodes is that merging can turn low-degree nodes into high-degree nodes (with degree at least \(k\)), preventing successful coloring. Hence, compilers usually perform conservative coalescing, in which nodes are merged only when doing so cannot cause a spill. An obvious but overly conservative approach would be to avoid creating any high-degree nodes. A still conservative but more flexible approach is due to Briggs: merging is performed as long as the resulting merged node has fewer than \(k\) high-degree neighbors. The rationale is that once the low-degree neighbors of the merged node are removed via Kempe's heuristic, the merged node will itself be low-degree. Therefore merging is safe.
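The Briggs test can be sketched as follows, assuming the interference graph is an adjacency dict {node: set of neighbors}; the function name is illustrative:

```python
def briggs_ok(graph, u, v, k):
    """True if merging non-interfering nodes u and v is conservatively safe
    by the Briggs criterion: the merged node would have fewer than k
    neighbors of high degree."""
    merged_nbrs = (graph[u] | graph[v]) - {u, v}
    high_degree = 0
    for n in merged_nbrs:
        # n's degree after the merge: u and v collapse into a single neighbor.
        if len(graph[n] - {u, v}) + 1 >= k:
            high_degree += 1
    return high_degree < k
```

For example, with \(k = 2\), merging two nodes whose only neighbors are distinct degree-1 nodes passes the test, while merging two nodes that share two mutually connected neighbors fails it.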

To reduce the memory footprint of the program, it is also possible for two spilled variables to share a stack location. This also helps avoid unnecessary moves between spilled variables. Because the number of stack locations is unbounded, this form of move coalescing does not need to be done in a conservative way.

Graph coloring with coalescing

To summarize register allocation using graph coloring with move coalescing, it works as follows:

  1. First, we perform live variable analysis, construct the interference graph, and identify move-related nodes.
  2. Next, we remove low-degree nodes that are not move-related from the graph and push them onto the stack.
  3. At this point, all remaining low-degree nodes are move-related. We conservatively coalesce them, rerunning step 2 after nodes are merged.
  4. All nodes are now high-degree or move-related, but the move-related nodes cannot be conservatively coalesced. We pick a low-degree move-related node and “freeze” it, giving up on coalescing it. This frees it up to be pushed onto the stack and potentially enables other nodes to be conservatively coalesced.
  5. Now any remaining nodes are high-degree. We use optimistic coloring: we choose one as a potential spill, push it onto the stack and remove it from the graph, and repeat all steps starting from step 2.
  6. At this point all nodes are on the stack. We pop them from the stack, assigning colors while preferring caller-save registers. Nodes pushed onto the stack optimistically may turn out not to be colorable; in that case they are spilled to the stack. If we are using the rewriting approach to spilling, we then go back to step 1 with the rewritten code, which requires at least some additional analysis work.

In practice, this algorithm is much slower than linear-scan allocation, but it performs near-optimal register allocation. For example, when applied to our code above, d and a are move-related, and d and rax are also move-related. The algorithm therefore coalesces all three nodes, assigning both d and a the register rax. Variables b and c do not interfere, so they can be assigned to the same register (rcx, say). As shown above, the resulting code uses only two registers and has three fewer mov instructions than the original code.

Chaitin's algorithm has a number of corner cases to worry about; Appel's textbook [4] has a more detailed description of the algorithm, including pseudocode, that may be helpful.

References

  1. Massimiliano Poletto and Vivek Sarkar. Linear Scan Register Allocation. ACM Transactions on Programming Languages and Systems, 21(5), September 1999.
  2. Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register Allocation Via Coloring. Computer Languages, 6(1), 1981.
  3. Gregory J. Chaitin. Register Allocation & Spilling Via Graph Coloring. ACM Sigplan Notices, 17(6), pp. 98–101, 1982.
  4. Andrew Appel. Modern Compiler Implementation in Java, Chapter 11.