The assembly code generation phase produced abstract assembly code in which the temporaries in the intermediate representation were simply mapped directly to pseudo-registers of the same name. For example, code generation might produce an abstract assembly instruction like this one:

mov t1, [t2 + 16]
We've seen that we can generate runnable assembly code by allocating temporaries in stack locations, and using real registers to move data off the stack, run each instruction, and then put the temporary values back onto the stack. This is the way that many compilers operate when optimization is turned off. For example, if temporary t1 were allocated at stack location [rbp - 40] and t2 were allocated at [rbp - 48], then the following code could be generated:

mov rax, [rbp - 48]
mov rcx, [rax + 16]
mov [rbp - 40], rcx
The goal of register allocation is to store temporaries directly in registers
rather than on the stack. This optimization is typically the most valuable
optimization a compiler can do, reducing the number of instructions executed,
the amount of memory traffic, and the size of the code. For example, if
temporary t1 were allocated to register rdx and t2 were allocated to rcx, then the resulting code would be much more efficient:

mov rdx, [rcx + 16]
Register allocation is easier and more effective on modern machines with large register sets, but the supply of registers is still quite finite. The x86-64 instruction set has about 15 general-purpose registers; modern RISC architectures typically have about twice as many. It is crucial to use registers only to store the values of variables or temporaries that really need to be stored. Therefore, the first step in any reasonably good register allocation optimization is live variable analysis, which we have discussed earlier.
For example, consider the following sequence of abstract assembly instructions, where the names a–d stand for temporaries. The right column shows the set of live variables, including both temporaries and real machine registers, after each instruction (and before the next one).
Instruction      Live-out
                 a
mov b, a         b, a
add b, 2         b, a
imul b, b        b, a
mov c, b         c, a
lea b, [c+1]     b, a
mov d, a         b, d
imul d, b        d
mov rax, d       rax
ret
Note that register rax is live because it holds the return result for the ret instruction. At most two variables are live at a given time, suggesting that we might be able to convert this to code that uses only two registers. The following mapping shows that this conversion is possible: a ↦ rax, b ↦ rdx, c ↦ rdx, d ↦ rax, resulting in the code below. Notice that because variables a and d are both allocated to the register rax, and variables b and c are both allocated to rdx, three mov instructions become superfluous and can be removed, as indicated. This optimization is called move coalescing. It is an important optimization because it lets the code generation phase of the compiler stay simple, freely introducing extra variables and moves that move coalescing later removes.
mov rdx, rax
add rdx, 2
imul rdx, rdx
mov rdx, rdx    ; superfluous, removed by coalescing
lea rdx, [rdx + 1]
mov rax, rax    ; superfluous, removed by coalescing
imul rax, rdx
mov rax, rax    ; superfluous, removed by coalescing
ret
In general, it will not be possible to allocate registers to all variables, since the number of live variables at some program points may be too large. Code with a large number of live variables is said to have high register pressure.
High-quality register allocation tends to be expensive, but a simple register allocation algorithm can achieve most of the performance of the more complex algorithms: linear-scan register allocation, developed by Poletto and Sarkar [1]. The algorithm is simple and fast, which makes it particularly suitable for just-in-time compilation of the sort needed by many modern languages. In that setting, compilation is done while the program is running, so a more sophisticated register allocation algorithm may be too expensive to use.
As with the more sophisticated algorithms, linear-scan allocation starts with a live variable analysis. Register allocation is then done on the straight-line code of each function. Each variable (that is, temporary) is live at a subset of locations in the code—its live range; however, to ensure that each variable is assigned to a single register, the variable is treated as live everywhere from the first place it is live to the last. This set of locations is the variable's live interval.
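To make the distinction concrete, here is a minimal sketch in Python (the function name and representation are illustrative, not from the paper) that collapses the live sets produced by liveness analysis into live intervals:

def live_intervals(live_sets):
    """Collapse each variable's live range into one live interval.

    live_sets[i] is the set of variables live at location i.  The
    result maps each variable to (start, end), the first and last
    locations at which it is live; any holes in the live range
    are absorbed into the interval.
    """
    intervals = {}
    for i, live in enumerate(live_sets):
        for v in live:
            start, _ = intervals.get(v, (i, i))
            intervals[v] = (start, i)
    return intervals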
We start the algorithm by sorting the live intervals in order of their first instruction. We keep a second, initially empty, list of the “active” live intervals, sorted in order of the intervals' last instruction. Active live intervals correspond to variables to which a register is currently allocated. We loop through the first list, effectively sweeping forward through the instruction sequence. As each interval is considered, any active interval whose last instruction has been passed is removed from the active list, and its assigned register is made available. The considered interval is then allocated a free register and added to the active list.
When the live interval begins with a mov instruction, we look for the opportunity to do move coalescing, by assigning the destination of the instruction the same register as the source in the case where the source is a variable whose live interval ends at this instruction.
If no register is available to be allocated to a variable, one of the variables is chosen to be spilled. Poletto and Sarkar spill the active interval that ends last, as a heuristic to reduce register pressure. They report that there is not much difference between this method and a more complex one in which the active variable with the least usage is spilled.
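Putting the pieces together, the following Python sketch is one plausible rendering of the scan as described above (the names and representation are ours, not Poletto and Sarkar's); it expires finished intervals, allocates free registers, and applies the spill-the-furthest-end heuristic:

import bisect

def linear_scan(intervals, registers):
    """Allocate registers to live intervals {var: (start, end)}.
    Returns (allocation, spilled)."""
    free = list(registers)
    active = []                      # (end, var) pairs, sorted by end
    alloc, spilled = {}, set()
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1]):
        # Expire active intervals whose last instruction has been passed.
        while active and active[0][0] < start:
            _, done = active.pop(0)
            free.append(alloc[done])
        if free:
            alloc[var] = free.pop()
            bisect.insort(active, (end, var))
        else:
            # Spill heuristic: evict the active interval that ends
            # last, unless the current interval ends even later.
            last_end, victim = active[-1]
            if last_end > end:
                alloc[var] = alloc.pop(victim)   # take victim's register
                spilled.add(victim)
                active.pop()
                bisect.insort(active, (end, var))
            else:
                spilled.add(var)
    return alloc, spilled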
Some instructions are bound to specific architectural registers, which adds complexity to the algorithm. For example, the ret instruction in the example effectively uses the rax register. A simple way to deal with this issue is to preallocate any such architectural registers so they are not allocated to variables.
Instruction     a   b   c   d   Action
                ┬               a ↦ rdx
mov b, a        │   ┬           b ↦ rcx
add b, 2        │   │
imul b, b       │   │
mov c, b        │   ┴   ┬       c ↦ rdi
lea b, [c+1]    │   ┬   ┴       free rdi
mov d, a        ┴   │       ┬   d ↦ rdi, free rdx
imul d, b           ┴       │   free rcx
mov rax, d                  ┴   free rdi
ret
To see the algorithm in action, consider the example code from earlier; its live intervals are shown in the table above. Note that the live interval for b is expanded to include an instruction that is not in the live range for b (namely mov c, b, where the table shows a break in b's column). The actions that the algorithm takes as it scans downward through the live intervals are shown in the Action column. The algorithm manages to register-allocate this code using three registers, plus the preallocated rax register.
Poletto and Sarkar report that this fast, simple algorithm usually achieves performance within 12% of optimal register allocation on a machine with 31 general-purpose registers, which means that most of the benefit of register allocation has been achieved. However, the benefit of a more complex register allocation algorithm may be larger on machines with fewer registers.
We can do a better job of allocating registers if we think about the problem more abstractly. If two variables are present in the same live-out set, they cannot be allocated to the same register. We say that two such variables interfere with each other. Additionally, if a variable is defined (written) by an instruction, it interferes with any other variable in the live-out set.
The interference relationships between variables constitute an interference graph, an undirected graph in which the nodes are variables and the edges represent interference between the connected variables. For example, in the code above, a interferes with b and with c, and b interferes with d. A valid register allocation is one in which each node of the graph is assigned a register in such a way that no two connected nodes are assigned the same register. But this is just the problem of graph coloring! The question of whether it is possible to register-allocate code using \(k\) registers is just the problem of whether its interference graph can be colored using \(k\) colors. Chaitin et al. [2] first showed that graph coloring can be the basis of an effective register allocation algorithm.
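As a sketch of how the graph can be built from the results of liveness analysis, here is a Python rendering of the two interference rules above; the representation (each instruction contributes the variable it defines, if any, plus its live-out set) is an assumption for illustration:

def interference_graph(instructions):
    """Build an interference graph as a dict of adjacency sets.
    Each instruction is a pair (defined, live_out)."""
    graph = {}
    def edge(u, v):
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    for defined, live_out in instructions:
        for v in live_out:
            graph.setdefault(v, set())
        live = sorted(live_out)
        # Variables in the same live-out set interfere pairwise.
        for i, u in enumerate(live):
            for v in live[i + 1:]:
                edge(u, v)
        # A defined variable interferes with everything in the
        # live-out set, even if it is not itself live afterward.
        if defined is not None:
            graph.setdefault(defined, set())
            for v in live_out:
                if v != defined:
                    edge(defined, v)
    return graph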
Of course, graph coloring is an NP-complete problem, but in this setting, solving it is practical: because we can spill variables to the stack, the problem does not need to be decided precisely, and further, some simple heuristics are effective at achieving a near-optimal coloring.
Kempe developed (in 1879) a useful heuristic for graph coloring. Suppose that there are \(k\) colors. We call any graph node whose degree is less than \(k\) a low-degree node; conversely, a node with at least \(k\) neighbors is high-degree. If a graph has a low-degree node, then we can reduce the problem of coloring it to the problem of coloring a simpler graph: the same graph, but with that node and all its edges removed. Once the simpler graph is colored, there must be at least one color left for the removed node, since its fewer than \(k\) neighbors cannot use up all \(k\) colors. Since removing the node reduces the degree of the nodes it is connected to, the removal can create new low-degree nodes in the simpler graph. The process can be iterated, pushing removed nodes onto a stack. If we manage to reduce the graph to one with no more than \(k\) nodes, it is immediately colorable by assigning all the nodes different colors; then, the nodes on the stack are guaranteed to be colorable in turn as they are popped off.
The diagram below illustrates the use of Kempe's heuristic on a simple graph with \(k=3\) colors. Two nodes are removed and pushed onto a stack, reducing the graph to one of three nodes that can be colored immediately. Then the removed nodes are popped from the stack and colored.
In general, of course, Kempe's heuristic can fail, because a graph may have no low-degree nodes. This does not mean that the graph cannot be colored, however: coloring may still be possible, or it may be impossible, as the following diagrams show.
Two graphs with no low-degree nodes (\(k = 3\)). The left one can still be colored, but the right one cannot.
Optimistic coloring sometimes salvages the situation when there are no low-degree nodes. The idea is that when there is no low-degree node, we simply choose a high-degree node optimistically, remove it from the graph, and push it onto the stack. After coloring the rest of the graph, we hope that enough of the node's neighbors share colors that there is still a legal color choice for this node. Trying to reuse colors as much as possible in the rest of the graph obviously increases the chance of success.
Of course, sometimes optimism is not justified. In this case, we spill the optimistically removed node to the stack.
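The whole coloring process is short enough to sketch in Python. The choice of which high-degree node to remove optimistically is itself a heuristic; picking the one of minimum degree, as below, is just one plausible option:

def color(graph, k):
    """Color a graph (dict of adjacency sets) with colors 0..k-1
    using Kempe's heuristic plus optimistic coloring.
    Returns (coloring, spilled)."""
    degree = {v: len(ns) for v, ns in graph.items()}
    removed, stack = set(), []
    while len(removed) < len(graph):
        remaining = [v for v in graph if v not in removed]
        # Take a low-degree node if one exists; otherwise this
        # removal is optimistic and the node may later spill.
        node = min(remaining, key=lambda n: degree[n])
        stack.append(node)
        removed.add(node)
        for n in graph[node]:
            if n not in removed:
                degree[n] -= 1
    coloring, spilled = {}, set()
    while stack:
        node = stack.pop()
        used = {coloring[n] for n in graph[node] if n in coloring}
        free = [c for c in range(k) if c not in used]
        if free:
            coloring[node] = free[0]   # reuse the lowest free color
        else:
            spilled.add(node)          # optimism failed: spill
    return coloring, spilled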
A simple way to handle spilled variables, whether using linear scan allocation or using graph coloring, is to reserve three registers for accessing stack locations. The code is rewritten in the usual way to use these registers, as necessary, to move variable data between the stack and any instruction that needs it. However, this approach can use up valuable registers unnecessarily.
A more precise approach is to rewrite the code around instructions that access spilled variables. The code is rewritten to move data between the stack and the instruction, but instead of using a reserved register, a fresh temporary variable is generated to do the move. A different fresh temporary is generated for each instruction that uses a spilled variable. Because these temporaries have very short live ranges, they tend to interfere less with other variables.
For example, consider the abstract assembly instruction add t2, t1. Suppose that temporary t2 is to be spilled to stack location [rbp-16]. We generate a fresh temporary, say t99, and replace the use of t2 with t99. The following code results:

mov t99, [rbp-16]
add t99, t1
mov [rbp-16], t99
Having rewritten the code in this way, we have to recompute liveness. Since the new variables are used in such a simple way, liveness can be recomputed incrementally from the previous solution.
In the hypothetical limit where all variables have to be spilled, at most three temporaries will be live at any instruction. In this case, register allocation devolves to the naive algorithm in which all variables are spilled and three registers are used to access the stack.
Notice that there is no point in spilling the freshly introduced temporaries, since they were introduced to handle a spill in the first place. They should never be selected for spilling, and should be marked as such when they are created.
Some instructions are tied to particular registers. For example, the single-argument mul instruction always puts its result into registers rax (low 64 bits) and rdx (high 64 bits). The jcxz instruction reads from the register rcx, as does the loop instruction. To handle these instructions during graph coloring, we add precolored nodes to the interference graph. These nodes represent the machine registers themselves. Since they already have colors, they cannot be removed by the algorithm above.
Functions are permitted to destroy caller-save registers, so variables that are live across function calls are treated as interfering with caller-save registers. Therefore, it makes sense to use callee-save registers for such variables, but prefer caller-save registers for variables that are not live across a function call.
Function calls at the assembly level also factor into live variable analysis. Registers used to pass arguments are live-in to call instructions, and registers used to return results are live-in to ret. Since the number of arguments and results depends on the function, the precise set of live-in registers does too.
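For instance, under the System V AMD64 calling convention (used on Linux and macOS), a sketch of these live-in sets might look like the following; the function and parameter names are illustrative:

# Integer argument registers in the System V AMD64 convention.
ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]

def live_in_regs(op, nargs=0, has_result=False):
    """Registers live-in to a call or ret instruction."""
    if op == "call":
        # Only the registers actually used to pass arguments.
        return set(ARG_REGS[:min(nargs, len(ARG_REGS))])
    if op == "ret":
        return {"rax"} if has_result else set()
    return set()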
The price of keeping code generation simple is that many extra temporaries are generated, along with often-superfluous mov instructions to move data between them. Obviously, we would like to eliminate such instructions if possible, which can be done through move coalescing. The core intuition is simple: given an instruction like mov t1, t2, we allocate temporaries t1 and t2 to the same machine register. The instruction then becomes unnecessary and can be removed.
Two variables are move-related if they are connected by a move but do not interfere with each other; that is, they are the source and destination of the same move. In this case they can potentially be merged into a single variable and allocated to the same location.
Effect of merging move-related nodes on the interference graph
Merging nodes changes the interference graph as the figure above suggests. The danger of merging nodes is that merging can turn low-degree nodes into high-degree nodes (with degree at least \(k\)), preventing successful coloring. Hence, compilers usually perform conservative coalescing, in which nodes are merged only if the merge cannot cause a spill. An obvious but overly conservative approach would be to avoid creating any high-degree nodes. A still conservative but more flexible approach is due to Briggs: merging is performed as long as the resulting merged node has fewer than \(k\) high-degree neighbors. The rationale is that once the low-degree neighbors of the merged node are removed via Kempe's heuristic, the merged node will itself be low-degree, so merging is safe.
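The Briggs test itself is only a few lines; here is a sketch using the same adjacency-set representation assumed earlier:

def briggs_ok(graph, u, v, k):
    """Briggs's conservative test: allow merging move-related u
    and v only if the merged node would have fewer than k
    high-degree neighbors (degree at least k)."""
    neighbors = (graph[u] | graph[v]) - {u, v}
    high = 0
    for n in neighbors:
        # A neighbor adjacent to both u and v loses an edge when
        # the two are merged into one node.
        d = len(graph[n]) - (1 if u in graph[n] and v in graph[n] else 0)
        if d >= k:
            high += 1
    return high < k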
To reduce the memory footprint of the program, it is also possible for two spilled variables to share a stack location. This also helps avoid unnecessary moves between spilled variables. Because the number of stack locations is unbounded, this form of move coalescing does not need to be done in a conservative way.
To summarize, register allocation using graph coloring with move coalescing works as follows:

1. Compute liveness and build the interference graph, including precolored nodes for the machine registers.
2. Conservatively coalesce move-related nodes that pass the Briggs test.
3. Repeatedly simplify the graph by removing low-degree nodes and pushing them onto a stack; when only high-degree nodes remain, remove one optimistically.
4. Pop nodes from the stack, assigning each a color not used by its neighbors; an optimistically removed node with no available color becomes a spill.
5. If any node spilled, rewrite the code with fresh temporaries for the spilled variables, recompute liveness, and repeat from the first step.
In practice, this algorithm is much slower than linear-scan allocation, but it performs near-optimal register allocation. For example, when applied to our code above, d and a are move-related nodes, and d and rax are also move-related. So the algorithm coalesces all three nodes, assigning both d and a to register rax. Then, since variables b and c do not interfere, they can be assigned to the same register (rcx, say). As shown above, the resulting code uses only two registers and has three fewer mov instructions than the original code.
Chaitin's algorithm has a number of corner cases to worry about; Appel's textbook [4] has a more detailed description of the algorithm, including pseudocode, that may be helpful.