Low-level optimizations: Register Allocation and Instruction Scheduling

Register Allocation:  IR assumes infinite number of temporaries. 
Must allocate real register to each temporary.

Assume liveness analysis has been done (by data flow as shown before).


Initial Example:
Live in: s1			Live Out:

1      if s1 >= 0 goto L1	- 
2      s2 <- M[fp]		s2
3      s4 <- s2 + 2		s2 s4
4      s5 <- s4 + s2		s2 s4 s5
5      s6 <- s4 - s2		s5 s6
6      goto L1			s5 s6
7  L1: s2 <- M[fp + 4]		s2 
8      s3 <- M[fp]		s2 s3
9      s5 <- s2 + s3		s3 s5
10     s6 <- M[fp + 8]		s3 s5 s6
11 L2: s7 <- M[fp + 12]		s3 s5 s6 s7
12     s8 <- s3 + s6		s5 s7 s8
13     s9 <- s8 - s5		s7 s8

Live out: s7,s8


Graph Coloring

1. Build an interference graph from liveness data.  One node per temporary,
with an edge between two nodes if corresponding temporaries are live at
the same time. 

2. If we have k registers available, attempt to find a k-coloring of the interference
graph (each color represents a register, and coloring a node corresponds to
assigning that temp to a register).  This is an NP-complete problem, so no efficient
exact algorithm is known.  Instead, use the following heuristic:

- Choose a node with degree < k and remove it (remembering it on a stack).
  [If resulting graph is k-colorable, then so is original graph, since we can
   always give the removed node a color different from each of its neighbors.]
- If we can't find such a node, must choose one of the remaining nodes to "spill"
  (keep it in memory instead of a register); remove it from the graph and continue.
- When graph is reduced to nothing, add nodes back in reverse order, coloring as we go.

3. If one or more spills were needed,  add spill/restore code to original code
and repeat from step 1.
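
The steps above can be sketched in Python as follows. This is only an illustration, not a
production allocator: the names build_interference and color, the spill policy (pick the
node with the highest degree), and the encoding of the live sets are all assumptions made
for this sketch. The live sets are those listed for the initial example, taking the set
after instruction 8 to be s2 s3 (s2 is still needed by instruction 9). s9 is never live,
so it gets no node here.

```python
from itertools import combinations

def build_interference(live_sets):
    """Step 1: one node per temporary, an edge between temporaries live together."""
    graph = {}
    for live in live_sets:
        for t in live:
            graph.setdefault(t, set())
        for a, b in combinations(sorted(live), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def color(graph, k):
    """Step 2: simplify/select heuristic. Returns (coloring, spilled)."""
    work = {n: set(adj) for n, adj in graph.items()}  # mutable copy of the graph
    stack, spilled = [], []
    while work:
        low = [n for n in sorted(work) if len(work[n]) < k]
        if low:
            n = low[0]                 # degree < k: safe to remove
            stack.append(n)
        else:
            # Assumed spill policy: highest current degree, then highest original degree.
            n = max(sorted(work), key=lambda m: (len(work[m]), len(graph[m])))
            spilled.append(n)
        for m in work.pop(n):          # delete n and its edges
            work[m].discard(n)
    coloring = {}
    while stack:                       # add nodes back in reverse order
        n = stack.pop()
        used = {coloring[m] for m in graph[n] if m in coloring}
        coloring[n] = min(c for c in range(k) if c not in used)
    return coloring, spilled

# Live-in set followed by the live-out sets of instructions 1..13.
LIVE = [{"s1"}, set(), {"s2"}, {"s2", "s4"}, {"s2", "s4", "s5"}, {"s5", "s6"},
        {"s5", "s6"}, {"s2"}, {"s2", "s3"}, {"s3", "s5"}, {"s3", "s5", "s6"},
        {"s3", "s5", "s6", "s7"}, {"s5", "s7", "s8"}, {"s7", "s8"}]

graph = build_interference(LIVE)
print(color(graph, 4)[1])   # [] : no spills needed with 4 registers
print(color(graph, 3)[1])   # ['s5'] : one spill needed with 3 registers
```

Spilled nodes are left uncolored; step 3 would rewrite the code and rerun the whole process.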

The example above is 4-colorable, but not 3-colorable (s3, s5, s6 and s7 are all live
at once after instruction 11).  After spilling s5 to M[fp+16], new code is 3-colorable:

Live in: s1			Live Out:

1      if s1 >= 0 goto L1	- 
2      s2 <- M[fp]		s2
3      s4 <- s2 + 2		s2 s4
4      s5 <- s4 + s2		s2 s4 s5
4a     M[fp+16] <- s5		s2 s4
5      s6 <- s4 - s2		s6
6      goto L1			s6
7  L1: s2 <- M[fp + 4]		s2 
8      s3 <- M[fp]		s2 s3
9      s5 <- s2 + s3		s3 s5
9a     M[fp+16] <- s5		s3
10     s6 <- M[fp + 8]		s3 s6
11 L2: s7 <- M[fp + 12]		s3 s6 s7
12     s8 <- s3 + s6		s7 s8
12a    s5 <- M[fp+16]		s5 s7 s8
13     s9 <- s8 - s5		s7 s8

Live out: s7,s8
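
The spill/restore rewriting used above (store after each definition of the spilled
temporary, reload before each use) can be sketched as below. The function name
add_spill_code and the (dest, sources, text) instruction encoding are hypothetical,
chosen only for this illustration; like the example, the sketch reuses the spilled
temporary's own name for the reload rather than creating a fresh temporary.

```python
def add_spill_code(instrs, temp, slot):
    """instrs: list of (dest, sources, text) tuples. Returns the rewritten list."""
    out = []
    for dest, sources, text in instrs:
        if temp in sources:
            # Restore the value from its stack slot just before it is used.
            out.append((temp, (), f"{temp} <- M[{slot}]"))
        out.append((dest, sources, text))
        if dest == temp:
            # Save the value to its stack slot right after it is defined.
            out.append((None, (temp,), f"M[{slot}] <- {temp}"))
    return out

# Instructions 9..13 of the example, before spilling.
code = [
    ("s5", ("s2", "s3"), "s5 <- s2 + s3"),
    ("s6", (), "s6 <- M[fp + 8]"),
    ("s7", (), "s7 <- M[fp + 12]"),
    ("s8", ("s3", "s6"), "s8 <- s3 + s6"),
    ("s9", ("s8", "s5"), "s9 <- s8 - s5"),
]

for _, _, text in add_spill_code(code, "s5", "fp+16"):
    print(text)
```

On this fragment the output corresponds to instructions 9, 9a, 10, 11, 12, 12a, 13 above.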


Alternative: linear scan (a greedy algorithm) -- see paper.

Real-life Issues: 

- Should *coalesce* the nodes of two temporaries related by a move (source and
  destination hold the same value) when they don't interfere, so the move can be deleted.

- All this is intra-procedural.  What happens at procedure boundaries?
   - Fixed calling conventions, e.g., put arguments in certain registers
   - Caller-save vs. callee-save registers.
   - Strong motivation for inlining!


Instruction Scheduling

Program appears sequential, but most processors introduce Instruction Level Parallelism (ILP)
"under the hood," via pipelining, superscalar dispatch, VLIW (e.g., IA64 Itanium), etc.

So can often improve performance by changing instruction order to suit processor. (Some processors,
e.g., recent Pentiums, do dynamic scheduling in hardware, so there is less to gain from doing it in software.)

Example:  On many machines, normal instructions take one cycle, but some instructions, 
like memory loads and FP operations, take multiple cycles.

Suppose all memory loads take 2 cycles to complete instead of one, and consider


1 r2 <- M[r1]
2 r3 <- M[r1+4]
3 r4 <- r2 + r3
4 r5 <- r2 - 1

If executed in this order, will get a "stall" (wasted cycle) after instruction 2, while
processor waits for r3 value to be loaded and made available for instruction 3. So
sequence takes 5 cycles.

1,2,4,3 is a better order, since no stalling is required, and needs only 4 cycles.
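
The cycle counts for the two orders can be checked with a small simulation. This is a
sketch under simplifying assumptions: a single-issue, in-order machine where a load's
result is usable 2 cycles after issue and every other result after 1 cycle. The name
cycles and the (dest, sources, is_load) encoding are made up for this illustration.

```python
def cycles(order, instrs, load_latency=2):
    """Return the cycle in which the last instruction of `order` issues."""
    ready = {}   # register -> first cycle in which its value can be used
    cycle = 0    # cycle in which the previous instruction issued
    for i in order:
        dest, sources, is_load = instrs[i]
        # Issue no earlier than the next cycle, and not before all operands are ready.
        issue = max([cycle + 1] + [ready.get(r, 0) for r in sources])
        ready[dest] = issue + (load_latency if is_load else 1)
        cycle = issue
    return cycle

INSTRS = {
    1: ("r2", ("r1",), True),          # r2 <- M[r1]
    2: ("r3", ("r1",), True),          # r3 <- M[r1+4]
    3: ("r4", ("r2", "r3"), False),    # r4 <- r2 + r3
    4: ("r5", ("r2",), False),         # r5 <- r2 - 1
}

print(cycles([1, 2, 3, 4], INSTRS))   # 5 (one stall before instruction 3)
print(cycles([1, 2, 4, 3], INSTRS))   # 4 (no stall)
```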

Can compute good schedules by building a dependence DAG, annotating each edge
with the latency (extra delay needed) along that edge.  
Then choose an instruction order consistent with
the DAG order that fills latency slots with useful operations.

For above example:


     [1]       [2]
     / \       /
   1/   \1    /1
   /     \   /
  V       V V
[4]       [3]


This says 1 must precede 3 and 4, and 2 must precede 3.  Moreover, there
must be at least one spare cycle between 1 and 3, between 1 and 4, and between 2 and 3.
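
One common way to pick an order from such a DAG is greedy list scheduling. The sketch
below is one possible version, not necessarily the method intended here: the name
list_schedule, the priority rule (longest latency-weighted path to a leaf), and the
edge encoding {(pred, succ): latency} are assumptions for this illustration. It assumes
one instruction issues per cycle and that the input really is acyclic.

```python
def list_schedule(nodes, edges):
    """Return the issue order; None marks a cycle in which nothing could issue."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for (p, s), lat in edges.items():
        preds[s].append((p, lat))
        succs[p].append((s, lat))

    def height(n):
        # Longest latency-weighted path from n down to a leaf of the DAG.
        return max((1 + lat + height(s) for s, lat in succs[n]), default=0)

    issued = {}   # node -> cycle in which it issued
    order = []
    cycle = 1
    while len(issued) < len(nodes):
        # Ready: every predecessor issued, and its latency slots have elapsed.
        ready = [n for n in nodes if n not in issued
                 and all(p in issued and issued[p] + 1 + lat <= cycle
                         for p, lat in preds[n])]
        if ready:
            n = max(ready, key=height)   # ties go to the earliest node in `nodes`
            issued[n] = cycle
            order.append(n)
        else:
            order.append(None)           # stall: no instruction is ready
        cycle += 1
    return order

# The DAG above: 1 -> 3, 1 -> 4 and 2 -> 3, each with latency 1.
EDGES = {(1, 3): 1, (1, 4): 1, (2, 3): 1}
print(list_schedule([1, 2, 3, 4], EDGES))   # [1, 2, 4, 3]
```

Instruction 4 fills the latency slot after load 2, so no stall appears in the result.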