Making Interpreters Efficient

- VM programs have an explicitly specified binary representation, typically called bytecode.
- Most VM's can execute bytecode directly by interpretation.
- Interpretation is typically 1-2 orders of magnitude slower than compilation (but of course this depends on interpreter, compiler, target machine)
- So serious VM's usually do JIT compilation too
- Still, it is worthwhile to make interpreters efficient
- But it is also desirable to keep them portable

Target Hardware: Facts of Life

Accessing memory is slow!
- Even an L1 cache hit typically costs several cycles.
- Cache misses cost 10's-100's of cycles.
- Must try to keep data in registers if possible.
- Small changes in data, code layout can have big effects

Machines are deeply pipelined!
- Conditional branches are bad.
- Dynamic-target branches are probably worse.
- Should try to utilize hardware support tricks (e.g. branch target buffers).

Costs of Interpretation

Interpreting an instruction requires:
- Dispatching the instruction: getting control to the code corresponding to the instruction
- Accessing the operands: getting the values of the parameters and arguments (and storing the result)
- Actually performing the computation. (Note: interpreters win more when this is slow!)

**Naive Interpreter: Sample Instructions**

```c
u4 stack[STACKSIZE];
u4* sp;
void interp (Method *method) {
  u1 *pc = method->code;
  while (1) {
    switch (*pc) {
      case ICONST_0: { *(++sp) = (u4) 0; pc++; break; }
      case ISTORE_0: { locals[0] = *(sp--); pc++; break; }
      case IADD: { int32 v2 = (int32) (*(sp--));
                   int32 v1 = (int32) (*sp);
                   *sp = (u4) (v1 + v2);
                   pc++; break; }
      ...
    }
  }
}
```
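The fragment above can be fleshed out into a complete (if tiny) switch-based interpreter. This is a minimal sketch: the opcode numbering, the `HALT` instruction, the `locals` array size and the function name `interp_naive` are assumptions made for illustration, not part of the JVM definition.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u1;
typedef uint32_t u4;

/* Hypothetical opcode numbering, chosen only for this sketch. */
enum { ICONST_0, ICONST_1, ILOAD_0, ISTORE_0, IADD, HALT };

#define STACKSIZE 64
static u4 stack[STACKSIZE];
static u4 *sp;                 /* global, as in the naive version above */
static u4 locals[4];

/* Runs bytecode until HALT and returns the final value of locals[0]. */
u4 interp_naive(const u1 *code) {
    const u1 *pc = code;
    sp = stack;                /* slot 0 is unused; sp points at the top element */
    for (;;) {
        switch (*pc) {
        case ICONST_0: *(++sp) = (u4) 0; pc++; break;
        case ICONST_1: *(++sp) = (u4) 1; pc++; break;
        case ILOAD_0:  *(++sp) = locals[0]; pc++; break;
        case ISTORE_0: locals[0] = *(sp--); pc++; break;
        case IADD: {
            int32_t v2 = (int32_t) *(sp--);
            int32_t v1 = (int32_t) *sp;
            *sp = (u4) (v1 + v2);
            pc++; break;
        }
        case HALT: return locals[0];
        default: assert(!"undefined opcode"); return 0;
        }
    }
}
```

Every instruction goes through the same `switch`, and every stack access goes through the global `sp`; the following slides attack both costs.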

**Speeding Up Operand Access**

First, let's consider just the cost of accessing stack elements: loads/stores to memory and sp adjustment.

C code:

```c
case ICONST_0: { *(++sp) = (u4) 0; pc++; break; }
```

X86 (32-bit) machine code (obtained using gcc -S)

```assembly
movl _sp, %eax // load value of sp from C global
leal 4(%eax), %ebx // new sp
movl %ebx, _sp // save new sp value into C global
movl $0, 4(%eax) // *(new sp) = 0
incl -64(%ebp) // pc++
jmp top // break
```

SPARC machine code (obtained using gcc -S)

```assembly
; initially %l2 = %hi(sp) %l1 = pc
ld [%l2+%lo(sp)], %g1 ; load value of sp from C global
add %l1, 1, %l1 ; pc++
add %g1, 4, %o5 ; new sp
st %g0, [%g1+4] ; *(new sp) = 0 (%g0 is always 0)
ba %xcc, top ; (unconditional branch) break
st %o5, [%l2+%lo(sp)] ; DELAY SLOT!: save new sp into C global
```

**Utilizing Registers**

Keeping sp in a global memory location looks like a terrible idea, since it requires one load and one store per bytecode executed.

Let's make it a local of interp instead. Then the C compiler should be able to keep it in a register (if there are any available).

New SPARC machine code:

```assembly
; initially %i2 = sp %l1 = pc
add %i2, 4, %i2 ; ++sp
add %l1, 1, %l1 ; pc++
ba %xcc, top ; break
st %g0, [%i2] ; DELAY SLOT!: *(sp) = 0
```

SPARC features:
- load/store architecture
- about 30 registers
- delay slots expose pipeline (a little)

X86 VERSION

New X86 machine code:

```assembly
// initially -28(%ebp) = sp -64(%ebp) = pc
addl $4, -28(%ebp) // ++sp
movl -28(%ebp), %ecx // load value of sp
movl $0, (%ecx) // *sp = 0;
incl -64(%ebp) // pc++
jmp top
```

Now sp lives in the local stack frame rather than in a global, but there still isn’t a free register for it.

STACK CACHING

Idea: what if we cache the top-of-stack in a local variable s0?
(Assume that sp points to the top of the remainder of the stack.)

This saves one load and one store for IADD, because the second operand is already in s0 and the result stays there:

```c
case IADD: { int32 v2 = (int32) s0;
             int32 v1 = (int32) (*(sp--));
             s0 = (u4) (v1+v2); pc++; break; }
```

Approximate SPARC code:

```assembly
; initially %i2 = sp %i3 = s0 %l1 = pc
ld [%i2], %o5 ; v1 = *sp
add %i2, -4, %i2 ; sp--
add %l1, 1, %l1 ; pc++
ba %xcc, top ; break
add %o5, %i3, %i3 ; DELAY SLOT!: s0 = v1 + v2
```

For comparison, the uncached C code:

```c
case IADD: { int32 v2 = (int32) (*(sp--));
             int32 v1 = (int32) (*sp);
             *sp = (u4) (v1 + v2);
             pc++; break; }
```

and its SPARC code (two loads and one store):

```assembly
; initially %i2 = sp %l1 = pc
ld [%i2], %o5 ; *sp
add %i2, -4, %i2 ; sp--
add %l1, 1, %l1 ; pc++
ld [%i2], %g1 ; *(new sp)
add %g1, %o5, %g1 ; %g1 = computed result
ba %xcc, top ; break
st %g1, [%i2] ; DELAY SLOT!: *(new sp) = result
```

But it is a wash for the other two instructions because we have to keep s0 up-to-date.

CACHING ONE SLOT

```c
case ICONST_0: { *(++sp) = s0; s0 = (u4) 0; pc++; break; }
```

Approximate SPARC code (still one store):

```assembly
; initially %i2 = sp %i3 = s0 %l1 = pc
add %i2, 4, %i2 ; ++sp
st %i3, [%i2] ; *(sp) = s0
add %l1, 1, %l1 ; pc++
ba %xcc, top ; break
mov %g0, %i3 ; DELAY SLOT!: s0 = 0
```

CACHING ONE SLOT (CONTINUED)

```c
case ISTORE_0: { locals[0] = s0; s0 = *(sp--); pc++; break; }
```

Approximate SPARC code (still one load and one store)

```assembly
; initially %i2 = sp %i3 = s0 %l1 = pc %l2 = base of locals
st %i3, [%l2] ; locals[0] = s0
add %l1, 1, %l1 ; pc++
ld [%i2], %i3 ; s0 = *sp
ba %xcc, top ; break
add %i2, -4, %i2 ; DELAY SLOT!: sp--
```
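Putting the one-slot versions of the sample instructions together gives an interpreter like the following. This is a sketch under assumptions: the opcode numbering, `HALT`, and the name `interp_tos_cached` are hypothetical, and the initial value of `s0` is a dummy that gets spilled by the first push.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u1;
typedef uint32_t u4;

/* Hypothetical opcode numbering, chosen only for this sketch. */
enum { ICONST_0, ICONST_1, ILOAD_0, ISTORE_0, IADD, HALT };

static u4 locals[4];

/* Top of stack lives in s0; sp points at the in-memory remainder. */
u4 interp_tos_cached(const u1 *code) {
    u4 stack[64];
    u4 *sp = stack;          /* local, so it can live in a register */
    u4 s0 = 0;               /* cached top of stack (dummy until first push) */
    const u1 *pc = code;
    for (;;) {
        switch (*pc++) {
        case ICONST_0: *(++sp) = s0; s0 = 0; break;          /* one store */
        case ICONST_1: *(++sp) = s0; s0 = 1; break;          /* one store */
        case ILOAD_0:  *(++sp) = s0; s0 = locals[0]; break;
        case ISTORE_0: locals[0] = s0; s0 = *(sp--); break;  /* one load */
        case IADD:     s0 = (u4) ((int32_t) *(sp--) + (int32_t) s0); break;
        case HALT:     return locals[0];
        default:       assert(!"undefined opcode"); return 0;
        }
    }
}
```

Only IADD gets cheaper (one load, no store); the pushes and pops still touch memory once each, which matches the "wash" observed above.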

CACHING TWO SLOTS

What if we keep two elements in local variables (registers) named s1 (top of stack) and s0 (next-to-top of stack)?

```c
case ISTORE_0: { locals[0] = s1; s1 = s0; s0 = *(sp--); pc++; break; }
case ICONST_0: { *(++sp) = s0; s0 = s1; s1 = (u4) 0; pc++; break; }
case IADD: { int32 v2 = (int32) s1; int32 v1 = (int32) s0;
             s1 = (u4) (v1+v2); s0 = *(sp--); pc++; break; }
```

This just pushes off the problem: no improvement in number of loads and stores needed.

New idea: let’s keep a different number of cached stack slots at different points during execution.

GENERALIZED STACK CACHING

- The interpreter operates in one of several different states, corresponding to how many stack slots are cached.
- Each instruction (potentially) causes transition to a different state, according to what it does to the stack.
- For example:
  - ICONST_0 moves to a state where more slots are cached;
  - ISTORE_0 moves to one where fewer slots are cached.
  - IADD moves to a state where one slot is cached.

For JVM, 3 states are sufficient to handle all instruction types.

- State 0: no slots cached.
- State 1: top of stack is cached in variable s0.
- State 2: top of stack is cached in variable s1; next-to-top in s0.

In all states, sp points to the remainder of the stack beyond the cached slots.
Sample code follows (in practice we may organize it differently).

GENERALIZED STACK CACHING (2)

```c
case IADD: {
    switch (state) {
        case 0: { int32 v2 = (int32) (*(sp--)); int32 v1 = (int32) (*(sp--));
            s0 = (u4) (v1+v2); state = 1; break; }
        case 1: { int32 v2 = (int32) s0; int32 v1 = (int32) (*(sp--));
            s0 = (u4) (v1+v2); state = 1; break; }
        case 2: { int32 v2 = (int32) s1; int32 v1 = (int32) s0;
            s0 = (u4) (v1+v2); state = 1; break; }
    }
    pc++; break; }
```

Consider a typical expression like
\[ b = a + 3 \]
where we assume \( a \) is local variable 0 and \( b \) is local variable 1.

(Assume we start with state = 0.)

<table>
<thead>
<tr>
<th>Bytecode</th>
<th>Corresponding executed code</th>
</tr>
</thead>
<tbody>
<tr>
<td>ILOAD_0</td>
<td>s0 = locals[0]; state = 1;</td>
</tr>
<tr>
<td>ICONST_3</td>
<td>s1 = 3; state = 2;</td>
</tr>
<tr>
<td>IADD</td>
<td>s0 = s1 + s0; state = 1;</td>
</tr>
<tr>
<td>ISTORE_1</td>
<td>locals[1] = s0; state = 0;</td>
</tr>
</tbody>
</table>

We do only the essential loads and stores – no stack traffic at all!
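The trace in the table can be checked with a small interpreter that tracks the state explicitly and counts accesses to the in-memory stack. This is a sketch, not a tuned implementation: the opcode numbering, `HALT`, the `Result` struct and `interp_cached` are assumptions for illustration, and the state is tested with `if` rather than by dispatching to separate per-state code.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u1;
typedef uint32_t u4;

/* Hypothetical opcode numbering, chosen only for this sketch. */
enum { ILOAD_0, ICONST_3, IADD, ISTORE_1, HALT };

typedef struct {
    u4  locals[2];
    int stack_accesses;      /* loads/stores to the in-memory stack */
} Result;

Result interp_cached(const u1 *code, u4 a) {
    u4 stack[64], *sp = stack;
    u4 s0 = 0, s1 = 0;       /* cached slots, meaning depends on state */
    int state = 0;           /* number of cached slots: 0, 1 or 2 */
    Result r = { { a, 0 }, 0 };
    for (const u1 *pc = code; ; pc++) {
        switch (*pc) {
        case ILOAD_0: case ICONST_3: {               /* (0,1): push */
            u4 v = (*pc == ILOAD_0) ? r.locals[0] : 3;
            if (state == 0)      { s0 = v; state = 1; }
            else if (state == 1) { s1 = v; state = 2; }
            else { *(++sp) = s0; r.stack_accesses++; s0 = s1; s1 = v; }
            break;
        }
        case IADD: {                                 /* (2,1) */
            int32_t v1, v2;
            if (state == 0) {
                v2 = (int32_t) *(sp--); v1 = (int32_t) *(sp--);
                r.stack_accesses += 2;
            } else if (state == 1) {
                v2 = (int32_t) s0; v1 = (int32_t) *(sp--);
                r.stack_accesses++;
            } else {
                v2 = (int32_t) s1; v1 = (int32_t) s0;
            }
            s0 = (u4) (v1 + v2); state = 1;
            break;
        }
        case ISTORE_1:                               /* (1,0): pop */
            if (state == 0)      { r.locals[1] = *(sp--); r.stack_accesses++; }
            else if (state == 1) { r.locals[1] = s0; state = 0; }
            else                 { r.locals[1] = s1; state = 1; }
            break;
        case HALT: return r;
        default: assert(!"undefined opcode"); return r;
        }
    }
}
```

Running the four-instruction sequence for `b = a + 3` from state 0 leaves `stack_accesses` at zero, as the table predicts.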

More generally, instructions are classified by a pair:

 (# of stack slots they consume, # of stack slots they produce)

For example:

- ISTORE_0: (1, 0)
- ICONST_0: (0, 1)
- IADD: (2, 1)
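Given such a pair, the state transition can be computed mechanically. The helper below is a sketch of one plausible policy (consume cached slots first, keep results cached, spill when more than two slots would be cached); the name `next_state` and the policy itself are assumptions, chosen to agree with the IADD cases shown earlier.

```c
#include <assert.h>

enum { MAX_CACHED = 2 };   /* states 0, 1, 2 as described above */

/* Hypothetical helper: state after executing an instruction that
   consumes `consume` slots and produces `produce` slots. */
int next_state(int state, int consume, int produce) {
    assert(state >= 0 && state <= MAX_CACHED);
    assert(consume >= 0 && produce >= 0);
    int remaining = state > consume ? state - consume : 0; /* cached slots left over */
    int after = remaining + produce;                       /* results stay cached */
    return after > MAX_CACHED ? MAX_CACHED : after;        /* excess is spilled */
}
```

For example, IADD (2, 1) leads to state 1 from every starting state, while ICONST_0 (0, 1) in state 2 stays in state 2 and must spill one slot to memory.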

**Dynamic vs. Static Caching**

So far we've described dynamic stack caching, where the interpreter keeps track of its current state.

- In practice, we implement this by having three complete sets of instruction implementations and dispatching to the correct one based on current state as well as opcode (more on this later).
- But it may seem like we should be able to predict the state at each program point statically (before execution). If so, we could simply have three variants of each opcode, and select the right one at compile time. This would be more efficient.
- Only problem: at join points in the code, the state may differ depending on the path by which the join point was reached. Must choose a convention for which state to use there, and add compensation code to the other branches; this is complex in practice.

Nearly all hardware processors use **registers**

- Each HW instruction is parameterized by its argument/result registers.
- Why is this good for hardware? Because the opcode and the argument registers can be decoded in **parallel**, and values can quickly be fetched from a small, fast register file.

Why not try this in software machines too?

- Parameters must be fetched from the byte stream and decoded **serially**; for stack instructions, parameters are implicit.
- Instructions with parameters take more space.
- Software registers cannot easily be stored in hardware registers, because the latter can't be indexed. So software registers end up living in an in-memory array (just like stack slots).
- On the other hand, register architectures require fewer instructions; hence less **dispatch**. So maybe a worthwhile idea after all...
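To make the trade-off concrete, here is a sketch of a tiny register-style VM. The instruction format (`ADDI dst, src, imm` encoded as four bytes), the opcode numbering, `HALT` and `run_register_vm` are all invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u1;
typedef int32_t i4;

/* Hypothetical register-VM encoding: opcode byte, then operand bytes. */
enum { ADDI, HALT };       /* ADDI dst, src, imm : regs[dst] = regs[src] + imm */

#define NREGS 8

/* Software registers live in an in-memory array, as noted above.
   Counts dispatches and returns regs[1] at HALT. */
i4 run_register_vm(const u1 *code, i4 *regs, int *dispatches) {
    const u1 *pc = code;
    *dispatches = 0;
    for (;;) {
        (*dispatches)++;
        switch (*pc) {
        case ADDI: {
            u1 dst = pc[1], src = pc[2];     /* operands decoded serially */
            i4 imm = (int8_t) pc[3];
            assert(dst < NREGS && src < NREGS);
            regs[dst] = regs[src] + imm;
            pc += 4;                         /* instruction takes more space */
            break;
        }
        case HALT: return regs[1];
        default: assert(!"undefined opcode"); return 0;
        }
    }
}
```

With `a` in register 0 and `b` in register 1, `b = a + 3` is the single instruction `ADDI 1, 0, 3`: one dispatch (plus `HALT`) instead of the four dispatches of the stack version, at the cost of three operand bytes.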

### Indirect Threaded Code

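```c
void interp(Method *method) {
    /* &&label (GCC's labels-as-values extension) yields a snippet's address */
    static void *dispatch_table[] = {
        &&NOP,
        &&ACONST_NULL,
        &&ICONST_M1,
        ...,
        &&JSR_W
    };
    u1 *pc = method->code;
    goto *(dispatch_table[*pc]);   /* dispatch the first instruction */
    ...                            /* snippets, each ending with its own indirect goto */
}
```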
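A complete, compilable version for a handful of opcodes might look like the sketch below. It relies on the labels-as-values extension of GCC and Clang, so it is not standard C; the opcode numbering, `HALT`, the label names and `interp_indirect` are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t  u1;
typedef uint32_t u4;

/* Hypothetical opcode numbering; must match the table order below. */
enum { ICONST_1, IADD, ISTORE_0, HALT };

/* Requires GCC or Clang: &&label and "goto *ptr" are extensions. */
u4 interp_indirect(const u1 *code) {
    static void *dispatch_table[] = {
        &&L_ICONST_1, &&L_IADD, &&L_ISTORE_0, &&L_HALT
    };
    u4 stack[64], *sp = stack;
    u4 locals[1] = { 0 };
    const u1 *pc = code;

/* Two fetches per dispatch: the opcode byte, then the table entry. */
#define DISPATCH() do { assert(*pc <= HALT); goto *dispatch_table[*pc]; } while (0)

    DISPATCH();

L_ICONST_1: *(++sp) = (u4) 1;                     pc++; DISPATCH();
L_IADD:     { u4 v2 = *(sp--); *sp = *sp + v2; }  pc++; DISPATCH();
L_ISTORE_0: locals[0] = *(sp--);                  pc++; DISPATCH();
L_HALT:     return locals[0];

#undef DISPATCH
}
```

Each snippet ends with its own indirect jump, so there is no shared `switch` at the top of a loop and no range check in the dispatch path of a release build.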

### Speeding Up Instruction Dispatch

What does SPARC code look like now?

```assembly
; %l4 = table    %l1 = pc
top:
    ldub [%l1], %o1 ; *pc
    and %o1, 0xff, %g1 ; (u1) *pc
    cmp %g1, 200 ; if >200
    bgu,pn %icc, undefined ; or <0, branch
    sll %g1, 2, %g1 ; scale
    ld [%l4+%g1], %o4 ; fetch snippet address
    jmp %o4 ; jump to snippet
    nop

table:
    .word nop_snippet
    .word aconst_null_snippet
    .word iconst_m1_snippet
    .word iconst_0_snippet
    ...
    .word goto_w_snippet

undefined:
    ...issue error and die...
```

**Using Processor Branch Support**

One extra reason why indirect threaded code improves performance may be that it makes better use of hardware support for branch prediction.

Many pipelined processors contain a **branch target buffer** (BTB) that dynamically remembers the last target for each branch instruction (including conditional and indirect branches). The next time the branch instruction is executed, the processor pre-fetches from the address predicted by the buffer.

- A naive interpreter makes **terrible** use of this feature, because a **single** instruction dispatches to all the snippets, so prediction accuracy is \( \approx 0 \).
- The indirect threaded code version does somewhat better, because the dispatches are distributed, and certain bytecode instruction sequences are quite common, so prediction accuracy may be \( > 0 \).

But a fundamental prediction mismatch between the VM and the target hardware remains.
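The difference between the two dispatch styles can be illustrated with a toy model of a last-target BTB. This is only a sketch: real BTBs are finite, indexed by address bits, and often smarter than "last target", and the function `btb_hits` and its trace format are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t u1;

enum { NOPS = 4 };   /* number of distinct opcodes in this toy model */

/* Counts correct predictions of a last-target BTB over an opcode trace.
   shared_site != 0 models switch dispatch: one indirect branch for all.
   shared_site == 0 models threaded code: one indirect branch per snippet,
   i.e. the branch site is identified by the opcode just executed. */
int btb_hits(const u1 *trace, int n, int shared_site) {
    int last_target[NOPS];
    int hits = 0;
    for (int i = 0; i < NOPS; i++) last_target[i] = -1;   /* empty BTB */
    for (int i = 1; i < n; i++) {
        int site = shared_site ? 0 : trace[i - 1];
        assert(trace[i] < NOPS && site < NOPS);
        if (last_target[site] == trace[i]) hits++;        /* predicted correctly */
        last_target[site] = trace[i];                     /* remember last target */
    }
    return hits;
}
```

For a loop body that repeats the opcodes 0, 1, 2, 3, the shared branch never predicts correctly (the next opcode always differs from the previous one), while the per-snippet branches miss only on the first iteration.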

---

**Rewriting Byte Code**

But notice that we are no longer interpreting the original bytecode.

It must be rewritten before execution.

This is simple in principle, but there are details, e.g.

- What should we do with the parameter bytes following the opcode?

If we’re going to rewrite the bytecode, there are **many** opportunities to improve things, e.g.
- Combine code for similar opcodes (e.g. constant loading).
- Short-circuit constant pool references (important in full language)
- Perform static stack caching
- Etc, etc.

A more radical rewrite idea: dispatch to each snippet using a **subroutine** call instruction. May pay off on processors that pre-fetch from the return address on the hardware stack!

---

**Direct Threading**

Each instruction dispatch still requires two fetches: one to get the byte code and a second to get the snippet address.

New idea: what if we represent each instruction opcode by the address of its snippet?

```c
void interp() {
    void *codeaddrs[] = ...; /* fill this with snippet addrs */
    void **pc = codeaddrs; /* initialize to start */
    goto **pc;
    ACONST_NULL:
        *(++sp) = (u4) 0;
        pc++;
        goto **pc;
    ...
}
```

Now need only one fetch per instruction!
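A runnable sketch of the idea, including the rewrite from bytecode to snippet addresses, is given below. It again depends on the GCC/Clang labels-as-values extension; the opcode numbering, `HALT`, the fixed 64-entry buffers and `interp_direct` are assumptions, and the opcodes used here have no parameter bytes, so that detail is sidestepped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  u1;
typedef uint32_t u4;

/* Hypothetical opcode numbering; must match the snippet table order. */
enum { ICONST_1, IADD, ISTORE_0, HALT, NUM_OPS };

/* Requires GCC or Clang: &&label and "goto *ptr" are extensions. */
u4 interp_direct(const u1 *code, size_t len) {
    static void *snippet[NUM_OPS] = {
        &&L_ICONST_1, &&L_IADD, &&L_ISTORE_0, &&L_HALT
    };
    void *threaded[64];
    u4 stack[64], *sp = stack;
    u4 locals[1] = { 0 };

    /* Rewrite pass: replace each opcode byte by its snippet address. */
    assert(len <= 64);
    for (size_t i = 0; i < len; i++) {
        assert(code[i] < NUM_OPS);
        threaded[i] = snippet[code[i]];
    }

    void **pc = threaded;
    goto **pc;                 /* one fetch: the slot itself holds the target */

L_ICONST_1: *(++sp) = (u4) 1;                 goto **++pc;
L_IADD:     { u4 v2 = *(sp--); *sp += v2; }   goto **++pc;
L_ISTORE_0: locals[0] = *(sp--);              goto **++pc;
L_HALT:     return locals[0];
}
```

The price is space: each instruction now occupies a pointer-sized slot instead of one byte, and the rewrite must run before the code is first executed.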

---

**Reducing Dispatches**

Another way to reduce dispatch time is to do **fewer** dispatches.

One basic approach is to **combine** frequently occurring sequences of instructions into “macro” or “super”-instructions.

For example, the following sequence pattern is very common:

```c
ILOAD n
ICONST i
IADD
ISTORE n
```

In fact, the JVM designers already invented a combined instruction for this (IINC) but the same idea works for other sequences.
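A rewrite pass that performs this particular fusion can be sketched as a simple peephole scan. The byte encoding below (one operand byte after ILOAD, ICONST and ISTORE; two after IINC), the opcode numbering and the names `op_len` and `fuse_iinc` are assumptions for illustration. The sketch also ignores branch targets: a real pass must not fuse a sequence that some branch jumps into the middle of.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t u1;

/* Hypothetical encoding: operand bytes are noted after each opcode. */
enum { ILOAD /* n */, ICONST /* i */, IADD, ISTORE /* n */, IINC /* n, i */ };

/* Length in bytes of an instruction, including its operands. */
static size_t op_len(u1 op) {
    switch (op) {
    case IADD: return 1;
    case IINC: return 3;
    default:   return 2;     /* ILOAD, ICONST, ISTORE */
    }
}

/* Copies in[0..n) to out, replacing "ILOAD n; ICONST i; IADD; ISTORE n"
   (same n in both places) by "IINC n, i". Returns the new length. */
size_t fuse_iinc(const u1 *in, size_t n, u1 *out) {
    size_t i = 0, o = 0;
    while (i < n) {
        if (i + 7 <= n && in[i] == ILOAD && in[i + 2] == ICONST &&
            in[i + 4] == IADD && in[i + 5] == ISTORE && in[i + 6] == in[i + 1]) {
            out[o++] = IINC;
            out[o++] = in[i + 1];    /* local index n */
            out[o++] = in[i + 3];    /* increment i */
            i += 7;
        } else {
            size_t len = op_len(in[i]);
            assert(i + len <= n);
            for (size_t k = 0; k < len; k++) out[o++] = in[i++];
        }
    }
    return o;
}
```

Four dispatches become one, and the seven-byte sequence shrinks to three bytes.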

Another approach is to use a **register** architecture, which typically requires many fewer instructions (although each instruction gets more parameters).

Superinstructions can be constructed in several ways:

- **Statically, for multiple programs:**
  - Essentially a refinement of the VM definition, possibly tuned to workload from a particular set of programs.
  - Can construct such specialized VM's semi-automatically from a generic VM.
  - Specialized VM can be compiled with “cross-snippet” optimization.
- **Statically, for a single program**
  - Encoding is sent with the program.

Static encodings also have the benefit of reducing the program size, allowing quicker transmission.

- **Dynamically, by building superinstructions “on the fly” from snippet code.**
  - This is beginning to resemble a compiler!