EFFICIENT INTERPRETATION
------------------------

What are the costs in the standard stack-based interpreter?

  - Dispatch: getting control to the code corresponding to the next
    instruction.
  - Argument access: getting the values of the instruction parameters
    and arguments.
  - Actually performing the function.
    (Note: interpretation is a better win when this is slow, since the
    dispatch and argument-access overhead is then a smaller fraction of
    the total!)

ARGUMENT ACCESS COSTS

Consider the cost of stack element accesses: loads/stores to memory and
sp adjustment.

In the standard stack implementation:

  ICONST_0:
    C code:
      *(++sp) = (u4) 0;
    machine code (sp in r9):
      add r9,4,r9
      mov 0,r5
      st  r5,[r9]

  ISTORE_0:
    C code:
      locals[0] = *(sp--);
    machine code (sp in r9, locals in r10):
      ld  [r9],r5
      sub r9,4,r9
      st  r5,[r10]

  IADD:
    C code:
      int32 v2 = (int32) (*(sp--));
      int32 v1 = (int32) (*sp);
      *sp = (u4) (v1+v2);
    machine code (sp in r9):
      ld  [r9],r5
      sub r9,4,r9
      ld  [r9],r6
      add r5,r6,r5
      st  r5,[r9]

Assumptions:

  - Assume a register-rich target architecture, like a SPARC.
  - We've moved sp out of a global into a local variable (hence, into a
    register, we hope!)

IDEA: *cache* the top of the stack in a local (i.e., register).

  u4 tos;

This is a win for IADD (3 instructions instead of 5), a wash for the
other instructions (still 3 each):

  ICONST_0:
    C code:
      *(++sp) = tos;
      tos = (u4) 0;
    machine code (tos in r1, sp in r9):
      add r9,4,r9
      st  r1,[r9]
      mov 0,r1

  ISTORE_0:
    C code:
      locals[0] = tos;
      tos = *(sp--);
    machine code (tos in r1, sp in r9, locals in r10):
      st  r1,[r10]
      ld  [r9],r1
      sub r9,4,r9

  IADD:
    C code:
      int32 v2 = (int32) tos;
      int32 v1 = (int32) (*(sp--));
      tos = (u4) (v1+v2);
    machine code (tos in r1, sp in r9):
      ld  [r9],r5
      sub r9,4,r9
      add r1,r5,r1

What about caching *two* elements (tos and atos)?

  IADD:
    C code:
      int32 v2 = (int32) tos;
      int32 v1 = (int32) atos;
      tos = (u4) (v1+v2);
      atos = *(sp--);
    machine code (tos in r1, atos in r2, sp in r9):
      add r1,r2,r1
      ld  [r9],r2
      sub r9,4,r9

  - No better: still 3 instructions, because atos must be refilled from
    memory.

But if we allow code to end in a *different* state than it starts, we
can do better for many instructions.
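The one-element caching scheme above can be sketched as a small C interpreter loop. This is only an illustration under stated assumptions: the opcode numbering, the extra ICONST_1 and HALT opcodes, the fixed stack size, and the name run_tos_cached are all hypothetical and not taken from any real JVM. As in the listings above, sp points at the topmost element held in memory and the stack grows upward. Whether the compiler actually keeps tos and sp in registers is up to the compiler.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical opcode numbering for this sketch (not the JVM encoding). */
enum { OP_ICONST_0, OP_ICONST_1, OP_ISTORE_0, OP_IADD, OP_HALT };

/* Interprets `code` with the top of stack cached in the local `tos`.
 * sp points at the topmost element held in memory, i.e. the element
 * just below tos. Returns the final value of locals[0]. */
static u4 run_tos_cached(const uint8_t *code)
{
    u4 stack[16];
    u4 locals[1] = { 0 };
    u4 *sp = stack;  /* stack[0] is never used; pushes pre-increment */
    u4 tos = 0;      /* dummy value while the logical stack is empty */

    for (;;) {
        switch (*code++) {
        case OP_ICONST_0:
            assert(sp < stack + 15);
            *(++sp) = tos;       /* spill old top to memory */
            tos = (u4) 0;
            break;
        case OP_ICONST_1:
            assert(sp < stack + 15);
            *(++sp) = tos;
            tos = (u4) 1;
            break;
        case OP_ISTORE_0:
            locals[0] = tos;     /* no load needed for the stored value */
            tos = *(sp--);       /* refill the cache from memory */
            break;
        case OP_IADD:
            /* One load instead of two, and no store. Unsigned addition
             * gives the same bits as wrapping 32-bit signed addition. */
            tos = *(sp--) + tos;
            break;
        case OP_HALT:
            return locals[0];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

For example, the program ICONST_1, ICONST_1, IADD, ISTORE_0, HALT leaves 2 in locals[0]. The first push spills the dummy tos to stack[1], so the empty-stack case needs no special handling.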
"Generalized stack caching" (Ertl)

[see diagram s0,s1,s2 -- with the three ops.]

  - Can do dynamically, just by having three dispatch routines
    (switch on state as well as opcode).
  - Or *statically*: have three varieties of each opcode, and choose
    which one to use at compile time.
    Issue: what to do at join points.

Aside: Why stack-based interpreters?

  - Despite some early attempts otherwise, hardware processors almost
    always use registers.
      - Each HW instruction is parameterized by its registers (from a
        small fixed set).
      - Why is this good? Easy to decode (in parallel).
  - Why not for software machines too?
      - Stack machines take arguments implicitly; less (serial)
        decoding overhead.
      - Software registers cannot easily be stored in hardware
        registers [no equivalent to stack caching]; end up indexing an
        in-memory array.
      - On the other hand, register architectures require fewer
        instructions; hence less *dispatch*.
      - So maybe not a crazy idea (e.g., Parrot for Perl).
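The dispatch trade-off in the aside can be made concrete with a rough sketch that runs a = b + c on a toy stack machine and on a toy register machine and counts dispatches. Everything here is an assumption made for illustration: the opcode names and encodings, the one-byte operands, and the functions run_stack and run_register do not come from the JVM or from Parrot. Note that the "registers" of the register machine are just an in-memory array, as the aside says.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical encodings. ILOAD and ISTORE take one operand byte
 * (a local index); ADD takes three (dst, src1, src2). */
enum { S_ILOAD, S_IADD, S_ISTORE, S_HALT };
enum { R_ADD, R_HALT };

/* Toy stack VM. Returns the number of instructions dispatched,
 * not counting the final HALT. */
static int run_stack(const uint8_t *code, u4 *locals)
{
    u4 stack[16];
    u4 *sp = stack;  /* stack[0] unused; sp points at the top element */
    int dispatches = 0;

    for (;;) {
        switch (*code++) {
        case S_ILOAD:  *(++sp) = locals[*code++];            break;
        case S_ISTORE: locals[*code++] = *(sp--);            break;
        case S_IADD:   { u4 v2 = *(sp--); *sp = *sp + v2; }  break;
        case S_HALT:   return dispatches;
        default:       assert(!"unknown opcode"); return -1;
        }
        dispatches++;
    }
}

/* Toy register VM. The software registers live in the in-memory
 * array `regs`, so every operand access indexes memory. */
static int run_register(const uint8_t *code, u4 *regs)
{
    int dispatches = 0;

    for (;;) {
        switch (*code++) {
        case R_ADD: {
            /* Operands are decoded serially, one byte at a time. */
            uint8_t dst  = *code++;
            uint8_t src1 = *code++;
            uint8_t src2 = *code++;
            regs[dst] = regs[src1] + regs[src2];
            break;
        }
        case R_HALT:
            return dispatches;
        default:
            assert(!"unknown opcode");
            return -1;
        }
        dispatches++;
    }
}
```

With b and c in slots 1 and 2, the stack program ILOAD 1, ILOAD 2, IADD, ISTORE 0 takes four dispatches, while the register program ADD 0,1,2 takes one, at the price of decoding three operand bytes inside that single instruction.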