EFFICIENT INTERPRETATION
------------------------

What are the costs in the standard stack-based interpreter?

- Dispatch: getting control to the code corresponding to the next instruction.
- Argument access: getting the values of the instruction parameters and arguments.
- Actually performing the function. (Note: the slower this step is, the less the interpretive overhead matters, so interpretation is a better win!)
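
A minimal sketch of where the three costs show up in a switch-based
interpreter loop. The opcode numbering, the function name run_plain, and
the fixed 16-slot stack are assumptions made for illustration; they are
not taken from any real VM.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical opcode numbering, for illustration only. */
enum { OP_ICONST_1, OP_BIPUSH, OP_IADD, OP_ISTORE_0, OP_HALT };

/* Runs code until OP_HALT and returns locals[0]. */
u4 run_plain(const uint8_t *code) {
    u4 stack[16];
    u4 locals[1] = { 0 };
    u4 *sp = stack;                 /* points at the current top; stack[0] is unused */
    const uint8_t *pc = code;

    for (;;) {
        switch (*pc++) {            /* cost 1: dispatch */
        case OP_ICONST_1:
            assert(sp < stack + 15);
            *(++sp) = (u4) 1;
            break;
        case OP_BIPUSH:             /* cost 2: argument access (inline operand) */
            assert(sp < stack + 15);
            *(++sp) = (u4) (int32_t) (int8_t) *pc++;
            break;
        case OP_IADD: {
            int32_t v2 = (int32_t) (*(sp--));   /* cost 2: argument access (stack) */
            int32_t v1 = (int32_t) (*sp);
            *sp = (u4) (v1 + v2);               /* cost 3: performing the function */
            break;
        }
        case OP_ISTORE_0:
            locals[0] = *(sp--);
            break;
        case OP_HALT:
            return locals[0];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

For IADD the useful work is a single add; everything else in that case is
dispatch and argument access, which is why those two costs are worth
attacking.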

ARGUMENT ACCESS COSTS

Consider the cost of stack element accesses: loads/stores to memory and sp adjustments.

In standard stack implementation:
  
 ICONST_0:
   C code:
      *(++sp) = (u4) 0;
      
   machine code (sp in r9):
      add r9,4,r9
      mov 0,r5
      st r5,[r9]


 ISTORE_0:
   C code:
      locals[0] = *(sp--);
      
   machine code (sp in r9, locals in r10):
      ld [r9],r5
      sub r9,4,r9
      st r5,[r10]

 IADD:					
   C code:
	int32 v2 = (int32) (*(sp--));
	int32 v1 = (int32) (*sp);
	*sp = (u4) (v1+v2);
     
   machine code (assuming sp in r9)
      ld [r9],r5
      sub r9,4,r9
      ld [r9],r6
      add r5,r6,r5
      st r5,[r9]
   
 Assumptions: 
   - Assume a register-rich target architecture, like a Sparc.
   - We've moved sp out of a global into a local variable (hence, into a register, we hope!)     

IDEA:  *cache* the top of the stack in a local (i.e., a register).

	u4 tos;


 This is a win for IADD (3 instructions and 1 memory access instead of 5 and 3), a wash for the other instructions:


 ICONST_0:
  C code:
	*(++sp) = tos;
        tos = (u4) 0;

  machine code (tos in r1, sp in r9):
        add r9,4,r9
        st r1,[r9]
        mov 0,r1

 ISTORE_0:
  C code:
        locals[0] = tos;
        tos = *(sp--);  

  machine code (tos in r1, sp in r9, locals in r10):
	st r1,[r10]
	sub r9,4,r9
        ld [r9],r1

 IADD:
  C code:
	int32 v2 = (int32) tos;
	int32 v1 = (int32) (*(sp--));
	tos = (u4) (v1+v2);
	
  machine code (assuming TOS in r1, sp in r9):
	
	ld [r9],r5
	sub r9,4,r9
	add r1,r5,r1
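
The three C fragments above can be dropped into a small interpreter loop
to check that the cached version behaves like a plain stack. This is a
sketch only: the opcode numbering, the name run_tos_cached, and the
16-slot stack are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical opcode numbering, for illustration only. */
enum { OP_ICONST_0, OP_ICONST_1, OP_IADD, OP_ISTORE_0, OP_HALT };

/* Runs code until OP_HALT and returns locals[0]. */
u4 run_tos_cached(const uint8_t *code) {
    u4 stack[16];
    u4 locals[1] = { 0 };
    u4 *sp = stack;     /* top of the in-memory part of the stack */
    u4 tos = 0;         /* cached top of stack; a dummy value while the stack is empty */
    const uint8_t *pc = code;

    for (;;) {
        switch (*pc++) {
        case OP_ICONST_0:
            assert(sp < stack + 15);
            *(++sp) = tos;          /* spill the old top to memory */
            tos = (u4) 0;
            break;
        case OP_ICONST_1:
            assert(sp < stack + 15);
            *(++sp) = tos;
            tos = (u4) 1;
            break;
        case OP_IADD: {
            int32_t v2 = (int32_t) tos;          /* no load needed for v2 */
            int32_t v1 = (int32_t) (*(sp--));
            tos = (u4) (v1 + v2);                /* no store needed for the result */
            break;
        }
        case OP_ISTORE_0:
            locals[0] = tos;
            tos = *(sp--);          /* refill the cache from memory */
            break;
        case OP_HALT:
            return locals[0];
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

Note that the first push spills the dummy initial tos into stack[1]; that
slot is read back only when the last real element is popped, so it never
reaches the program.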
     

What about caching *two* elements (tos and atos)?

 IADD:
   C code:
	int32 v2 = (int32) tos;
        int32 v1 = (int32) atos; 
        tos = (u4) (v1+v2);
        atos = *(sp--);
	
   machine code (TOS in r1, ATOS in r2, sp in r9):
	add r1,r2,r1
        ld [r9],r2
        sub r9,4,r9
       

 - No better: still 3 instructions and 1 load. But if we allow an instruction to end
   in a *different* cache state than it starts in, we can do better for many
   instructions. This is "generalized stack caching" (Ertl).

 [see diagram s0,s1,s2   -- with the three ops.]

 - Can do dynamically, just by having three dispatch routines
   (switch on state as well as opcode).

 - Or *statically*: have three varieties of each opcode, and choose
   which one to use at compile time. Issue: what to do at join points,
   where control-flow paths may arrive in different states.
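
A sketch of both variants for straight-line code. Everything here is an
assumption made for illustration: the opcode numbering, the names (KEY,
next_state, step, run_dynamic, specialize, run_static), and in particular
the state layout (s1 keeps the top in a; s2 keeps the top in b and the
next element in a). A real interpreter would inline step so that a and b
live in hardware registers, and would let each routine set the next state
itself rather than look it up in a table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical opcode numbering, for illustration only. */
enum { OP_ICONST_0, OP_ICONST_1, OP_IADD, OP_ISTORE_0, N_OPS };

/* One specialized opcode per (state, opcode) pair. */
#define KEY(state, op) ((state) * N_OPS + (op))

/* Assumed transitions: next_state[s][op] is the number of cached
   elements after running op with s elements cached. */
static const uint8_t next_state[3][N_OPS] = {
    /*          ICONST_0  ICONST_1  IADD  ISTORE_0 */
    /* s0 */ {  1,        1,        1,    0 },
    /* s1 */ {  2,        2,        1,    0 },
    /* s2 */ {  2,        2,        1,    1 },
};

typedef struct {
    u4 stack[16];
    u4 *sp;        /* top of the in-memory part; stack[0] is unused */
    u4 a, b;       /* the cache "registers" */
    u4 locals[1];
    int mem_ops;   /* loads/stores on the in-memory stack */
} VM;

static void vm_init(VM *vm) {
    vm->sp = vm->stack;
    vm->a = vm->b = 0;
    vm->locals[0] = 0;
    vm->mem_ops = 0;
}

/* Runs the variant of an opcode that matches the current cache state. */
static void step(VM *vm, int key) {
    u4 k = (u4) (key % N_OPS - OP_ICONST_0);   /* value pushed by ICONST_k */
    switch (key) {
    case KEY(0, OP_ICONST_0): case KEY(0, OP_ICONST_1):
        vm->a = k; break;
    case KEY(1, OP_ICONST_0): case KEY(1, OP_ICONST_1):
        vm->b = k; break;
    case KEY(2, OP_ICONST_0): case KEY(2, OP_ICONST_1):
        assert(vm->sp < vm->stack + 15);
        *(++vm->sp) = vm->a; vm->mem_ops++;    /* cache full: spill deepest element */
        vm->a = vm->b; vm->b = k; break;
    case KEY(0, OP_IADD): {
        u4 v2 = *(vm->sp--);
        u4 v1 = *(vm->sp--);
        vm->mem_ops += 2;
        vm->a = v1 + v2; break;                /* unsigned add: same bits as int32 add */
    }
    case KEY(1, OP_IADD):
        vm->a = *(vm->sp--) + vm->a; vm->mem_ops++; break;
    case KEY(2, OP_IADD):
        vm->a = vm->a + vm->b; break;          /* no memory traffic at all */
    case KEY(0, OP_ISTORE_0):
        vm->locals[0] = *(vm->sp--); vm->mem_ops++; break;
    case KEY(1, OP_ISTORE_0):
        vm->locals[0] = vm->a; break;
    case KEY(2, OP_ISTORE_0):
        vm->locals[0] = vm->b; break;
    default:
        assert(!"unknown opcode");
    }
}

/* Dynamic variant: the cache state is tracked at run time. */
u4 run_dynamic(const uint8_t *code, size_t n, int *mem_ops) {
    VM vm; int state = 0;
    vm_init(&vm);
    for (size_t i = 0; i < n; i++) {
        assert(code[i] < N_OPS);
        step(&vm, KEY(state, code[i]));        /* switch on state as well as opcode */
        state = next_state[state][code[i]];
    }
    *mem_ops = vm.mem_ops;
    return vm.locals[0];
}

/* Static variant, part 1: the translator tracks the state ahead of time. */
void specialize(const uint8_t *in, size_t n, uint8_t *out) {
    int state = 0;
    for (size_t i = 0; i < n; i++) {
        assert(in[i] < N_OPS);
        out[i] = (uint8_t) KEY(state, in[i]);
        state = next_state[state][in[i]];
    }
}

/* Static variant, part 2: dispatch on the specialized opcode only. */
u4 run_static(const uint8_t *spec, size_t n, int *mem_ops) {
    VM vm;
    vm_init(&vm);
    for (size_t i = 0; i < n; i++)
        step(&vm, spec[i]);
    *mem_ops = vm.mem_ops;
    return vm.locals[0];
}
```

Pushing 1 three times, adding twice, and storing touches the in-memory
stack only twice here (one spill, one reload). The translator assumes
straight-line code starting in s0; at a join point it would need some
reconciliation step, for example forcing every incoming path into one
agreed state, which this sketch does not attempt.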


Aside: Why stack-based interpreters?

 - Despite some early attempts otherwise, hardware processors almost always use registers.
 - Each HW instruction is parameterized by its registers (from a small fixed set).
 - Why is this good? Easy to decode (in parallel).
 - Why not for software machines too?
   - Stack machines take arguments implicitly; less (serial) decoding overhead.
   - Software registers cannot easily be kept in hardware registers [no equivalent to stack caching]:
      the register number is a run-time operand, so we end up indexing an in-memory array.
 - On the other hand, register architectures require fewer instructions; hence less *dispatch*.
   - So maybe not a crazy idea (e.g., Parrot for Perl).
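
A sketch of the trade-off in a register-style software machine. The
opcodes, the byte encoding, and the name run_register_vm are hypothetical
and chosen only for illustration; they do not describe Parrot or any other
real VM.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t u4;

/* Hypothetical register-VM opcodes and encoding, for illustration only:
     ROP_LOADI dst, imm8     regs[dst] = imm8
     ROP_ADD   dst, s1, s2   regs[dst] = regs[s1] + regs[s2]
     ROP_HALT  src           return regs[src]                  */
enum { ROP_LOADI, ROP_ADD, ROP_HALT };
enum { N_REGS = 8 };

/* Runs code until ROP_HALT; reports how many dispatches were needed. */
u4 run_register_vm(const uint8_t *code, int *dispatches) {
    u4 regs[N_REGS] = { 0 };   /* software registers: an in-memory array */
    const uint8_t *pc = code;
    int n = 0;

    for (;;) {
        n++;                   /* one dispatch per instruction */
        switch (*pc++) {
        case ROP_LOADI: {
            uint8_t dst = *pc++;            /* serial operand decoding */
            uint8_t imm = *pc++;
            assert(dst < N_REGS);
            regs[dst] = imm;
            break;
        }
        case ROP_ADD: {
            uint8_t dst = *pc++;
            uint8_t s1 = *pc++;
            uint8_t s2 = *pc++;
            assert(dst < N_REGS && s1 < N_REGS && s2 < N_REGS);
            regs[dst] = regs[s1] + regs[s2];   /* indexed loads and store */
            break;
        }
        case ROP_HALT: {
            uint8_t src = *pc++;
            assert(src < N_REGS);
            *dispatches = n;
            return regs[src];
        }
        default:
            assert(!"unknown opcode");
            return 0;
        }
    }
}
```

The single ROP_ADD does the work of a four-instruction stack sequence
(load, load, IADD, store) in one dispatch, but pays for it with three
operand decodes and three indexed accesses to the regs array.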