INTERPRETER PERFORMANCE (2)
===========================

Next, what about speeding up the dispatch code?  First of all, what
does it look like now?  The following shows SPARC code corresponding
to the dispatch loop of interp.c:

----------
        %l1 = pc
        %l6 = table

top:    ldub    [%l1], %o1
        and     %o1, 0xff, %g1
        cmp     %g1, 200
        bgu     undefined
        nop
        sll     %g1, 2, %g1
        ld      [%l6+%g1], %o4
        jmp     %o4
        nop

table:  .word   nop_snippet
        .word   aconst_null_snippet
        .word   iconst_m1_snippet
        .word   iconst_0_snippet
        ...
        .word   undefined
        ...
        .word   goto_w_snippet

nop_snippet:
        b       top
        add     %l1,1,%l1       ! delay slot

aconst_null_snippet:
        ...
        b       top
        add     %l1,1,%l1       ! delay slot

undefined:
        ! issue error and die
---------------------

Performance problems:
 - bounds check
 - two jumps per dispatch

First fix: (indirect) threaded code:
 - If we can do our own indirect jumps, we can get rid of the check
   and replicate the dispatch.
 - Not possible in ANSI standard C, but can be done using GCC's &&
   (label-address) operator -- see interp_x2.c.
 - Improves performance somewhat, partly due to improved utilization
   of the branch target buffer.

(Side question: how important is portability for an interpreter?)

Second, better fix: *direct* threaded code.
 - Idea: replace each bytecode by the address of its snippet!
 - The dispatch table disappears.
 - Dispatch is done by goto **pc; at the end of each snippet.
 - This improves performance by saving a fetch.
 - BUT: notice that we don't have the original bytecode any more.
 - Must rewrite before execution.
   (Query: what to do with the parameter bytes following opcodes?)
 - Since we're doing that, there are *many* opportunities to clean
   things up:
   -- short-circuit constant pool references
   -- combine code for similar opcodes
   -- do static stack caching
   -- etc.

If we *really* want to improve dispatch times, how about doing fewer
dispatches?
 - Idea: *combine* adjacent instructions into "macro" or
   "super"-instructions.
 - Example: LOAD BIPUSH ADD STORE
   (Oh look: this is already the IINC instruction!  But we can do the
   same for other combinations.)
We can do this:

 statically: for a large group of programs, with a fixed target VM.
  - (Note: the VM can be constructed semi-automatically from snippets.)
  - We have the opportunity to optimize across snippet boundaries when
    compiling the VM.
  - We can use dynamic traces from a workload to make static decisions.

 statically: for a single program, sending the encoding with the
 program.
  - (Note: this should also decrease the size of the program!)

 dynamically: build "superinstructions" on the fly from snippet code.
  - (See the paper by Piumarta & Riccardi.)