CLASS LAYOUT REVISITED
======================

	----------------------       	   ----------------------
obj -->	|    class  ptr -----|-----------> |   constant pool ---|---->
	----------------------             ----------------------
	|   instance var 0   |            /|   method  0  ------|----> code
	----------------------           / ----------------------
	|   instance var 1   |          /  |   method  1  ------|----> code
	----------------------         /   ----------------------
	|        ...         |        /    |       ...          |
	----------------------       /     ----------------------
	|   instance var n   |      /      |   method m   ------|----> code
	----------------------     /       ----------------------
			          /        |  class var 1       | 
                                 /         ----------------------
        ----------------------  /          |     ...            | 
obj --> |   class  ptr ------|--	   ----------------------
        ----------------------             |  class var p       | 
	|      ....          |		   ----------------------
        ----------------------             |  interface count   |
					   ----------------------
					   |  interface 1  -----|----> iface
					   ----------------------
					   |      ...           |
					   ----------------------
					   |  interface q  -----|----> iface
					   ----------------------

	  ----------------------     
iface-->  |  method 0 ---------|----> code
	  ----------------------     
          |       ...          |
	  ----------------------     
          |  method r ---------|----> code
          ----------------------


static method ----------------------> code


BYTECODE AND CONSTANT POOL BEFORE RESOLUTION
============================================

BYTECODE
  ---------------------
  |  GETFIELD  	      |
  ---------------------
  |  indexbyte1       | \
  ---------------------  offset i into constant pool
  |  indexbyte2       | /
  ---------------------

  
STACK                                    CONSTANT POOL BEFORE RESOLUTION
   -------      --------                              -----------------
   | obj | ---> |class |     constant_pool + 0 --->   |     ...       | 
   -------      --------                              -----------------
		| var 1|	           + i --->   |  Fieldref j,k | 
                --------                              -----------------
		| ...  |	  	  	      |     ...       |
 	        --------                              -----------------
                | var n|                   + j --->   | Class name s  |
		--------		              -----------------
				           + k --->   | Field name t  |
						      |    and type u |
                                                      -----------------
					   	      |     ...       |
                                                      -----------------
				           + s --->  *| String "cname"|
                                                      -----------------
				           + t --->  *| String "fname"|
                                                      -----------------
				           + u --->  *| String "fsign"|
                                                      -----------------

RESULTS OF RESOLUTION; QUICK OPCODES
====================================


STACK                                    CONSTANT POOL AFTER RESOLUTION
   -------      --------                              -----------------
   | obj | ---> |class |     constant_pool + 0 --->   |     ...       | 
   -------      --------                              ----------------- 
		| var 1|	           + i --->  *|   Field block-|-->
                --------                              -----------------
		|  ... |	  	  	      |     ...       |
 	        --------                              -----------------
                | var o|                   + j --->  *|   Class ptr --|-->
		--------		              -----------------
		|  ... |	           + k --->  *| Nm/typ hashcd |
                --------                              -----------------
		| var n|		   	      |     ...       |
                --------                              -----------------
				           + s --->  *| String "cname"|
                                                      -----------------
				           + t --->  *| String "fname"|
                                                      -----------------
				           + u --->  *| String "fsign"|
                                                      -----------------
		------------
Field block --> | Offset o |
	        ------------
		| Type code|
		------------
		|  ...     |
		------------

BYTECODE
  ---------------------
  |  GETFIELD_QUICK   |  (to fetch single words at small offsets)
  ---------------------
  |  instance offset o|
  ---------------------
  |    (unused)       |
  ---------------------

or

  ---------------------
  |  GETFIELD_QUICK_W |  (to fetch arbitrary fields or at larger offsets)
  ---------------------
  |  indexbyte1       | \
  ---------------------  offset i into constant pool, known to be resolved
  |  indexbyte2       | /
  ---------------------
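
A minimal Java sketch of the rewrite above, under simplifying assumptions:
the constant pool entry is either a symbolic String (unresolved) or an Integer
instance offset (resolved), and FIELD_OFFSETS is a hypothetical stand-in for
the field block lookup.  Opcode values and all names are invented for
illustration; offsets are assumed small enough to fit in one operand byte.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of lazy field resolution followed by opcode rewriting. */
final class QuickeningSketch {
    // Placeholder opcode values, not the real JVM encodings.
    static final int GETFIELD = 1;
    static final int GETFIELD_QUICK = 2;

    /** Hypothetical stand-in for field blocks: symbolic name -> instance offset. */
    static final Map<String, Integer> FIELD_OFFSETS = new HashMap<>();

    /** Counts how many times symbolic resolution actually ran. */
    static int resolutions = 0;

    /** Executes the field fetch at code[pc]; obj[0] plays the role of the class ptr. */
    static int getField(int[] code, int pc, Object[] pool, int[] obj) {
        if (code[pc] == GETFIELD) {
            int i = (code[pc + 1] << 8) | code[pc + 2];   // indexbyte1, indexbyte2
            if (pool[i] instanceof String) {              // still symbolic: resolve once
                pool[i] = FIELD_OFFSETS.get((String) pool[i]);
                resolutions++;
            }
            int offset = (Integer) pool[i];
            code[pc] = GETFIELD_QUICK;                    // rewrite the opcode in place
            code[pc + 1] = offset;                        // instance offset o
            code[pc + 2] = 0;                             // (unused)
        }
        return obj[code[pc + 1]];                         // quick path: direct indexed load
    }
}
```

The second execution of the same instruction sees GETFIELD_QUICK and skips
both the constant pool access and the resolution test.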

RESOLVING METHODS
=================

Similarly, instance method names resolve to pointers to method blocks.

Quick forms of method invocation instructions include

  ------------------------
  |  INVOKEVIRTUAL_QUICK |  (for methods at small offsets)
  ------------------------
  |  method code offset  |
  ------------------------
  |     nargs            |
  ------------------------

or

  --------------------------
  |  INVOKEVIRTUAL_QUICK_W |  (for methods at large offsets)
  --------------------------
  |  indexbyte1            | \
  --------------------------  offset i into constant pool
  |  indexbyte2            | /          (known to be resolved to method block)
  --------------------------

Interface method names resolve essentially to (interface ptr, offset).
To invoke the method, the VM must search the interfaces of the receiving
class until it finds one with a matching interface pointer.  The quick form
memoizes the matching interface entry.  The offset is within the interface.
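
A rough Java sketch of this search with a memoized guess.  The integer
interface ids stand in for interface pointers and the method names stand in
for code pointers; ClassInfo, CallSite and lookup are hypothetical names
used only for illustration.

```java
/** Sketch of interface-method dispatch with a memoized "guess" slot. */
final class InterfaceLookupSketch {
    /** The interface part of a class: ids plus one method table per interface. */
    static final class ClassInfo {
        final int[] interfaceIds;     // stand-ins for interface pointers
        final String[][] methods;     // methods[k][offset] stands in for a code pointer
        ClassInfo(int[] interfaceIds, String[][] methods) {
            this.interfaceIds = interfaceIds;
            this.methods = methods;
        }
    }

    /** A resolved call site: (interface, offset) plus the memoized table index. */
    static final class CallSite {
        final int interfaceId;
        final int offset;
        int guess = 0;                // index that matched last time
        int searches = 0;             // how often the slow search ran
        CallSite(int interfaceId, int offset) {
            this.interfaceId = interfaceId;
            this.offset = offset;
        }
    }

    static String lookup(CallSite site, ClassInfo receiver) {
        int g = site.guess;
        // Fast path: the memoized entry still matches this receiver's table.
        if (g >= receiver.interfaceIds.length || receiver.interfaceIds[g] != site.interfaceId) {
            site.searches++;
            g = -1;
            for (int k = 0; k < receiver.interfaceIds.length; k++) {
                if (receiver.interfaceIds[k] == site.interfaceId) { g = k; break; }
            }
            if (g < 0) throw new IncompatibleClassChangeError("interface not implemented");
            site.guess = g;           // memoize for the next invocation
        }
        return receiver.methods[g][site.offset];
    }
}
```

The guess must be revalidated on every call, since receivers of different
classes may list the same interface at different positions.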

JIT COMPILING EXAMPLE
=====================

Java source code:

public static void foo(int a[], int b[], int i) {
	a[i] = (2 * a[i] + b[i]);
}


JVM code:

Method void foo(int[], int[], int)
   0 aload_0   ; a
   1 iload_2   ; i
   2 iconst_2  
   3 aload_0   ; a
   4 iload_2   ; i
   5 iaload    ; a[i]
   6 imul      ; a[i] * 2
   7 aload_1   ; b
   8 iload_2   ; i
   9 iaload    ; b[i]
  10 iadd      ; a[i] * 2 + b[i];
  11 iastore   ; a[i] 
  12 return
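
For reference, a minimal Java sketch of how an interpreter would execute
exactly this instruction sequence.  The Op enum is a symbolic stand-in for
the real opcode bytes, and only the bytecodes used by foo are handled.

```java
/** Sketch: a stack interpreter for just the bytecodes that foo uses. */
final class FooInterpreterSketch {
    // Symbolic opcodes for this sketch only; not the real JVM encodings.
    enum Op { ALOAD_0, ALOAD_1, ILOAD_2, ICONST_2, IALOAD, IMUL, IADD, IASTORE, RETURN }

    /** The 13 instructions of foo, in order. */
    static final Op[] FOO = {
        Op.ALOAD_0, Op.ILOAD_2, Op.ICONST_2, Op.ALOAD_0, Op.ILOAD_2, Op.IALOAD,
        Op.IMUL, Op.ALOAD_1, Op.ILOAD_2, Op.IALOAD, Op.IADD, Op.IASTORE, Op.RETURN
    };

    static void run(Op[] code, int[] a, int[] b, int i) {
        Object[] stack = new Object[5];          // max operand stack depth of foo is 5
        int sp = 0;                              // index of the next free slot
        for (Op op : code) {
            switch (op) {
                case ALOAD_0:  stack[sp++] = a; break;
                case ALOAD_1:  stack[sp++] = b; break;
                case ILOAD_2:  stack[sp++] = i; break;
                case ICONST_2: stack[sp++] = 2; break;
                case IALOAD: {
                    int index = (Integer) stack[--sp];
                    int[] array = (int[]) stack[--sp];
                    stack[sp++] = array[index];  // Java performs the null and bounds checks
                    break;
                }
                case IMUL: {
                    int y = (Integer) stack[--sp];
                    int x = (Integer) stack[--sp];
                    stack[sp++] = x * y;
                    break;
                }
                case IADD: {
                    int y = (Integer) stack[--sp];
                    int x = (Integer) stack[--sp];
                    stack[sp++] = x + y;
                    break;
                }
                case IASTORE: {
                    int value = (Integer) stack[--sp];
                    int index = (Integer) stack[--sp];
                    int[] array = (int[]) stack[--sp];
                    array[index] = value;
                    break;
                }
                case RETURN:
                    return;
            }
        }
    }
}
```

Every push and pop here is a memory access plus an sp adjustment, which is
the cost the JIT versions try to remove.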

JIT code generation.

Assume RISC-like load/store architecture, with
registers r0,r1,r2,...,fp,sp.  Destination operand is written last.

Instructions: move r/c,r
              add r,r/c,r  
	      mul r,r,r
	      sll c,r,r
	      cmp r,r/c
	      br{eq/neq/leu/...} lab   (brneq is abbreviated bneq in the listings)
	      load c(r),r
	      store r,c(r)
	      ret

Assume constants have been resolved prior to JIT generation.

STACK LAYOUT
============

Stack layout for procedure with m locals, of which n are args;
and max stack depth s.

	----------------------
	| local 0 = arg 0    |
	----------------------
	|       ...          |
	----------------------
	| local n-1 = arg n-1|
	----------------------
	|    local n         |
	----------------------
	|      ...           |
	----------------------
	|    local m-1       |
	----------------------
	|  return addr       |
	----------------------
fp ->	|   saved fp         |
	----------------------
	|  operand stack 0   |
	----------------------
	|  operand stack 1   |
	----------------------
	|      ...           |
	----------------------
sp ->	|  operand stack p   |  (p < s)
	----------------------


STACK LAYOUT EXAMPLE
====================

For routine foo:

	----------------------
fp+16	|    arg 0 (a)       |
	----------------------
fp+12	|    arg 1 (b)       |
	----------------------
fp+8	|    arg 2 (i)       |
	----------------------
fp+4 	|  return addr       |
	----------------------
fp --->	|   saved fp         |  <---
	----------------------      
fp-4	|  operand stack 0   |      
	----------------------
fp-8	|  operand stack 1   |    possible
	----------------------     sp 
fp-12	|  operand stack 2   |    values
	----------------------
fp-16	|  operand stack 3   |
	----------------------
fp-20	|  operand stack 4   |  <--- 
	----------------------

Integer Arrays:

  	----------------------
a --->	|   class ptr -------|-------> Object
	----------------------
a+4	|      length        |
	----------------------
a+8	|      a[0]          | 
	----------------------
a+12	|      a[1]          | 
	----------------------
 	|      ...           |
	----------------------
a+8+i*4	|      a[i]          | 
	----------------------
        |      ...           |
        ----------------------
	|      a[length-1]   | 
	----------------------
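
A small Java sketch of the address arithmetic implied by this layout, with
the bounds check folded into a single unsigned comparison (the role played
by brltu in the generated code).  The class and constant names are
assumptions made for illustration.

```java
/** Sketch: byte offset of a[i] for the integer array layout shown above. */
final class ArrayAccessSketch {
    static final int HEADER_BYTES = 8;   // class ptr (4 bytes) + length (4 bytes)

    /** Returns the offset of element i from the array base, or throws if out of range. */
    static int elementOffset(int length, int i) {
        // Treating i as unsigned makes any negative index look huge, so one
        // comparison rejects both i < 0 and i >= length.
        if (Integer.compareUnsigned(i, length) >= 0) {
            throw new ArrayIndexOutOfBoundsException(i);
        }
        return HEADER_BYTES + (i << 2);  // a + 8 + i*4
    }
}
```

For example, elementOffset(10, 3) is 20, matching a+8+i*4 with i = 3.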


JIT CODE VERSION 0
==================

Very naive; just lay down code equivalent to what's executed by interpreter.

foo:            store fp,(sp)	  ; save old frame pointer
		move sp,fp	  ; new frame pointer
;aload_0
	    	load 16(fp),r0    ; fetch a
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; a
;iload_2	
		load 8(fp),r0	  ; fetch i
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; i
;iconst_2
		move #2,r0
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; 2
;aload_0 
	   	load 16(fp),r0    ; fetch a
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; a
;iload_2	
		load 8(fp),r0     ; fetch i
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; i
;iaload	
		load 4(sp),r0     ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L1           ; branch if ok
                ...raise exception...	   
	    L1: load 4(r0),r1     ; array length
		load (sp),r2      ; index
		cmp r2,r1         ; bounds check
		brltu L2          ; branch if ok
		... raise exception ...
            L2: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load 8(r2),r2     ; fetch a[i] 
		add sp,#4,sp	  ; pop one
		store r2,(sp)     ; push a[i] (replace)
;imul
		load (sp),r0      ; fetch a[i]
		load 4(sp),r1     ; 2
		mul r0,r1,r1      ; a[i] * 2
		add sp,#4,sp	  ; pop one
		store r1,(sp)     ; push a[i]*2 (replace)
;aload_1 
	        load 12(fp),r0    ; fetch b
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; b
;iload_2				
		load 8(fp),r0	  ; fetch i
		add sp,#-4,sp	  ; push
		store r0,(sp)     ; i
;iaload		
		load 4(sp),r0     ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L3           ; branch if ok
                ...raise exception...	   
	    L3: load 4(r0),r1     ; array length
		load (sp),r2      ; index
		cmp r2,r1         ; bounds check
		brltu L4          ; branch if ok
		... raise exception ...
            L4: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load 8(r2),r2     ; fetch b[i]
		add sp,#4,sp	  ; pop one		
		store r2,(sp)     ; push b[i] (replace)
; iadd		
		load 4(sp),r0     ; fetch 2*a[i]
		load (sp),r1	  ; fetch b[i]
		add r0,r1,r1	  ; 2*a[i] + b[i]
		add sp,#4,sp	  ; pop one
		store r1,(sp)     ; push 2*a[i] + b[i] (replace)
; iastore
		load 8(sp),r0     ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L5           ; branch if ok
                ...raise exception...	   
	    L5: load 4(r0),r1     ; array length
		load 4(sp),r2     ; index
		cmp r2,r1         ; bounds check
		brltu L6          ; branch if ok
		... raise exception ...
            L6: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load (sp),r0	  ; fetch value
		store r0,8(r2)	  ; store a[i]
		add sp,#12,sp	  ; pop three
; return
		load (sp),fp	  ; restore frame pointer
		ret
			        		
71 instructions; 23 loads; 13 stores.

JIT CODE VERSION 1
==================

Since we always know the current stack depth at each bytecode, we can
treat the operand stack as an array at fixed offsets from fp, saving the
cost of adjusting sp each time we push or pop.

foo:            store fp,(sp)	  ; save old frame pointer
		move sp,fp	  ; new frame pointer
		add sp,#-20,sp    ; make space for operand stack
;aload_0
	    	load 16(fp),r0    ; fetch a
		store r0,-4(fp)   ; push a
;iload_2	
		load 8(fp),r0	  ; fetch i
		store r0,-8(fp)   ; push i
;iconst_2
		move #2,r0
		store r0,-12(fp)  ; push 2
;aload_0 
	   	load 16(fp),r0    ; fetch a
		store r0,-16(fp)  ; push a
;iload_2	
		load 8(fp),r0     ; fetch i
		store r0,-20(fp)  ; push i
;iaload	
		load -16(fp),r0   ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L1           ; branch if ok
                ...raise exception...	   
	    L1: load 4(r0),r1     ; array length
		load -20(fp),r2   ; index
		cmp r2,r1         ; bounds check
		brltu L2          ; branch if ok
		... raise exception ...
            L2: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load 8(r2),r2     ; fetch a[i]
		store r2,-16(fp)  ; push
;imul
		load -16(fp),r0   ; fetch a[i]
		load -12(fp),r1   ; 2
		mul r0,r1,r1      ; a[i] * 2
		store r1,-12(fp)  ; push a[i]*2
;aload_1 
	        load 12(fp),r0    ; fetch b
		store r0,-16(fp)  ; push b
;iload_2				
		load 8(fp),r0	  ; fetch i
		store r0,-20(fp)  ; push i
;iaload		
		load -16(fp),r0   ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L3           ; branch if ok
                ...raise exception...	   
	    L3: load 4(r0),r1     ; array length
		load -20(fp),r2   ; index
		cmp r2,r1         ; bounds check
		brltu L4          ; branch if ok
		... raise exception ...
            L4: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load 8(r2),r2     ; fetch b[i]
		store r2,-16(fp)  ; push
; iadd		
		load -12(fp),r0   ; fetch 2*a[i]
		load -16(fp),r1	  ; fetch b[i]
		add r0,r1,r1	  ; 2*a[i] + b[i]
		store r1,-12(fp)  ; push
; iastore
		load -4(fp),r0    ; fetch array ptr
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L5           ; branch if ok
                ...raise exception...	   
	    L5: load 4(r0),r1     ; array length
		load -8(fp),r2    ; index
		cmp r2,r1         ; bounds check
		brltu L6          ; branch if ok
		... raise exception ...
            L6: sll #2,r2,r2      ; calculate byte offset 
		add r0,r2,r2      ; add array base
		load -12(fp),r0	  ; fetch value
		store r0,8(r2)	  ; store a[i]
; return
		move fp,sp	  ; pop operand stack frame
		load (sp),fp	  ; restore frame pointer
		ret

61 instructions; 23 loads; 13 stores.
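
The fixed fp offsets used in Version 1 can be derived mechanically.  A Java
sketch, assuming straight-line code (no branches) and handling only the
bytecodes that appear in foo; the method names are hypothetical.

```java
/** Sketch: compute the fp-relative slot each bytecode of foo writes its result to. */
final class StackSlotSketch {
    /** Operand stack effect of a bytecode as {pops, pushes}. */
    static int[] effect(String op) {
        switch (op) {
            case "aload_0": case "aload_1": case "iload_2": case "iconst_2":
                return new int[] {0, 1};
            case "iaload": case "imul": case "iadd":
                return new int[] {2, 1};
            case "iastore":
                return new int[] {3, 0};
            case "return":
                return new int[] {0, 0};
            default:
                throw new IllegalArgumentException("unhandled bytecode: " + op);
        }
    }

    /** Offset from fp of the slot written by each instruction (0 if it pushes nothing). */
    static int[] resultOffsets(String[] code) {
        int depth = 0;                        // stack depth before the current instruction
        int[] out = new int[code.length];
        for (int pc = 0; pc < code.length; pc++) {
            int[] e = effect(code[pc]);
            depth -= e[0];                    // operands consumed
            if (e[1] == 1) {
                depth++;                      // result produced
                out[pc] = -4 * depth;         // operand stack slot k lives at fp - 4*(k+1)
            }
        }
        return out;
    }
}
```

For foo this gives -4, -8, -12, -16, -20 for the first five pushes, then -16
for the first iaload and -12 for imul, as in the listing above.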

JIT CODE VERSION 2
==================

Cache stack and local variables in registers, spilling 
(not shown here) if necessary.

Note: This approach may not buy too much on a machine with few registers,
because of greatly increased spilling.

Register assignments:

Stack registers: r0,...,r4

Local variable registers: r10,r11,r12

Scratch registers: r20

In reality, arguments are likely to be in registers already, so those
local variables can just be assumed to be in place.

foo:            store fp,(sp)	  ; save old frame pointer
		move sp,fp	  ; new frame pointer
;aload_0
	    	load 16(fp),r10   ; fetch a
		move r10,r0	  ; push a
;iload_2	
		load 8(fp),r12	  ; fetch i
		move r12,r1	  ; push i
;iconst_2
		move #2,r2	  ; push 2
;aload_0 
		move r10,r3	  ; push a
;iload_2	
		move r12,r4	  ; push i
;iaload	
		cmp r3,#0         ; null check (can often use MMU instead)
		bneq L1           ; branch if ok
                ...raise exception...	   
	    L1: load 4(r3),r20    ; array length
		cmp r4,r20        ; bounds check
		brltu L2          ; branch if ok
		... raise exception ...
            L2: sll #2,r4,r4      ; calculate byte offset 
		add r3,r4,r4      ; add array base
		load 8(r4),r3     ; fetch and push a[i] 
;imul
		mul r2,r3,r2      ; push a[i]*2
;aload_1 
	        load 12(fp),r11   ; fetch b
		move r11,r3	  ; push b
;iload_2				
		move r12,r4	  ; push i
;iaload		
		cmp r3,#0         ; null check (can often use MMU instead)
		bneq L3           ; branch if ok
                ...raise exception...	   
	    L3: load 4(r3),r20     ; array length
		cmp r4,r20        ; bounds check
		brltu L4          ; branch if ok
		... raise exception ...
            L4: sll #2,r4,r4      ; calculate byte offset 
		add r3,r4,r4      ; add array base
		load 8(r4),r3     ; fetch and push b[i]
; iadd		
		add r2,r3,r2      ; push a[i]*2 + b[i]
; iastore
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L5           ; branch if ok
                ...raise exception...	   
	    L5: load 4(r0),r20    ; array length
		cmp r1,r20        ; bounds check
		brltu L6          ; branch if ok
		... raise exception ...
            L6: sll #2,r1,r1      ; calculate byte offset 
		add r0,r1,r1      ; add array base
		store r2,8(r1)    ; store a[i]
; return
		load (sp),fp	  ; restore frame pointer
		ret

40 instructions; 9 loads; 2 stores.

JIT(?) CODE VERSION 3
=====================

Moderately optimized via constant propagation,
common subexpression elimination, caching of intermediate
calculations from array manipulations.

If we need to spill, it's also desirable to have last-use information to avoid 
doing reloads.

Use registers r0,... on greedy basis; reuse when no further use.

Is it feasible to do this level of optimization in a JIT?

foo:            store fp,(sp)	  ; save old frame pointer
		move sp,fp	  ; new frame pointer
;aload_0
	    	load 16(fp),r0    ; fetch a
;iload_2	
		load 8(fp),r1	  ; fetch i
;iconst_2
;aload_0 
;iload_2	
;iaload	
		cmp r0,#0         ; null check (can often use MMU instead)
		bneq L1           ; branch if ok
                ...raise exception...	   
	    L1: load 4(r0),r2     ; array length
		cmp r1,r2         ; bounds check
		brltu L2          ; branch if ok
		... raise exception ...
            L2: sll #2,r1,r2      ; calculate byte offset 
		add r0,r2,r0      ; add array base
		load 8(r0),r3     ; fetch a[i] 
;imul
		sll #1,r3,r3	  ; do a[i]*2 via shift
;aload_1 
	        load 12(fp),r4    ; fetch b
;iload_2				
;iaload		
		cmp r4,#0         ; null check (can often use MMU instead)
		bneq L3           ; branch if ok
                ...raise exception...	   
	    L3: load 4(r4),r5     ; array length
		cmp r1,r5         ; bounds check
		brltu L4          ; branch if ok
		... raise exception ...
            L4: add r2,r4,r1      ; add array base
		load 8(r1),r1     ; fetch b[i]
; iadd		
		add r3,r1,r1      ; a[i]*2 + b[i]
; iastore
		store r1,8(r0)	  ; store a[i]
; return
		load (sp),fp	  ; restore frame pointer
		ret

25 instructions; 8 loads; 2 stores.

"HOT SPOT" JIT COMPILATION
==========================


- Ideas derived from SELF project (Stanford, Sun Labs); under development
for Java now.

- It's ok to spend longer optimizing code if we're confident
it will be executed often.

- Incorporate profiling into execution engine and observe frequency
of method calls.

- When a method is clearly popular, devote resources to recompiling it
with full optimization, and replace it on the fly.

- Careful characterization of "popular" is non-trivial.

- Also incorporate inlining optimizations (extremely valuable).  Can
use profile statistics to estimate most likely target of each method
invocation and optimize for that case.
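
A toy Java sketch of the counting idea: a per-method invocation counter that
swaps in an "optimized" body once a threshold is crossed.  The threshold
value and all names are invented for illustration and say nothing about how
any real VM characterizes "popular".

```java
import java.util.function.IntUnaryOperator;

/** Sketch: count invocations and replace the method body once it looks hot. */
final class HotMethodSketch {
    static final int THRESHOLD = 3;           // arbitrary value for the sketch

    private final IntUnaryOperator optimized; // stands in for fully optimized code
    private IntUnaryOperator current;         // code currently installed for the method
    private int calls = 0;
    boolean recompiled = false;

    HotMethodSketch(IntUnaryOperator baseline, IntUnaryOperator optimized) {
        this.current = baseline;
        this.optimized = optimized;
    }

    int invoke(int arg) {
        // Profiling is built into the call path: every invocation bumps the counter.
        if (!recompiled && ++calls >= THRESHOLD) {
            current = optimized;              // replace the method on the fly
            recompiled = true;
        }
        return current.applyAsInt(arg);
    }
}
```

Callers see the same results before and after the switch; only the
installed code changes.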