CLASS LAYOUT REVISITED
======================

         ----------------------             ----------------------
obj -->  | class ptr     -----|-----------> | constant pool   ---|---->
         ----------------------             ----------------------
         | instance var 0     |            /| method 0     ------|----> code
         ----------------------           / ----------------------
         | instance var 1     |          /  | method 1     ------|----> code
         ----------------------         /   ----------------------
         | ...                |        /    | ...                |
         ----------------------       /     ----------------------
         | instance var n     |      /      | method m     ------|----> code
         ----------------------     /       ----------------------
                                   /        | class var 1        |
                                  /         ----------------------
         ----------------------  /          | ...                |
obj -->  | class ptr    ------|--           ----------------------
         ----------------------             | class var p        |
         | ....               |             ----------------------
         ----------------------             | interface count    |
                                            ----------------------
                                            | interface 1   -----|----> iface
                                            ----------------------
                                            | ...                |
                                            ----------------------
                                            | interface q   -----|----> iface
                                            ----------------------

         ----------------------
iface--> | method 0  ---------|----> code
         ----------------------
         | ...                |
         ----------------------
         | method r  ---------|----> code
         ----------------------

static method ----------------------> code


BYTECODE AND CONSTANT POOL BEFORE RESOLUTION
============================================

BYTECODE

    ---------------------
    | GETFIELD          |
    ---------------------
    | indexbyte1        | \
    ---------------------   offset i into constant pool
    | indexbyte2        | /
    ---------------------

STACK                   CONSTANT POOL BEFORE RESOLUTION

-------      --------                           -----------------
| obj | ---> |class |   constant_pool + 0 --->  | ...           |
-------      --------                           -----------------
             | var 1|                 + i --->  | Fieldref j,k  |
             --------                           -----------------
             | ...  |                           | ...           |
             --------                           -----------------
             | var n|                 + j --->  | Class name s  |
             --------                           -----------------
                                      + k --->  | Field name t  |
                                                |   and type u  |
                                                -----------------
                                                | ...           |
                                                -----------------
                                      + s ---> *| String "cname"|
                                                -----------------
                                      + t ---> *| String "fname"|
                                                -----------------
                                      + u ---> *| String "fsign"|
                                                -----------------


RESULTS OF RESOLUTION; QUICK OPCODES
====================================

STACK                   CONSTANT POOL AFTER RESOLUTION

-------      --------                           -----------------
| obj | ---> |class |   constant_pool + 0 --->  | ...           |
-------      --------                           -----------------
             | var 1|                 + i ---> *| Field block  -|-->
             --------                           -----------------
             | ...  |                           | ...           |
             --------                           -----------------
             | var o|                 + j ---> *| Class ptr   --|-->
             --------                           -----------------
             | ...  |                 + k ---> *| Nm/typ hashcd |
             --------                           -----------------
             | var n|                           | ...           |
             --------                           -----------------
                                      + s ---> *| String "cname"|
                                                -----------------
                                      + t ---> *| String "fname"|
                                                -----------------
                                      + u ---> *| String "fsign"|
                                                -----------------

                  ------------
Field block -->   | Offset o |
                  ------------
                  | Type code|
                  ------------
                  | ...      |
                  ------------

BYTECODE

    ---------------------
    | GETFIELD_QUICK    |   (to fetch single words at small offsets)
    ---------------------
    | instance offset o |
    ---------------------
    | (unused)          |
    ---------------------

or

    ---------------------
    | GETFIELD_QUICK_W  |   (to fetch arbitrary fields or at larger offsets)
    ---------------------
    | indexbyte1        | \
    ---------------------   offset i into constant pool, known to be resolved
    | indexbyte2        | /
    ---------------------


RESOLVING METHODS
=================

Similarly, instance method names resolve to pointers to method blocks.
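
The resolution step pictured above (shown for fields; methods are handled
analogously) can be sketched in Java. This is only an illustration under
stated assumptions, not how any particular VM is written: the class and
method names (QuickeningSketch, FieldRef, FieldBlock, LOADED_CLASSES,
resolve, getField) and the opcode numbers are hypothetical, the chain
Fieldref -> Class/NameAndType -> Strings is collapsed into one FieldRef
object, objects are modelled as int arrays with the class pointer in slot 0,
and only the small-offset GETFIELD_QUICK form is shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lazy field resolution and opcode quickening. */
final class QuickeningSketch {
    // Illustrative opcode numbers, NOT the real JVM encodings.
    static final int GETFIELD = 1;
    static final int GETFIELD_QUICK = 2;

    /** Unresolved constant pool entry: purely symbolic names. */
    static final class FieldRef {
        final String className, fieldName, signature;
        FieldRef(String className, String fieldName, String signature) {
            this.className = className;
            this.fieldName = fieldName;
            this.signature = signature;
        }
    }

    /** Resolved entry (the "field block"): instance offset plus type code. */
    static final class FieldBlock {
        final int offset;
        final char typeCode;
        FieldBlock(int offset, char typeCode) {
            this.offset = offset;
            this.typeCode = typeCode;
        }
    }

    /** Stand-in for loaded class metadata: class name -> (field name -> slot). */
    static final Map<String, Map<String, Integer>> LOADED_CLASSES = new HashMap<>();

    /** Replace the symbolic entry at index i by a field block; idempotent. */
    static FieldBlock resolve(Object[] pool, int i) {
        if (pool[i] instanceof FieldBlock) {
            return (FieldBlock) pool[i];           // already resolved
        }
        FieldRef ref = (FieldRef) pool[i];
        int offset = LOADED_CLASSES.get(ref.className).get(ref.fieldName);
        FieldBlock block = new FieldBlock(offset, ref.signature.charAt(0));
        pool[i] = block;                           // overwrite the pool slot
        return block;
    }

    /** Execute the 3-byte field fetch at pc, rewriting GETFIELD in place. */
    static int getField(int[] code, int pc, Object[] pool, int[] obj) {
        if (code[pc] == GETFIELD) {
            int i = (code[pc + 1] << 8) | code[pc + 2]; // indexbyte1, indexbyte2
            FieldBlock block = resolve(pool, i);
            code[pc] = GETFIELD_QUICK;             // later executions skip resolution
            code[pc + 1] = block.offset;           // instance offset o
            code[pc + 2] = 0;                      // unused
        }
        return obj[code[pc + 1]];
    }
}
```

The first execution pays for the symbolic lookup; every later execution of
the same instruction is a single indexed load.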
Quick forms of method invocation instructions include

    --------------------------
    | INVOKEVIRTUAL_QUICK    |   (for methods at small offsets)
    --------------------------
    | method code offset     |
    --------------------------
    | nargs                  |
    --------------------------

or

    --------------------------
    | INVOKEVIRTUAL_QUICK_W  |   (for methods at large offsets)
    --------------------------
    | indexbyte1             | \
    --------------------------   offset i into constant pool
    | indexbyte2             | /  (known to be resolved to method block)
    --------------------------

Interface method names resolve essentially to (interface ptr, offset).
To invoke the method, the VM must search the interfaces in the receiving
class until one with a matching interface pointer is found. The quick form
memoizes the matching interface entry. The offset is within the interface.


JIT COMPILING EXAMPLE
=====================

Java source code:

    public static void foo(int a[], int b[], int i) {
        a[i] = (2 * a[i] + b[i]);
    }

JVM code:

    Method void foo(int[], int[], int)
       0 aload_0        ; a
       1 iload_2        ; i
       2 iconst_2
       3 aload_0        ; a
       4 iload_2        ; i
       5 iaload         ; a[i]
       6 imul           ; a[i] * 2
       7 aload_1        ; b
       8 iload_2        ; i
       9 iaload         ; b[i]
      10 iadd           ; a[i] * 2 + b[i]
      11 iastore        ; a[i]
      12 return

JIT code generation.

Assume a RISC-like load/store architecture, with registers
r0,r1,r2,...,fp,sp.

Instructions:

    move   r,r
    add    r,r/c,r
    sll    c,r,r
    cmp    r,r/c
    br{eq/neq/leu/...} lab
    load   c(r),r
    store  r,c(r)

Assume constants have been resolved prior to JIT generation.


STACK LAYOUT
============

Stack layout for a procedure with m locals, of which n are args, and
max stack depth s.

          ----------------------
          | local 0 = arg 0    |
          ----------------------
          | ...                |
          ----------------------
          | local n-1 = arg n-1|
          ----------------------
          | local n            |
          ----------------------
          | ...                |
          ----------------------
          | local m-1          |
          ----------------------
          | return addr        |
          ----------------------
   fp ->  | saved fp           |
          ----------------------
          | operand stack 0    |
          ----------------------
          | operand stack 1    |
          ----------------------
          | ...                |
          ----------------------
   sp ->  | operand stack p    |   (p < s)
          ----------------------


STACK LAYOUT EXAMPLE
====================

For routine foo:

                   ----------------------
          fp+16    | arg 0 (a)          |
                   ----------------------
          fp+12    | arg 1 (b)          |
                   ----------------------
          fp+8     | arg 2 (i)          |
                   ----------------------
          fp+4     | return addr        |
                   ----------------------
   fp --->         | saved fp           |  <---
                   ----------------------
          fp-4     | operand stack 0    |
                   ----------------------
          fp-8     | operand stack 1    |   possible
                   ----------------------      sp
          fp-12    | operand stack 2    |    values
                   ----------------------
          fp-16    | operand stack 3    |
                   ----------------------
          fp-20    | operand stack 4    |  <---
                   ----------------------

Integer Arrays:

                   ----------------------
   a --->          | class ptr   -------|-------> Object
                   ----------------------
          a+4      | length             |
                   ----------------------
          a+8      | a[0]               |
                   ----------------------
          a+12     | a[1]               |
                   ----------------------
                   | ...                |
                   ----------------------
          a+8+i*4  | a[i]               |
                   ----------------------
                   | ...                |
                   ----------------------
                   | a[length-1]        |
                   ----------------------


JIT CODE VERSION 0
==================

Very naive; just lay down code equivalent to what's executed by the
interpreter.

foo:    store  fp,(sp)        ; save old frame pointer
        move   sp,fp          ; new frame pointer
;aload_0
        load   16(fp),r0      ; fetch a
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   a
;iload_2
        load   8(fp),r0       ; fetch i
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   i
;iconst_2
        move   #2,r0
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   2
;aload_0
        load   16(fp),r0      ; fetch a
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   a
;iload_2
        load   8(fp),r0       ; fetch i
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   i
;iaload
        load   4(sp),r0       ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L1             ; branch if ok
        ... raise exception ...
L1:     load   4(r0),r1       ; array length
        load   (sp),r2        ; index
        cmp    r2,r1          ; bounds check
        brltu  L2             ; branch if ok
        ... raise exception ...
L2:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   8(r2),r2       ; fetch a[i]
        add    sp,#4,sp       ; pop one
        store  r2,(sp)        ; push a[i] (replace)
;imul
        load   (sp),r0        ; fetch a[i]
        load   4(sp),r1       ; 2
        mul    r0,r1,r1       ; a[i] * 2
        add    sp,#4,sp       ; pop one
        store  r1,(sp)        ; push a[i]*2 (replace)
;aload_1
        load   12(fp),r0      ; fetch b
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   b
;iload_2
        load   8(fp),r0       ; fetch i
        add    sp,#-4,sp      ; push
        store  r0,(sp)        ;   i
;iaload
        load   4(sp),r0       ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L3             ; branch if ok
        ... raise exception ...
L3:     load   4(r0),r1       ; array length
        load   (sp),r2        ; index
        cmp    r2,r1          ; bounds check
        brltu  L4             ; branch if ok
        ... raise exception ...
L4:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   8(r2),r2       ; fetch b[i]
        add    sp,#4,sp       ; pop one
        store  r2,(sp)        ; push b[i] (replace)
;iadd
        load   4(sp),r0       ; fetch 2*a[i]
        load   (sp),r1        ; fetch b[i]
        add    r0,r1,r1       ; 2*a[i] + b[i]
        add    sp,#4,sp       ; pop one
        store  r1,(sp)        ; push 2*a[i] + b[i] (replace)
;iastore
        load   8(sp),r0       ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L5             ; branch if ok
        ... raise exception ...
L5:     load   4(r0),r1       ; array length
        load   4(sp),r2       ; index
        cmp    r2,r1          ; bounds check
        brltu  L6             ; branch if ok
        ... raise exception ...
L6:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   (sp),r0        ; fetch value
        store  r0,8(r2)       ; store a[i]
        add    sp,#12,sp      ; pop three
;return
        load   (sp),fp        ; restore frame pointer
        ret

71 instructions; 23 loads; 13 stores.


JIT CODE VERSION 1
==================

Since we always know the current stack depth at each instruction, we can
treat the operand stack as an array addressed at fixed offsets from fp,
saving the cost of adjusting sp each time we push or pop.
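
A rough sketch of how a code generator might derive those fixed offsets: walk
the bytecode once, track the stack depth, and assign each pushed result the
frame slot for that depth. This is an illustration only. The names
(StackSlotSketch, effect, slotOffset, resultOffsets) are hypothetical, only
the opcodes used by foo are handled, slots are assumed to be 4 bytes starting
at fp-4 as in the stack layout example, and the code is assumed to be
straight-line (with branches, one relies on the stack depth being the same on
every path into an instruction).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: map each operand-stack push to a fixed fp offset. */
final class StackSlotSketch {
    /** Stack effect {pops, pushes} for the opcodes that occur in foo. */
    static int[] effect(String op) {
        switch (op) {
            case "aload_0": case "aload_1": case "iload_2": case "iconst_2":
                return new int[] {0, 1};
            case "iaload": case "imul": case "iadd":
                return new int[] {2, 1};
            case "iastore":
                return new int[] {3, 0};
            case "return":
                return new int[] {0, 0};
            default:
                throw new IllegalArgumentException("unhandled opcode: " + op);
        }
    }

    /** Offset from fp of operand stack slot k (assumes 4-byte slots below fp). */
    static int slotOffset(int k) {
        return -4 * (k + 1);
    }

    /** For each opcode, the fp offset its result is stored at, or null if none. */
    static List<Integer> resultOffsets(String[] ops) {
        List<Integer> offsets = new ArrayList<>();
        int depth = 0;                       // known statically at every instruction
        for (String op : ops) {
            int[] e = effect(op);
            depth -= e[0];                   // operands are consumed
            if (depth < 0) {
                throw new IllegalStateException("stack underflow at " + op);
            }
            offsets.add(e[1] == 1 ? slotOffset(depth) : null);
            depth += e[1];                   // result (if any) occupies the next slot
        }
        return offsets;
    }
}
```

Applied to the thirteen bytecodes of foo, the pushes land at -4, -8, -12, -16,
-20, -16, -12, -16, -20, -16, -12, which are the offsets used in the listing
that follows.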

foo:    store  fp,(sp)        ; save old frame pointer
        move   sp,fp          ; new frame pointer
        add    sp,#-20,sp     ; make space for operand stack
;aload_0
        load   16(fp),r0      ; fetch a
        store  r0,-4(fp)      ; push a
;iload_2
        load   8(fp),r0       ; fetch i
        store  r0,-8(fp)      ; push i
;iconst_2
        move   #2,r0
        store  r0,-12(fp)     ; push 2
;aload_0
        load   16(fp),r0      ; fetch a
        store  r0,-16(fp)     ; push a
;iload_2
        load   8(fp),r0       ; fetch i
        store  r0,-20(fp)     ; push i
;iaload
        load   -16(fp),r0     ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L1             ; branch if ok
        ... raise exception ...
L1:     load   4(r0),r1       ; array length
        load   -20(fp),r2     ; index
        cmp    r2,r1          ; bounds check
        brltu  L2             ; branch if ok
        ... raise exception ...
L2:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   8(r2),r2       ; fetch a[i]
        store  r2,-16(fp)     ; push
;imul
        load   -16(fp),r0     ; fetch a[i]
        load   -12(fp),r1     ; 2
        mul    r0,r1,r1       ; a[i] * 2
        store  r1,-12(fp)     ; push a[i]*2
;aload_1
        load   12(fp),r0      ; fetch b
        store  r0,-16(fp)     ; push b
;iload_2
        load   8(fp),r0       ; fetch i
        store  r0,-20(fp)     ; push i
;iaload
        load   -16(fp),r0     ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L3             ; branch if ok
        ... raise exception ...
L3:     load   4(r0),r1       ; array length
        load   -20(fp),r2     ; index
        cmp    r2,r1          ; bounds check
        brltu  L4             ; branch if ok
        ... raise exception ...
L4:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   8(r2),r2       ; fetch b[i]
        store  r2,-16(fp)     ; push
;iadd
        load   -12(fp),r0     ; fetch 2*a[i]
        load   -16(fp),r1     ; fetch b[i]
        add    r0,r1,r1       ; 2*a[i] + b[i]
        store  r1,-12(fp)     ; push
;iastore
        load   -4(fp),r0      ; fetch array ptr
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L5             ; branch if ok
        ... raise exception ...
L5:     load   4(r0),r1       ; array length
        load   -8(fp),r2      ; index
        cmp    r2,r1          ; bounds check
        brltu  L6             ; branch if ok
        ... raise exception ...
L6:     sll    #2,r2,r2       ; calculate byte offset
        add    r0,r2,r2       ; add array base
        load   -12(fp),r0     ; fetch value
        store  r0,8(r2)       ; store a[i]
;return
        move   fp,sp          ; pop operand stack frame
        load   (sp),fp        ; restore frame pointer
        ret

61 instructions; 23 loads; 13 stores.


JIT CODE VERSION 2
==================

Cache stack and local variables in registers, spilling (not shown here)
if necessary.

Note: This approach may not buy much on a machine with few registers,
because of greatly increased spilling.

Register assignments:

    Stack registers:           r0,...,r4
    Local variable registers:  r10,r11,r12
    Scratch registers:         r20

In reality, arguments are likely to be in registers already, so those
local variables can just be assumed to be in place.

foo:    store  fp,(sp)        ; save old frame pointer
        move   sp,fp          ; new frame pointer
;aload_0
        load   16(fp),r10     ; fetch a
        move   r10,r0         ; push a
;iload_2
        load   8(fp),r12      ; fetch i
        move   r12,r1         ; push i
;iconst_2
        move   #2,r2          ; push 2
;aload_0
        move   r10,r3         ; push a
;iload_2
        move   r12,r4         ; push i
;iaload
        cmp    r3,#0          ; null check (can often use MMU instead)
        bneq   L1             ; branch if ok
        ... raise exception ...
L1:     load   4(r3),r20      ; array length
        cmp    r4,r20         ; bounds check
        brltu  L2             ; branch if ok
        ... raise exception ...
L2:     sll    #2,r4,r4       ; calculate byte offset
        add    r3,r4,r4       ; add array base
        load   8(r4),r3       ; fetch and push a[i]
;imul
        mul    r2,r3,r2       ; push a[i]*2
;aload_1
        load   12(fp),r11     ; fetch b
        move   r11,r3         ; push b
;iload_2
        move   r12,r4         ; push i
;iaload
        cmp    r3,#0          ; null check (can often use MMU instead)
        bneq   L3             ; branch if ok
        ... raise exception ...
L3:     load   4(r3),r20      ; array length
        cmp    r4,r20         ; bounds check
        brltu  L4             ; branch if ok
        ... raise exception ...
L4:     sll    #2,r4,r4       ; calculate byte offset
        add    r3,r4,r4       ; add array base
        load   8(r4),r3       ; fetch and push b[i]
;iadd
        add    r2,r3,r2       ; push a[i]*2 + b[i]
;iastore
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L5             ; branch if ok
        ... raise exception ...
L5:     load   4(r0),r20      ; array length
        cmp    r1,r20         ; bounds check
        brltu  L6             ; branch if ok
        ... raise exception ...
L6:     sll    #2,r1,r1       ; calculate byte offset
        add    r0,r1,r1       ; add array base
        store  r2,8(r1)       ; store a[i]
;return
        load   (sp),fp        ; restore frame pointer
        ret

40 instructions; 9 loads; 2 stores.


JIT(?) CODE VERSION 3
=====================

Moderately optimized via constant propagation, common subexpression
elimination, and caching of intermediate calculations from array
manipulations. If we need to spill, it's also desirable to have last-use
information to avoid doing reloads.

Use registers r0,... on a greedy basis; reuse a register when its value
has no further use.

Is it feasible to do this level of optimization in a JIT?

foo:    store  fp,(sp)        ; save old frame pointer
        move   sp,fp          ; new frame pointer
;aload_0
        load   16(fp),r0      ; fetch a
;iload_2
        load   8(fp),r1       ; fetch i
;iconst_2
;aload_0
;iload_2
;iaload
        cmp    r0,#0          ; null check (can often use MMU instead)
        bneq   L1             ; branch if ok
        ... raise exception ...
L1:     load   4(r0),r2       ; array length
        cmp    r1,r2          ; bounds check
        brltu  L2             ; branch if ok
        ... raise exception ...
L2:     sll    #2,r1,r2       ; calculate byte offset
        add    r0,r2,r0       ; add array base
        load   8(r0),r3       ; fetch a[i]
;imul
        sll    #1,r3,r3       ; do a[i]*2 via shift
;aload_1
        load   12(fp),r4      ; fetch b
;iload_2
;iaload
        cmp    r4,#0          ; null check (can often use MMU instead)
        bneq   L3             ; branch if ok
        ... raise exception ...
L3:     load   4(r4),r5       ; array length
        cmp    r1,r5          ; bounds check
        brltu  L4             ; branch if ok
        ... raise exception ...
L4:     add    r2,r4,r1       ; add array base
        load   8(r1),r1       ; fetch b[i]
;iadd
        add    r3,r1,r1       ; a[i]*2 + b[i]
;iastore
        store  r1,8(r0)       ; store a[i]
;return
        load   (sp),fp        ; restore frame pointer
        ret

25 instructions; 8 loads; 2 stores.


"HOT SPOT" JIT COMPILATION
==========================

- Ideas derived from SELF project (Stanford, Sun Labs); under development
  for Java now.

- It's ok to spend longer optimizing code if we're confident it will be
  executed often.

- Incorporate profiling into execution engine and observe frequency of
  method calls.

- When a method is clearly popular, devote resources to recompiling it
  with full optimization, and replace it on the fly.

- Careful characterization of "popular" is non-trivial.

- Also incorporate inlining optimizations (extremely valuable).
  Can use profile statistics to estimate most likely target of each
  method invocation and optimize for that case.
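
The profile-then-recompile idea in the list above can be illustrated with a
simple per-method invocation counter. This is a minimal sketch under stated
assumptions, not a description of any real VM: the names (HotMethodSketch,
Method, invoke, HOT_THRESHOLD) are hypothetical, "popular" is reduced to a
fixed call-count threshold (the notes point out that a careful definition is
non-trivial), and "recompiling with full optimization" is stood in for by
swapping in a second, pre-supplied implementation.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch of counter-based hot method detection. */
final class HotMethodSketch {
    /** Assumed threshold; a real system would tune or adapt this. */
    static final int HOT_THRESHOLD = 3;

    static final class Method {
        final IntUnaryOperator optimized;  // stands in for fully optimized code
        IntUnaryOperator current;          // what invoke() actually runs
        int invocations = 0;               // profile data gathered while running
        boolean recompiled = false;

        Method(IntUnaryOperator baseline, IntUnaryOperator optimized) {
            this.current = baseline;
            this.optimized = optimized;
        }
    }

    /** Call m, counting invocations and swapping in optimized code once hot. */
    static int invoke(Method m, int arg) {
        if (!m.recompiled && ++m.invocations >= HOT_THRESHOLD) {
            m.current = m.optimized;       // replace the method on the fly
            m.recompiled = true;
        }
        return m.current.applyAsInt(arg);
    }
}
```

With the threshold of 3 assumed here, the first two calls run the baseline
code and the third and later calls run the optimized version; callers see the
same results either way.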