INTERPRETER PERFORMANCE (2)
===========================

Next, what about speeding up the dispatch code?  First of all, what
does it look like now?  The following shows SPARC code corresponding
to the dispatch loop of interp.c:

----------
        %l1 = pc
        %l6 = table

top:    ldub    [%l1], %o1
        and     %o1, 0xff, %g1
        cmp     %g1, 200
        bgu     undefined
        nop
        sll     %g1, 2, %g1
        ld      [%l6+%g1], %o4
        jmp     %o4
        nop

table:  .word   nop_snippet
        .word   aconst_null_snippet
        .word   iconst_m1_snippet
        .word   iconst_0_snippet
        ...
        .word   undefined
        ...
        .word   goto_w_snippet

nop_snippet:
        b       top
        add     %l1,1,%l1       ! delay slot

aconst_null_snippet:
        ...
        b       top
        add     %l1,1,%l1       ! delay slot

undefined:
        ! issue error and die
---------------------

Performance problems:
 - bounds check
 - two jumps per dispatch

First fix: (indirect) threaded code:
 - If we can do our own indirect jumps, we can get rid of the check
   and replicate the dispatch.
 - Not possible in ANSI standard C, but can be done using GCC's &&
   (label-address) operator -- see interp_x2.c.
 - Improves performance somewhat, partly due to improved utilization
   of the branch target buffer.

(Side question: how important is portability for an interpreter?)

Second, better fix: *direct* threaded code.
 - Idea: replace each bytecode by the address of its snippet!
 - The dispatch table disappears.
 - Dispatch is done by goto **pc; at the end of each snippet.
 - This improves performance by saving a fetch.
 - BUT: notice that we don't have the original bytecode any more.
 - Must rewrite before execution.
   (Query: what to do with the parameter bytes following opcodes?)
 - Since we're doing that, there are *many* opportunities to clean
   things up:
   -- short-circuit constant pool references
   -- combine code for similar opcodes
   -- do static stack caching
   -- etc.

If we *really* want to improve dispatch times, how about doing fewer
dispatches?
 - Idea: *combine* adjacent instructions into "macro" or
   "super"-instructions.
 - Example: LOAD BIPUSH ADD STORE
   (Oh look: this is already the IINC instruction!  But we can do the
   same for other combinations.)
We can do this:

 statically: for a large group of programs, with a fixed target VM.
  - (Note: the VM can be constructed semi-automatically from snippets.)
  - We have the opportunity to optimize across snippet boundaries when
    compiling the VM.
  - We can use dynamic traces from a workload to make static decisions.

 statically: for a single program, sending the encoding with the
 program.
  - (Note: this should also decrease the size of the program!)

 dynamically: build "superinstructions" on the fly from snippet code.
  - (See the paper by Piumarta & Riccardi.)