SquirrelFish
If you’re reading this, chances are that you already know about SquirrelFish, Appl/WebKit’s new Javascript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, 46% faster than Spidermonkey and 52% faster than TT (Tamarin Tracing) on SunSpider.
Clearly we have some work to do. The plan is to improve TT so that hot loops run highly optimized native code; TT’s optimizer is in the early stages, and we think there’s a lot of room for further optimization. For example, as explained in my previous posts, when TT jumps from one trace to another, it has to save the interpreter state to a standard format and then reload the state on the new trace, and no cross-trace optimization is possible. Ideas like trace folding have potential for a big improvement here.
But I still wanted to take a quick peek into why SquirrelFish’s interpreter is so fast. The announcement touts 3 key design features:
- Using a bytecode interpreter.
- Using direct-threaded interpreter dispatch.
- Using a register-based bytecode interpreter.
Bytecode interpreter. The old WebKit JS was based on an AST walk, which is explained in some detail in the announcement. It’s well-known that bytecode interpreters are faster, so it’s no surprise Apple made this change. Spidermonkey and TT’s interpreter have always been bytecode interpreters.
Direct-threaded dispatch. Interpreters tend to spend a lot of time dispatching bytecode operations. Direct-threaded dispatch is a technique for efficient dispatch.
The obvious way to write a bytecode interpreter is with a switch statement inside a loop:
void run(Bytecode *ip) {
for (;;) {
switch (*ip++) {
case OP_ADD: …
case OP_JUMP: …
Each iteration runs one bytecode instruction. Each case of the switch handles one instruction type. It really doesn’t look like there’s any room for improvement here unless you look at the assembly code generated for the switch dispatch (from a tiny test interpreter I wrote today, comments added by me):
# ip is in %edx
# Check that switch expression (%edx) is in table range
cmpl $9, (%edx)
leal 4(%edx), %ecx
ja L26
# Look up case address offset for (%edx) in table (L37)
movl (%edx), %eax
movl L37-”L00000000006$pb”(%ebx,%eax,4), %eax
# Add base address to offset
addl %ebx, %eax
# Indirect jump to computed address
jmp *%eax
The basic idea is to keep a table of relative offsets to the cases, and then jump using that offset. Because the switch expression could evaluate to anything, the compiler must first generate a range check, so that if the switch expression doesn’t map to any case, the program leaves the switch instead of crashing unpredictably.
But in reality the range check is useless, because the interpreter can control what bytecodes actually appear, so this is 3 wasted instructions. Also, the lookup and base+offset address computation seems kind of clunky. I’d prefer dispatch code something like this:
jmp *%edx
This is direct threading. In principle, the idea is simple: the instruction code (e.g., OP_ADD) is the address of the case target code. (In the basic design, instruction codes are integers.) Then, this jump is all you need for dispatch.
Coding up direct threading is weird; normal C compilers don’t know how to do it. But it can be done with GCC’s computed goto extension (see the paper on direct threading from the SquirrelFish announcement). See also my tiny interpreter.
I believe TT and Spidermonkey use an intermediate design called indirect threading, which gets most of the speedup of direct threading, but allows integer opcodes. The opcode is an index into a table of case target addresses. So the dispatch code has to look up the case target, then jump, something like:
leal TABLE(%edx,4), %eax
jmp *%eax
Not too bad. I have no idea how significant the difference between direct and indirect threading is in practice, but even a few percent speedup would be great.
Register-based interpreter. “Traditional” interpreter design keeps all operands and intermediate values of expressions on a data stack. (This is completely different from the call stack.) I can think of a few design reasons why this might be, but I really don’t know why it’s been done this way. Anyway, with a stack, bytecode for a line of code like “x = a + b + c” looks like this:
LOAD a # push a onto stack
LOAD b
ADD # pop 2 elements, add them, push the result
LOAD c
ADD
STORE x
One nice thing about this is that the algorithm for generating this bytecode is very simple, so it’s not a lot of work to code and runs fast. But there’s a lot of code just to manipulate the stack here, and it might be nice to avoid it. (Real stack-based programs have even more stack manipulation, with DUP, SWAP, ROT, etc. operations.)
A register-based design avoids the stack by using a fixed array of “slots” or “registers” for operands. (These registers are completely different from assembly code registers.) The register-based version of the above line of code would be something like:
ADD temp, a, b # temp = a + b
ADD x, temp, c
Keep in mind that in the register-based bytecode, where I have “x” it would actually have something like “0″, if x had been assigned slot 0 in the register table. Note also that the bytecode generator has to analyze the code a bit to decide how many registers are needed and assign them to the variables and temporaries.
Each design has advantages and disadvantages:
- Bytecode code size. Stack-based bytecode programs tend to be smaller, because the operands are implicit. But the register-based program gets to omit the stack manipulation instructions, so the stack-based bytecode should be only a little bit smaller. The memory savings shouldn’t matter in most environments, but reading less bytecode will save the interpreter a bit of time.
- Bytecode instruction count. As shown above, a register-based program will have fewer bytecodes. Fewer bytecodes to execute means faster run times, if it takes the same length of time to execute a bytecode in each design. Which it doesn’t:
- Operand access. To access operands, the stack-based interpreter just reads and writes from the top of the stack or very near it. A register-based interpreter needs a two-step process: first it has to get the address of that register, then read or write it. But the register-based version doesn’t need to adjust the stack. Still, it seems like more work per instruction for the register version: for ADD, the register version needs to compute 3 operand addresses, while the stack version just one (the new stack pointer).
Overall, when you look across instructions, I think the total amount of operand addressing computation is about the same. The stack version distributes it across more instructions. The stack version might be able to save a few by having special instructions like LOAD0 (to load slot 0, needing no computation). But the register version still has few instructions, so fewer dispatches. With direct threading, the dispatch is fairly efficient, so this is less important, but an indirect jump is still usually pretty expensive compared to a normal instruction because of the high branch mispredict probability.
Microbenchmarking. I wrote a tiny interpreter that runs a bytecode program for an empty for loop using doubles as numbers (as in JS) to test out these things for myself. I tried 4 design choices: direct threading vs. switch and stack vs. register. The times for 10M iterations in milliseconds:
| Stack | Register | |
| Switch | 230 | 90 |
| Direct | 105 | 55 |
Here you see huge differences. The reason the differences are so big (far bigger than WebKit got) is that my tiny interpreter’s opcodes are very simple. In a real JS interpreter, the code to run for the average operation is a lot longer, so dispatch overhead is a much smaller fraction of total time, and dispatch overhead is the main thing saved by direct-threading and registers.
Note that the speedup of register vs. stack is smaller in the direct-threaded case, as I predicted above. But not that much smaller, because the dispatch really is expensive compared to everything else in this simple interpreter.
You can get my microbenchmark code here.
Posted: June 3rd, 2008 under Uncategorized.
Comments: 22
Comments
Pingback from Safari’s New JS Interpreter: SquirrelFish | Robert Accettura’s Fun With Wordage
Time: June 3, 2008, 6:33 pm
[...] David Mandelin has some analysis and comparison to the Mozilla work being done on his blog. [...]
Comment from Rob Mueller
Time: June 3, 2008, 6:51 pm
One thing I noticed your direct threaded interpreter does is this:
#define ENDCASE COUNT_INCR goto **ip++;
That puts the indirect jump at the end of every VM instruction. Because your stream is so small, and most instructions effectively dictate the next one (eg JGE always comes after LOAD1), this should result in MUCH better branch prediction on most modern CPUs with indirect branch predictors (eg CORE architecture ones - http://arstechnica.com/articles/paedia/cpu/core.ars/7). At least I think that’s true
It would be interesting to see what the difference is changing this to:
#define ENDCASE COUNT_INCR continue;
Rob
Comment from Chris Double
Time: June 3, 2008, 6:54 pm
You can find some performance comparisons of direct vs indirect threading here:
Comment from Shawn Wilsher
Time: June 3, 2008, 7:03 pm
So, I guess the question is, would spidermonkey/actionmonkey/TT benefit from going from stack to register-based bytecode?
Comment from wonderer
Time: June 3, 2008, 9:42 pm
I wonder why don’t Mozilla just use Webkit’s Javascript Interpreter? Licensing (can be it sorted out with Webkit copyright holders) or it’s about politics? Since Mozilla 2 is still making it’s first steps so why not pick it up instead of TT?
Comment from CH Gowri Kumar
Time: June 3, 2008, 10:21 pm
Here is a good reference which talks more about threaded-code and also how to implemeet using gcc c’s label as values feature:
http://www.complang.tuwien.ac.at/forth/threaded-code.html
Comment from Bill McCloskey
Time: June 3, 2008, 10:33 pm
Hi Dave,
This is all pretty intriguing. How does this SunSpot test actually work? I tried it out and it seems like each test runs for only a few hundred milliseconds, but the tests are repeated in a loop. Does Tamarin cache compiled code, or does it have to recompile each time a given test is run?
I imagine that runtimes in the hundred millisecond range are actually pretty typical in web pages. In that case, for a JIT approach to be profitable it seems like you’ll need pretty aggressive caching of compiled code. Is something like this in the works?
Overall, it seems like there’s much to be learned from what happened to Java. The fact that JVMs typically never cache compiled code means that they have horrible startup times and the end result is that everyone hates Java. Using an interpreter on cold code can ameliorate the problem somewhat, but as code size grows it will never be enough. I’m not sure if JavaScript programs will ever approach the size of the Java class libraries, but they’re definitely going to get big. Unlike most interpreted languages like Python/Perl/Ruby, which cheat by writing lots of library code in C, you guys have the same security considerations as Java, so the libraries will have to be in JS.
SquirrelFish may be doing better in the short term, but in the long terms it’ll never be able to compete with well-optimized code compiled ahead of time [hopefully with type tagging specialized away dynamically a la Psyco :-)].
-Bill
Comment from san
Time: June 3, 2008, 11:45 pm
Nice one. Webkit says that considering that they have now moved on to a byte code interpreter, they are planning on doing a lot of optimizations on the VM level. What kind of optimizations does TT already have considering that they have been using byte codes for quite some time?
Comment from Hovik Melikyan
Time: June 4, 2008, 4:35 am
Looks like direct threading is possible without any assembly programming: make sure the switch control variable is unsigned char and then declare precisely 256 case labels. The compiler will omit range checking then. There will be one extra instruction though, that extends the opcode from 8 to 32 bits before selecting the jump destination from the table.
Comment from localhost
Time: June 4, 2008, 4:38 am
Great Article!! Thoughtful and clearly explained.
Comment from Hovik Melikyan
Time: June 4, 2008, 4:40 am
Re: stack vs. register approach. I think the stack version can be made a bit better by introducing a variation of STORE that doesn’t actually pop the value from the stack, so that it can be re-used in the next expression if necessary (copy propagation). For example:
b = a;
c = a;
translates to:
LOAD a
STOREC b
STORE c
Comment from Johan Tibell
Time: June 4, 2008, 5:09 am
In your microbenchmark code you define the program locally but in a real interpreter it would have to be passed to the run function as a parameter. However, it isn’t possible to refer to the goto labels outside the function in which they are declared so what would an efficient transformation from some intermediate format to your itype[] format look like? Also, you use sizeof(void *) sized opcodes. Would the above mentioned intermediate format use sizeof(char) sized opcodes instead?
Pingback from Javascript News » Blog Archive » SquirrelFish: Technical excitement
Time: June 4, 2008, 5:11 am
[...] David Mandelin of Mozilla, who has been doing a great job discussing Tamarin internals, shared his look at SquirrelFish and the improvements: Using a bytecode interpreter, Using direct-threaded interpreter dispatch, [...]
Comment from Hovik Melikyan
Time: June 4, 2008, 5:40 am
One more thing… I think type inference might be very useful, if it’s what I think it is - determining exact types of variables at compile time. At least within one function this kind of optimization might help eliminating unnecessary typecasts for each opcode that deals with data.
Comment from surana
Time: June 4, 2008, 6:06 am
You should read this paper: http://portal.acm.org/citation.cfm?id=1059584
And this paper: http://portal.acm.org/citation.cfm?id=277743&coll=GUIDE&dl=GUIDE&CFID=3672489&CFTOKEN=45721748
The first paper claims ~13% better than direct threading. The second claims some numerical operations are only 30% slower than C.
Pingback from Dave Mandelin of Mozilla explains Squirr … « Random Trap
Time: June 4, 2008, 9:16 am
[...] javascript Dave Mandelin of Mozilla explains Squirrelfish, WebKit’s new JS Interpreter: http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/ [...]
Comment from Matthew
Time: June 4, 2008, 4:08 pm
Interesting stuff.
Might another advantage of the register-based approach be that it’s easier to map to register-based CPU architectures when JIT compiling bits of bytecode?
I dunno, I’m asking ![]()
Comment from Andreas
Time: June 5, 2008, 7:35 am
Register-based bytecodes are not necessarily easier to compile, especially if you are trying to compile with a very fast and simple JIT. The stack semantics implicitly tells you when temporary values go dead (and the associated register can be de-allocated), wheres in case of registers you need proper liveness analysis.
Comment from SD
Time: June 13, 2008, 12:01 am
wonderer - The reason Mozilla are still developing their own JS engine is simple - Mozilla’s JS engine supports a lot of stuff that Safari’s does not. Type annotations, for example, or all kinds of other ECMAScript 3 and 4 features. The entire UI of Firefox is written using JavaScript, and it makes heavy use of all those features that would be missing with Safari’s JS engine.
Comment from chad
Time: June 16, 2008, 4:18 pm
some non-technical trivia: There is a bootleg store in China called “Squirrel–shaped Fish”. In China it’s standard procedure for stores to illegally use names and logos from large international brands. With this store, they stole the Lacoste alligator logo, but made their own name. Pretty genius.
Pingback from Inline threading, TraceMonkey, etc. :: David Mandelin’s blog
Time: August 27, 2008, 6:57 pm
[...] Basically, this is yet another interpreter opcode dispatch technique. I previously wrote about opcode dispatch, concluding that direct-threading, in which the opcode is a target address, and the code to start [...]
Pingback from Daniel Shorten » Chrome:
Time: September 12, 2008, 9:35 am
[...] David Mandelin’s blog » SquirrelFish Details about a highly optimized JavaScript engine. [...]
Write a comment