Main menu:

Site search

Categories

Archive

Squirrelfishing regexp-dna.js

Recently I’ve been trying to figure out exactly how SquirrelFish Extreme (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, (WebKit Regular Expression Compiler) WREC provides most of that 5x, but there are some weird twists to the story. My main conclusions: WREC indeed makes for a good regex engine, but also, WREC bails out of any regex with parentheses, and regexp-dna.js contains a bug that favors SFX over SpiderMonkey.

Update: The bug in regexp-dna.js has been fixed in WebKit! Thanks, WebKit team! That was fast. The new code:

for(k in subs)
dnaInput = dnaInput.replace(k, subs[k])
// FIXME: Would like this to be a global substitution in a future version of SunSpider.

Technical details: the design of WREC. There are two main ways to implement regular expressions: using a backtracking matching engine, or by transforming the regex to a finite automaton (NFA, aka “state machine”), which does not backtrack. Most Perl-type regex engines, including both SpiderMonkey’s and WREC, follow the backtracking design. I don’t know the exact history of that choice, but at present it is much easier to implement features like group capture and backreferences in the backtracking design. Also, although some regexes scale only if implemented as NFAs, my tests suggest that many simple regexes, including those in SunSpider, are faster with backtracking.

As of this writing, WREC’s implementation strategy is dirt simple (which is a good thing). There are no transformations or fancy optimizations on the regex. WREC simply generates native code that directly implements the backtracking search. Thus, within a single match operation, there are no function calls, no traversals of regular expression ASTs, and few option tests, so almost all of the overhead is eliminated.

WREC’s code is very easy to read, so if you want to know exactly how it works, just read it in WREC.cpp. It’s also great example code for anyone implementing a compiler for a simple language like regular expressions. The basic plan is to parse the regular expression with functions named things like parseDisjunction (the | operator). Those functions directly call functions like generateDisjunction that generate the native code using the same assembler that the call-threading interpreter uses. There’s also the oddly named “gererateParenthesesResetTrampoline”. Inexplicably preserved typo, or watermark to detect copying of WREC code?

An an example, for the regex /a|bb/, the generated native code looks something like this:

curPos = start
if curPos >= textLength goto CASE2
if text[curPos] != ‘a’ goto CASE2
curPos += 1
goto DONE_MATCHED
CASE2:
curPos = start
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
goto DONE_MATCHED

As you can see, the generated code does almost nothing beyond what is absolutely needed to implement the matching correctly. Also, curPos, textLength, and the start address of text are all kept in fixed registers. Based on a variety of microbenchmarks, I suspect that the critical path for this code is simply the unavoidable work of reading all the characters in the text, so the code is near maximum efficiency.

Backtracking in WREC. One tricky part of a backtracking regex engine is the backtracking. Consider matching the regex /a*a/ against the text ‘a’. First, the engine will greedily match /a*/ with ‘a’, leaving a remaining regex fragment of /a/ and a remaining text of the empty string. When the engine tries to match /a/ against ”, it fails. Now it must backtrack, specifically by going back to the /a*/ step and finding the next longest match. The next longest match is ”, leaving a remaining regex fragment of /a/ and a remaining text of ‘a’, and the engine will finish with a successful match.

I was curious how WREC implemented backtracking. For example, with /a*a/ and a text of a million ‘a’s, I wanted to know if WREC saves all one million shorter matches in case it needed to backtrack them (which seems like a waste of space), or if WREC recomputes them if needed (which could be slow if a lot of backtracking needs to be done). I found that WREC generates code like this for a quantified regex a* (I reordered things to make it easier to read, but the logic is the same):

// First, greedy match
repeatCount = 0
while (next char is ‘a’) {
curPos += 1
repeatCount += 1
}
while (true) {
// Now, try to match the rest
save curPos
… code to match the rest of the regex …
if (rest of regex matches) goto DONE_MATCHED
// backtrack
restore curPos
if curPos <= 0 goto DONE_FAILED
curPos -= 1
repeatCount -= 1
// loop around and try to match the rest from here

}

So the answer to my question is that WREC doesn’t save the intermediate matches, it just keeps track of the length of the text matched against /a*/ (repeatCount in my code) and decrements the position and count in order to backtrack. This is great, because it uses only two words of memory for state (instead of N million) but can find the next longest match very quickly (just 4 instructions).

But this left me wondering, how does WREC handle something like /(a|bb)*x/, where backtracking needs to go back a variable number of characters? Here’s the code that replaces ‘repeatCount-1′ with the backtracking logic needed here:

void GenerateParenthesesNonGreedyFunctor::backtrack(WRECGenerator*) {
// FIXME: do something about this.
CRASH();
}

Interesting. This function isn’t called anywhere, so it’s not a crashing bug in WebKit. The reason it’s not called:

bool WRECParser::parseParentheses(JmpSrcVector&) {
// FIXME: We don’t currently backtrack correctly within parentheses in cases such as
// “c”.match(/(.*)c/) so we fall back to PCRE for any regexp containing parentheses.
m_err = TempError_unsupportedParentheses;
return false;
}

Very interesting. And on my machine, dna-regexp.js runs in about 43 ms (in jsc, the SFX command-line shell), but if I add an empty pair of parens to each regex in the test, it takes 206 ms. Breaking out the time for regexp-dna.js regex matching only (i.e., not including the string building and replace calls), with the parens SFX is 15% slower than SpiderMonkey’s, which is actually still pretty good.

It just so happens that SunSpider’s regexp-dna.js doesn’t use parens in any of its regexes.

But the web does, of course. To find out how common parens are, I instrumented a Firefox build to record regexes being run on web pages and opened some popular pages. Firefox processed 830 unique regexes, of which 287, or just about 1/3, contained parens.

String.replace. The other major portion of regexp-dna.js tests String.replace. Again, SFX is about 5x as fast as SpiderMonkey and I wanted to know why. The calls to replace in regexp-dna all look like this:

dnaInput.replace(’B', ‘(c|g|t)’, ‘g’)

This is supposed to replace all occurrences of ‘B’ with ‘(c|g|t)’. The search string is just ‘B’, so no regex processing is needed, just simple string searching. So I started reading WebKit code in StringPrototype.cpp. I saw that there is a special case fast-path for a non-regex search string, which makes sense. In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard. Testing SFX:

> ‘aba’.replace(/a/, ‘x$&x’)
xaxba // good: $& is the matched text
> ‘aba’.replace(’a', ‘x$&x’)
x$&xba // oops

I also couldn’t find out how the ‘g’ flag was implemented, or even any code to read the third “flags” argument at all. Testing SFX again:

> ‘aba’.replace(’a', ‘x’, ‘g’)
xba

Hmmm. So the ‘g’ flag is not implemented at all. That’s OK, it’s not part of the ECMAScript standard, it’s just a SpiderMonkey extension. But that means the benchmark contains a flag that just serves to make SpiderMonkey do a lot more work than SquirrelFish. With that flag removed, so that both programs are doing the same replacement operation, the performance difference goes from 5x to 1.5x. Hopefully SpiderMonkey will have a similar fast path soon, which should bring that to parity.

But I think it’s pretty clear regexp-dna.js has a bug. Either the ‘g’ should be removed, or it should be recoded to do the global replacement in vanilla ECMAScript.

Final words. Based on its performance on the regexes it does handle, WREC is indeed an awesome design. regexp-dna.js, however, is flawed and exaggerates SFX performance.

We could use nanojit to make a regex compiler for SpiderMonkey that would perform as well as WREC. But I don’t know if it’s worthwhile yet. Regex performance is much less important for today’s web than it is for SunSpider–I hope to link to a report on that in a future post.

Static analysis newslets

I’ve been in interpreter-land lately, but I do help out a bit with static analysis projects when I get the chance. So I’d better post an update here on some interesting developments that haven’t been publicized yet.

First, Keith Schwartz (one of our interns) is making great progress on automatic const-correctification for Mozilla. The basic idea is to put const on as many declarations as possible without breaking the code or introducing casts. Keith has devised an algorithm based on type inference. Currently, he’s working on the Treehydra code to extract the type constraints from code. Because C++ is so complicated, there are a ton of details, and he’s had to master the insanity of GCC intermediate representations of pointers to member functions and calls through them. (If by chance, any readers know of a non-insane way to access them in GCC, let us know.)

Second, the Cairo folks were kind enough to give me an hour to talk about static analysis with the Hydras at their recent meetup. They already had 2 static analysis applications in mind. One was ensuring that internal-only return codes aren’t returned from the public API. The other was checking that integer and fixed-point values, both represented by C ints, aren’t mixed together. I think both of these can be formulated as type checking or inference problems.

I’m hoping we can extract a generic Treehydra type inference library from Keith’s code for the Cairo problems and others. One issue here is that Cairo is C, while Keith’s been using the C++ ASTs for his work. I don’t even know if Treehydra can read C ASTs at this point, but I think Treehydra’s design makes extending to C not too hard.

Chrome: First Look

I’ve been trying out Chrome on my Vista machine at home, and I read a few hundred comments on Chrome last night on Digg. Here are my first thoughts, just as a browser user and an engineer.

Responsiveness. The most noticable things right away are that the UI is minimalistic and very responsive. The screen responds promptly to clicks, and UI elements appear, disappear, and move around very quickly. Responsiveness makes any tool (or toy) more enjoyable to use. So I like that a lot.

AFAIK, Chrome’s UI is straight C++, so that plus the minimalism should make it very fast. Firefox, of course, has a bigger UI coded in XUL/JS, which gives FF more features and excellent extensibility. I would never want to see FF give up extensibility, but I would sure like a snappier FF. I have no idea what that would take; speeding up JS should help, perhaps some sort of XUL tuning or overhaul is needed as well.

funny picturesProcess Isolation for Tabs. Less technically known as process-per-tab. This looks great. Diggers love the idea as well. Personally, I want it so that one bad web page or Flash app doesn’t hang my whole browser. That seems to be what the Digg users like best about Chrome too. There’s also reliability-one really bad page won’t crash the whole browser, although FF crashes are rare for me anyway and the state restore option makes crashing not too painful. And security, so that a maliciously bad page can’t overwrite memory on anything else. I haven’t used Chrome enough yet to see these situations come into play, so I don’t know it will work in practice, but for now I’ll assume they got it right. There’s still the question of whether the Windows schedulers give enough isolation to maintain performance, too.

Anyway, I want it.

There is a tradeoff: Chome seems to use more memory, at least with several pages open. (I tend to run with 20-80 pages open at a time.) Chrome actually tracks the memory nicely. Type about:memory into the location bar, and you get a summary of memory for all your open browsers, including FF if it’s running. Very cool.

<technical-gobbledygook>Memory consumption on modern OSs is complex enough so that it’s difficult even to define clearly. Chrome’s “task manager” (press Shift+Esc) is reporting the private working set size. This is the same number Vista Task Manager displays by default. The private working set size is the amount of physical RAM allocated for the exclusive use of a process. (The full working set includes physical RAM accessible to to the process but shared with others. The term “working set” comes from the OS’s allocation strategy, which tries to allocate just enough physical RAM to cover the virtual memory the process touches frequently, which is called the “working set”.) about:memory also shows the virtual memory allocation (I think Windows calls it the “commit size”), which is the “true” (ignoring a few complexities, I’m sure) amount of memory used by the process. I think that for an active app, the total allocation is more relevant, because either it will be in memory, or frequently paged in and out, bogging everything down. For paused background stuff, the working set can be more important, but if it’s paged out, the total size will determine how long it takes to come back up when you ask for it. On my system, Chrome was using more memory by both metrics, but I didn’t use Chrome heavily enough to see any serious paging or thrashing, so I don’t know how it plays out in practice yet.</technical-gobbledygook>

The extra memory consumption might make Chrome less suitable for mobile devices. So process per tab should be an option for FF, but not required. Also because not everyone will want it on their non-mobile devices.

Ideally, I think process, thread, and DOM tree (i.e., tab) should be orthogonal concepts in the browser platform. The DOM tree is a unit of visual display and window control. The process is a memory protection domain. The thread is a unit of sequential execution. The user and the browser would control the policy of how many processes and threads are created and how they are assigned to tabs.

As an example of what you could do with this, you might have a default policy where each tab is a process, with each separate script or plugin being a separate thread inside that process. Thus, the tab is protected from other tabs, and its dynamic elements can run concurrently. But now say that one of those scripts loops. The other scripts might still be running OK, but with less CPU because of the looping script. If you want, you could shut down just this tab, and leave the other tabs unchanged. It’s dangerous to kill just one thread, because it can leave the memory inside the process inconsistent. But at worst, only this tab would crash. If it did, you could reload the page with a policy giving each dynamic element its own process. At this point, you could kill just one element on the page while leaving the rest running.

A more exotic example would be to assign two processes to one tab. This seems weird, but it would be possible if the 3 concepts were orthogonal. One use would be to allow two other elements (tabs or scripts) to interact closely with a element. Interprocess communication is expensive, but as long as the other two elements use the shared element one at time, we can just map the shared item into both processes, so they can access it relatively quickly.

Of course, that’s a lot of work. I hear there’s also some problem where the MacOS Core Graphics APIs make it harder to implement multiple processes for one browser window than it is on Windows.

Awesome Bar and Places. Chrome has some version of FF’s Awesome Bar and Places. Which is a good thing-at this point I would find it painful to use a browser that didn’t. I think the location bar has some Googlyness in it too, like Google Suggest or something. I started typing the names of some of my favorite blogs, and it got the popular ones after a few letters even though I’d never visited them before in Chome. That was pretty nice, but I imagine after you’ve used the browser for a while, you have visited your favorites anyway, so it won’t matter too much. But I use and love FF’s Awesome Bar so much, I think I would like having search/suggest even more integrated with it. It’s a small thing, but it could make the browser just that little bit more responsive and direct.

So far I’ve found the bookmarks interface kind of weird. It bothers me that I either have no bookmarks menu at all, or else an ugly toolbar strip across the top of the browser. But I think it’s good they’re trying something new-I think Places is great but will continue to improve. Personally, I use the Awesome Bar for all navigation to a single site. I only use the bookmarks menu to open a group in tabs (e.g., morning blog reading), or to bring up a group in the Ctrl+B view and click on various items in turn (e.g., Youtube playlist). So there’s probably some mystical third design (not to mention 17 existing FF extensions) that would be better for me.

I’m really not excited by tabs on top of the window, or the thumbnail view of open pages. That stuff doesn’t work when you have 26 tabs open.

V8, blah blah blah. V8 is the other hot topic on Digg. I’ve heard Zimbra runs really well on Chrome, but I personally haven’t done anything with it as a user where I really noticed a difference. TraceMonkey vs. V8 has been covered elsewhere, so I’m not going to bother with that. Plus all these VMs are pretty early, so I’m not terribly excited about day-by-day benchmarks.

What I am immediately interested in is what techniques the V8 team actually used, how it relates to my inline threading project, and what ideas I might be able to steal from it. But I’ll have to read the code and report on that another day.

Inline threading, TraceMonkey, etc.

It’s been a long time since I’ve posted here-I wanted to post some interesting results about speeding up SpiderMonkey using inline threading, but it turned out to be really hard and took a long time to get close enough to “interesting results”. At last, my patch is good enough to run SunSpider, and runs it 8% faster than baseline (non-tracing trunk SpiderMonkey from a few weeks ago), 10-20% faster on favorable benchmarks, and 48% faster on 3bit-bits-in-byte. So that’s pretty cool.

Of course, 8% looks pretty puny next to the huge gains of TraceMonkey (congrats to those guys, by the way). But I’m assured that interpreter speedups still count, so I’m chugging along. (Side note: inline threading speeds up SunSpider’s access-fannkuch benchmark by 22%, which has proved difficult to optimize with tracing.) (Side note 2: I’ve been told that TraceMonkey will hugely speed up our static analysis scripts, maybe 10x or so, which is great news.).

Insane, gory detail on inline threading, related optimizations, and detailed performance analyses can be found in bug 442379. I thought I’d go over the key ideas in this post:

Inline threading. Basically, this is yet another interpreter opcode dispatch technique. I previously wrote about opcode dispatch, concluding that direct-threading, in which the opcode is a target address, and the code to start the next op is a single indirect jump instruction is the ultimate efficient dispatch mechanism. It turns out I was wrong.

Inline threading is the “best”, because it gets dispatch down to 0 instructions. The idea is to create a buffer, and copy into it the native code for each opcode to be executed. For example, for a function body like “return a+3″, the opcodes are: JSOP_GETARG (0), JSOP_INT8 (3), JSOP_ADD, JSOP_RETURN. To inline thread this, we create a buffer and fill it with native code like this, using memcpy:

code for JSOP_GETARG
code for JSOP_INT8
code for JSOP_ADD
code for JSOP_RETURN

It’s like a really crude form of JIT compilation.

To run the function, we just jump to the start of the buffer, and then it all runs, with no further dispatch code. I’ve found that an average SpiderMonkey op executes about 35 instructions, so inline threading removes the 4 instructions for indirect threading, reducing this to 31, and should speed up SM by about 11%. Nice!

For more info on inline threading, see SableVM and this paper.

Hard Stuff. The only problem is that what I just described doesn’t actually work. For one, the compiler doesn’t necessarily compile each SpiderMonkey op handler into a single block of code. In fact, it usually reorders things a bunch, to help with code locality and reduce the number of jump/branch instructions executed on hot paths. So those are too hard to inline. (It could be done by disassembling parts of SpiderMonkey, analyzing the results, and doing code layout again, but that’s a bit much.)

Also, because most jump, branch, and call instructions express their targets using a relative address (an offset from the IP of the following instruction), any jump outside our little inline-threaded buffer becomes a jump to hell. So those would all have to be identified and patched, again with a dissassembler.

In general, small ops that don’t call functions, generate errors, or have much internal control flow can be inlined, but everything else can’t. Fortunately, “small ops” include some really common ones like JSOP_GETVAR and JSOP_POP, but the technique can’t be used without some special treatment for the big ops.

Call Threading/Context Threading. For big ops, I used something called call threading or context threading. Call threading has been used to produce big speedups on some interpreters, but turns out not to help at all for SpiderMonkey. But it can handle big ops in an inline-threading system, and after a lot of work, I at least got it to not slow down SpiderMonkey.

The idea of call threading is to create a native buffer, but fill it with calls to the opcode handlers instead of copying the whole handlers in. With the previous example, it gives you:

call &JSOP_GETARG
call &JSOP_INT8
call &JSOP_ADD
call &JSOP_RETURN

Those are x86 call instructions. For this to work, the opcode handlers have to end with ‘ret’ instructions. This means dispatch is 2 instructions, which is better than indirect threading’s 4, and I think better even than direct threading because that’s really 3 instructions if you count getting the op, which I should have before. Also, these call and ret instructions are highly predictable (99.9%+ prediction rate), unlike the indirect jumps used by direct and indirect threading, which are very unpredictable. Since a mispredict is very costly (~16 cycles on Core 2, I think), this gives a big speedup, 30% or so on some interpreters.

Except on SpiderMonkey, where unfortunately it doesn’t work and also slows things down. This was very frustrating.

The reason it doesn’t work is that when you execute a ‘call’ instruction, you push the return address onto the stack, decrementing the stack pointer ($esp) by 4. The opcode handler will then crash if it calls a function. There are actually several reasons why but the most important is that on most systems, the stack pointer has to be 16-byte-aligned when you call a function, and that -4 puts it off.

To make it work, I add extra code to unpush the return address right after the call, and then unpop it back right before the return. This works, but now we’re up to 4 instructions, so we haven’t saved any instructions over indirect threading. (I would love a better solution, but I couldn’t think of one.)

Next, it turns out that in SpiderMonkey, due to clever optimization by Igor & Co., the branch prediction rate is already excellent in practice: 80-100% on benchmarky-type programs. (The trick is make sure there is a separate indirect jump going out of the end of each opcode, so the processor can predict each one independently. Also, the Core 2 indirect branch predictor seems very smart.) So branch mispredicts are only costing an average 0-3 cycles per op. SpiderMonkey takes about 28 cycles per op, so this gives an estimated speedup of 0-12%.

Last, the really tragic thing is that the changes I made to make SpiderMonkey do call threading make GCC shoot itself in the face. For reasons I don’t entirely understand, GCC compiles the op handlers “differently”, so they run at least 5% slower. It took some doing to figure out how to keep the slowdown down to 5%, which makes call threading about even with indirect threading. That’s pretty disappointing, but at least it means I can call thread the big ops without penalty (on average), and then inline the small ops, getting some speedup. With the example code, I generate:

code for JSOP_GETARG
code for JSOP_INT8
call &JSOP_ADD
call &JSOP_RETURN

(The optimizer issue is incredibly arcane, but I think I traced the problem to an optimization pass called “gcse2-post-reload”, which is really partially-redundant-load elimination after register allocation. Any kind of partial redundancy elimination (PRE) can slow code down. The slowdown effect should be mitigated when using profile-guided optimization (PGO), but I couldn’t get GCC PGO to work on SpiderMonkey to test that theory.)

For more on call threading, see this paper, keeping in mind that the results would probably differ on Core 2 because of its presumed better indirect branch predictor.

Inlining-enabled optimizations. Inline threading some stuff and call threading the rest doesn’t yield exciting gains for SpiderMonkey, maybe 2% overall and 5-20% on the “good” benchmarks. (I guess that’s not too bad, the 2% overall just seems really weak.) But inline threading enables a bunch of other cool optimizations that could speed things up more. I’ve only done the easy ones so far, and that got the 48% speedup on 3bit.

Easy optimization #1 is to stop updating the SpiderMonkey PC. The call and inline threaded code tells what op handler to run next, so the PC is unnecessary. Actually, some ops do use the PC, and I’d like to make them not do that someday, but in the meantime I can just generate code to set the PC in my native code buffer:

mov 0, [pc]
code for JSOP_GETARG
mov 2, [pc]
code for JSOP_INT8
call &JSOP_ADD
mov 5, [pc]
call &JSOP_RETURN

It looks like I only saved the PC update for JSOP_ADD, but it’s better than that because the standard code has to load the PC, increment it, and then store it back, which I’ve replaced with just one instruction. The PC optimization is relatively easy and saves 0-3 instructions per op, which means a 0-10% speedup by itself. And it actually works.

“Easy” optimization #2 is to specialize certain opcodes. Take JSOP_INT8 as an example. This op pushes an integer onto the interpreter stack. That’s 3 or 4 instructions, to manipulate the stack pointer and store a value to it. But it also has to get the value out of the opcode stream and convert it to a jsval (the tagged value type of the interpreter), so it’s actually 9 instructions. (And JSOP_INT32 is much worse because it has to fetch 4 bytes from the opcode stream and shift and or them together.) But if we’re inlining JSOP_INT8, we can just inline the 3 or 4 instructions using the actual value, a version of JSOP_INT8 “specialized” for the given value. I do the specialization by writing the op in assembly code with a dummy value (0xdeadbeef), finding the location of the dummy as part of interpreter startup, and then patching that location as I inline. This seems crazy, but all the other design options seem crazy in their own ways, so I went with it for now. With specialization in play, the example looks like this:

code for JSOP_GETARG specialized for slot 0
code for JSOP_INT8 specalized for jsval for 3
call &JSOP_ADD
mov 5, [pc]
call &JSOP_RETURN

Because JSOP_GETARG and JSOP_INT8 use the PC only to get their arguments, which specialization bakes in, we get to remove some more PC updates too.

Compiler nitpickery. In itself, all this stuff works, and it’s been used in various research projects before, and probably some other interpreters by now. But this particular implementation of it depends a fair amount on what the compiler does: at least (a) that the compiler doesn’t reorder small ops too insanely, (b) that you can take the address of labels (to get the start and end of op handlers), (c) patching in some ‘ret’ instructions at runtime (needed to solve an obscure problem I didn’t feel like discussing here), and (d) that the compiler doesn’t deoptimize too badly when you do all this. That seems possible, but it’s not clear yet that this will play well across different versions of GCC and ICC. (Crazy side note: in my tests, GCC 4.2 on Mac regresses SpiderMonkey SunSpider by 5% vs. GCC 4.0.)

SquirrelFish

If you’re reading this, chances are that you already know about SquirrelFish, Appl/WebKit’s new Javascript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, 46% faster than Spidermonkey and 52% faster than TT (Tamarin Tracing) on SunSpider.

Clearly we have some work to do. The plan is to improve TT so that hot loops run highly optimized native code; TT’s optimizer is in the early stages, and we think there’s a lot of room for further optimization. For example, as explained in my previous posts, when TT jumps from one trace to another, it has to save the interpreter state to a standard format and then reload the state on the new trace, and no cross-trace optimization is possible. Ideas like trace folding have potential for a big improvement here.

But I still wanted to take a quick peek into why SquirrelFish’s interpreter is so fast. The announcement touts 3 key design features:

  • Using a bytecode interpreter.
  • Using direct-threaded interpreter dispatch.
  • Using a register-based bytecode interpreter.

Bytecode interpreter. The old WebKit JS was based on an AST walk, which is explained in some detail in the announcement. It’s well-known that bytecode interpreters are faster, so it’s no surprise Apple made this change. Spidermonkey and TT’s interpreter have always been bytecode interpreters.

Direct-threaded dispatch. Interpreters tend to spend a lot of time dispatching bytecode operations. Direct-threaded dispatch is a technique for efficient dispatch.

The obvious way to write a bytecode interpreter is with a switch statement inside a loop:

void run(Bytecode *ip) {
for (;;) {
switch (*ip++) {
case OP_ADD: …
case OP_JUMP: …

Each iteration runs one bytecode instruction. Each case of the switch handles one instruction type. It really doesn’t look like there’s any room for improvement here unless you look at the assembly code generated for the switch dispatch (from a tiny test interpreter I wrote today, comments added by me):

# ip is in %edx
# Check that switch expression (%edx) is in table range
cmpl    $9, (%edx)
leal    4(%edx), %ecx
ja    L26
# Look up case address offset for (%edx) in table (L37)
movl    (%edx), %eax
movl    L37-”L00000000006$pb”(%ebx,%eax,4), %eax
# Add base address to offset
addl    %ebx, %eax
# Indirect jump to computed address
jmp    *%eax

The basic idea is to keep a table of relative offsets to the cases, and then jump using that offset. Because the switch expression could evaluate to anything, the compiler must first generate a range check, so that if the switch expression doesn’t map to any case, the program leaves the switch instead of crashing unpredictably.

But in reality the range check is useless, because the interpreter can control what bytecodes actually appear, so this is 3 wasted instructions. Also, the lookup and base+offset address computation seems kind of clunky. I’d prefer dispatch code something like this:

jmp    *%edx

This is direct threading. In principle, the idea is simple: the instruction code (e.g., OP_ADD) is the address of the case target code. (In the basic design, instruction codes are integers.) Then, this jump is all you need for dispatch.

Coding up direct threading is weird; normal C compilers don’t know how to do it. But it can be done with GCC’s computed goto extension (see the paper on direct threading from the SquirrelFish announcement). See also my tiny interpreter.

I believe TT and Spidermonkey use an intermediate design called indirect threading, which gets most of the speedup of direct threading, but allows integer opcodes. The opcode is an index into a table of case target addresses. So the dispatch code has to look up the case target, then jump, something like:

leal   TABLE(%edx,4), %eax
jmp   *%eax

Not too bad. I have no idea how significant the difference between direct and indirect threading is in practice, but even a few percent speedup would be great.

Register-based interpreter. “Traditional” interpreter design keeps all operands and intermediate values of expressions on a data stack. (This is completely different from the call stack.) I can think of a few design reasons why this might be, but I really don’t know why it’s been done this way. Anyway, with a stack, bytecode for a line of code like “x = a + b + c” looks like this:

LOAD a  # push a onto stack
LOAD b
ADD      # pop 2 elements, add them, push the result
LOAD c
ADD
STORE x

One nice thing about this is that the algorithm for generating this bytecode is very simple, so it’s not a lot of work to code and runs fast. But there’s a lot of code just to manipulate the stack here, and it might be nice to avoid it. (Real stack-based programs have even more stack manipulation, with DUP, SWAP, ROT, etc. operations.)

A register-based design avoids the stack by using a fixed array of “slots” or “registers” for operands. (These registers are completely different from assembly code registers.) The register-based version of the above line of code would be something like:

ADD temp, a, b   # temp = a + b
ADD x, temp, c

Keep in mind that in the register-based bytecode, where I have “x” it would actually have something like “0″, if x had been assigned slot 0 in the register table. Note also that the bytecode generator has to analyze the code a bit to decide how many registers are needed and assign them to the variables and temporaries.

Each design has advantages and disadvantages:

  • Bytecode code size. Stack-based bytecode programs tend to be smaller, because the operands are implicit. But the register-based program gets to omit the stack manipulation instructions, so the stack-based bytecode should be only a little bit smaller. The memory savings shouldn’t matter in most environments, but reading less bytecode will save the interpreter a bit of time.
  • Bytecode instruction count. As shown above, a register-based program will have fewer bytecodes. Fewer bytecodes to execute means faster run times, if it takes the same length of time to execute a bytecode in each design. Which it doesn’t:
  • Operand access. To access operands, the stack-based interpreter just reads and writes from the top of the stack or very near it. A register-based interpreter needs a two-step process: first it has to get the address of that register, then read or write it. But the register-based version doesn’t need to adjust the stack. Still, it seems like more work per instruction for the register version: for ADD, the register version needs to compute 3 operand addresses, while the stack version just one (the new stack pointer).

Overall, when you look across instructions, I think the total amount of operand addressing computation is about the same. The stack version distributes it across more instructions. The stack version might be able to save a few by having special instructions like LOAD0 (to load slot 0, needing no computation). But the register version still has few instructions, so fewer dispatches. With direct threading, the dispatch is fairly efficient, so this is less important, but an indirect jump is still usually pretty expensive compared to a normal instruction because of the high branch mispredict probability.

Microbenchmarking. I wrote a tiny interpreter that runs a bytecode program for an empty for loop using doubles as numbers (as in JS) to test out these things for myself. I tried 4 design choices: direct threading vs. switch and stack vs. register. The times for 10M iterations in milliseconds:

Stack Register
Switch 230 90
Direct 105 55

Here you see huge differences. The reason the differences are so big (far bigger than WebKit got) is that my tiny interpreter’s opcodes are very simple. In a real JS interpreter, the code to run for the average operation is a lot longer, so dispatch overhead is a much smaller fraction of total time, and dispatch overhead is the main thing saved by direct-threading and registers.

Note that the speedup of register vs. stack is smaller in the direct-threaded case, as I predicted above. But not that much smaller, because the dispatch really is expensive compared to everything else in this simple interpreter.

You can get my microbenchmark code here.

Tamarin Tracing Internals V: Running Compiled Traces

Whew. Reading all this TT code is fascinating, but also tiring, hard work. Anyway, I’ve hit almost all the high points by now, and I’ve traced out the JITting process all the way from ABC bytecode to native compiled traces. The questions I have left are about how traces actually get run, plus some related questions I’ve avoided about what side exits really are and how they work.

Running Traces. The initial entry point into compiled code is back in Interpreter::loopedge, the same method that initiates tracing (see Part III). loopedge always checks to see if there is a compiled trace for this loop header. If so, it executes the compiled trace. (Look for the label callfrag.) Here’s the call:

lr = (*u.func)(&state, 0);

The first argument is a pointer to the interpreter state. I think the second is something used only in debug modes. The result is a pointer to GuardRecord, which is defined in Assembler.h. The comment reads: “These objects lie directly in the native code pages and are used to house state information across the edge of side exits from a fragment.”

The key member of GuardRecord is Fragment* target, which gives the destination fragment (loop header) of the exit. If the destination is not a loop header (target == 0), the destination will be made into a fragment so that it can be traced if it becomes hot. The destination fragment will then get its count incremented, and if it is now hot, tracing starts immediately.

Trace Exits: LIR. I need to back up a bit in order to fully understand how trace exits work.

During trace recording, branch instructions (e.g., IL LBRT) require special handling. The trace is linear, so we just generate straight-line LIR according to the branch that was actually taken. This is fundamental-we are guessing that since we took a certain branch now on a hot trace, we’ll probably take the same branch many times more, so the program will run fast if we generate straight-line code for this case. But of course, on any future execution, we’re not guaranteed to take the same branch again, so when we pass this point, we have to do the check again and exit the trace if the we get the opposite result. The check is called a guard, and the exit is called a side exit. Here is an example from IL->LIR trace generation debug output:

T 11D6BE  BRF   -8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
35 imm   #0
36 eq    33,#0
GG: ip 11D6C2 sp 100E0B4 rp 100616C
45 xf    36 -> 11D6C2

The IL instruction is BRF, “branch if top-of-stack is false”. In this case, the top of the stack is d:10, i.e., 10.0, so the interpreter doesn’t take the branch. But we’re more interested in tracing. Tracing of branch instructions is implemented by Interpreter::jump_if. First, jump_if emits LIR for the test, specifically to test if the top of the stack is zero. This is the “imm #0″ and “eq 33, #0″.

Now comes the scary part, calling Interpreter::guard. I would tend to consider the effect of this function more to be generation of a trace exit, but it’s called guard, probably because it generates the branch instruction for side exits. But it is also used for LIR_loop instructions, which don’t even really have guards.

Naming questions aside, for side exits, as in our example, the first thing guard does is print out the “GG” line (if in debug mode). The rest of the line shows some interpreter states, and is probably helpful if debugging TT. Next, guard generates a SideExit structure (Fragmento.h) inline with the LIR to describe the exit. The SideExit records:

  • The interpreter state (frame, stack, return, and instruction pointers) as offsets from the interpreter state at the start of the trace.
  • The trace.
  • The target of the exit as a fragment, i.e., the (potential) start of another trace.
  • The current ActionScript call depth.

This records interpreter state that is not otherwise encoded. When I went over LIR generation and optimization, I realized that the LIR contains all the store instructions needed to maintain the current interpreter stack data. (Some are optimized away in the dead store elimination pass.) But the LIR doesn’t update the interpreter state’s fp, sp, rp, or ip. At every exit we might be going back to the interpreter, so we need to recreate the full interpreter state. The SideExit contains the necessary information.

After writing the SideExit, guard generates an LIR branch instruction. In our example, we should exit if the test is false, so we generate an LIR_xf. Note the gap in instruction sequence numbers-this is because of the space taken up by the SideExit.

guard handles LIR_loop exits (jumps to the trace header) a little differently. Instead of writing a SideExit, guard emits LIR instructions that directly update the interpreter state. I’m not entirely sure why this is. I also think that in most cases, no adjustments are required, because the interpreter stack size and types should be the same every time control pases a given point. It may have something to do with recursion.

Trace Exits: Native Code. A trace exit in LIR is a LIR_xf, LIR_xt, LIR_x, or LIR_loop. These all have cases inside Assembler::gen. For xf, xt, and x, the assembler calls asm_exit to generate exit target code, then generates native JMP/JE/JNE/Jx instructions that branch to the target. For loop, the assembler just generates a JMP instruction.

asm_exit is hard to understand, but I think I have the gist of it. The key action is calling nFragExit, which generates the exit target code. This code is generated on a separate page that is allocated for trace exits at the beginning of assembly (_nExitIns is the current position). nFragExit takes the SideExit struct as its argument. The SideExit gives the target of the exit as a Fragment, which is a loop/trace header that may or may not have a compiled trace. Reading backwards, nFragExit generates code to:

  • Update the interpreter state using the offsets recorded in the SideExit.
  • Ensure that param 0 of the trace is stored in the standard param 0 argument passing register. This is needed if the exit code is ever set up to jump directly to another trace-that trace will expect param 0 in the usual place. (Param 0 is a pointer to the interpreter state.)
  • Return a newly created GuardRecord (Assembler.h). The GuardRecord is the native code equivalent of a SideExit. Like SideExit, it is stored inline with the code (the native exit code). The GuardRecord is created by placeGuardRecord and holds the current fragment, target fragment, and call depth.
  • Restore the ISA stack pointer (x86 esp).
  • Jump to the trace epilog.

The trace epilog, by the way, is the same for every trace, and on x86 it pops the ISA frame pointer (efp; twice, because it is pushed twice for some alignment reason) and returns. This is just the “second half” of the standard C return-from-function sequence.

The exit code can be summarized as updating the interpreter state and then both doing the “first half” of return-from-function and preparing a function call to another trace. That way, the ending JMP can be pointed at either the main exit to the interpreter, or made to jump directly to another trace, and either works fine.

Another detail is that if the target of the exit has already been compiled to native code, instead of generating a jump to the trace epilog, nFragExit generates a jump directly to the target trace. (It also skips creating the GuardRecord). This is nice because then the code doesn’t have to return to the interpreter at all, it just keeps executing native code.

asm_exit wraps the call to nFragExit with a pair of calls to swapptrs. This is a macro defined in Native*.h that swaps the pointer to the current position in the native trace code buffer (_nIns) and the current position in the native exit code buffer (_nExitIns). This is just so the macros that generate code can always refer to _nIns as the place to store native code.

Finally, asm_exit does a bunch of fancy register allocation stuff. I don’t completely understand it, but I think it’s just needed because the register allocation algorithm is a greedy algorithm for straight line code, and it needs a little tweak when there is a branch. It looks like asm_exit first saves a copy of the allocations and then clears them out so the exit code area has a clear set to to work with, as it should (the only data passed out of the exit via registers are the return value and param 0, which the exit code does set up). Once nFragExit returns, the register tracker now has some allocations for values that are needed in the exit code if any. At this point, mergeRegisterState is called with the current register tracker and the saved tracking data to fix everything up. The fixing is basically that if the exit code expects, say, ecx, to contain a certain value, and the main trace has a different value in ecx, a move needs to be generated at the start of the exit code to get the exit code’s value into ecx.

Reentrancy. One last thing I want to think about is the issue of reentrancy. We’ve been told that TT isn’t reentrant. Specifically, a native method (implemented in C++) can’t call back into ActionScript. But I never clearly understood why this is. I’ll probably be wrong about half of this: experts, please jump in and correct.

The problem could exist at multiple levels, but I think the simplest issue is that the interpreter isn’t reentrant, for the usual reasons of having interpreter-global data structures. For example, a reentrant interpreter would need to have a mechanism for recording the reentry on some sort of stack. Also, if the native method interacts with the Forth stack, the system would need to be very careful about managing that. None of this seems fundamental, just tricky and not done yet.

The other question is what happens to tracing with reentry. One possibility is to stop tracing when entering a possibly reentrant native method, and then possibly start tracing when a native method calls back into ActionScript (i.e., consider a reentry to be a fragment header). This seems like it would work. Another possibility is to allow some declarations on native methods to describe their effects on the interpreter state, so that tracing could actually continue through the reentrant calls. Such a mechanism sounds hard to use, though, and would probably be used only on really important methods in a few places, if at all.

Tamarin Tracing Internals IV: Trace Optimization

In part III, I went over how TT generates LIR traces. Now, I’m going to look into the trace optimization and machine code generation process. The code for this is mostly in the nanojit/ directory.

Keep in mind that a trace is always straight-line code in SSA form. This makes optimizations easier to implement, so it has a big effect on the design. By the way, a lot of the material on SSA is confusing, and also goes straight into a lot of complexity that’s not needed for TT, so I’ll give a quick explanation here.

SSA. SSA stands for static single assignment, but don’t bother trying to parse that. It just means that each virtual register in the trace appears on the left-hand side of exactly one assignment statement. This is automatically true for a TT trace, because the virtual register assigned to is implicitly the sequence number of the instruction. Note: locations that are not virtual registers, such as a slot on the data stack, may be assigned to multiple times; the SSA property in TT holds only for virtual registers.

The advantage of SSA is that any time you see a virtual register, you can just look at its one assignment statement, and you immediately know what it was assigned from, what kind of operator was used, whether the operands were constant, etc. Without SSA, there are multiple possible assignments, and an optimizer has to try to discover which assignments can actually reach the current point and then summarize their effects, which is slower and less precise.

Constant Folding. TT performs constant folding as part of LIR generation. I discussed that in part III along with LIR generation, but I mention it again just to have the full list of optimizations here. Constant folding means transforming code like “a = 3 + 4″ to “a = 7″, i.e., replacing a constant expression with the result of evaluating that expression. I should note that constant folding is also used with branches: a branch instruction with a constant conditional expression is dropped entirely, because the next instruction on the trace is simply the actual target of the branch.

Ending Tracing. Trace optimization starts when a “complete” LIR trace has been generated. In principle, the tracer could stop tracing whenever it wanted to, so there’s no particular completeness property. I just mean that optimization doesn’t start until tracing stops. This is necessary because some optimization passes go backward over the trace.

Tracing is stopped by Interpreter::eot (end of trace). (eot is often invoked by the macro EOT defined in Interpreter.h, which records the reason for ending the trace in debug builds only.) Most of eot is just debugging and error checking. The key call is:

compile(strk.location, rtrk.location, assm, tracefrag);

The compile function is defined in LIR.cpp and performs constant subexpression elimination, dead store elimination, and assembly. There is code to perform lifetime splitting just before assembly, but it is guarded by if (false) in the current code.

Interpreter::eot is called if any of these conditions occur:

  • (Interpreter::loopedge) The trace follows more than MAX_XJUMPS “cross-jumps”. A cross-jump is a backward branch that does not go to the loop header (i.e., the start of the trace). A cross-jump indicates the presence of nested loops. In the current standard configuration, MAX_XJUMPS is zero.
  • (Interpreter::pre_trace) The trace contains at least MAX_BLKS guards (i.e. side exits). This is checked before tracing each instruction. In the current standard configuration, MAX_BLKS is 10,000. There is a comment next to this check, “count # of guards to minimize heisenbugs”. I’m not sure what this means, but it might be something to make the point at which traces end more deterministic.
  • (Interpreter::eot_untraceable_prim) The current instruction cannot be traced, e.g., a general native method. This is called by the trace implementation for all untraceable primitives.
  • (Interpreter::eot_if_max_exits) The trace has returned from over MAX_EXITS ActionScript functions. Recall that tracing can go right through both function calls and returns. Tracing can start in the middle of a function, and the trace could go through many returns. This is called when EXITABC is traced. In the current standard configuration, MAX_EXITS is 2.
  • (Interpreter::eot_if_max_copies) This one’s a little tricky. The purpose is to avoid the trace explosion problem: code with k sequential if statements can generate 2^k traces, and if k is too big, this will use up all memory and crash the program. So TT counts the number of copies of an instruction that exist in different traces. If the count exceeds MAX_COPIES, TT ends the trace. Thus, the total memory used can be no more than (#instructions) * MAX_COPIES. Note that TT saves work by counting the number of copies only for CFGMERGE instructions, which are special no-ops that Verifier generates at all control-flow join points. An instruction is copied only if the CFGMERGE above it is copied, so counting CFGMERGEs is good enough. In the current standard configuration, MAX_COPIES is 2.

I understand the purpose of eot_untraceable_prim and eot_if_max_copies, but not the MAX_XJUMPS, MAX_BLKS, and MAX_EXITS conditions, so please comment if you know.

Constant Subexpression Elimination (CSE). This is a standard compiler optimization. CSE replaces code like this:

x = y + z;
w = y + z;

with code like this:

x = y + z;
w = x;

This saves an operation, and may enable additional optimizations now that w is an exact copy of x.

Given the SSA property, two expressions that apply the same pure operation to the same virtual registers, e.g., “v1 + v2″ in “v17 = v1 + v2″ and “v20 = v1 + v2″, are always equivalent. A pure operation is one that has no side effects and depends only on its arguments. In TT, this includes basic arithmetic operators and also functions that are marked PURE-FUNCTION in Forth, e.g., stringlength.

In TT, CSE is performed by nanojit::cse in LIR.cpp (nanojit:: is a namespace qualifier). cse performs a single forward scan of the trace, detecting CSEs and replacing them as it goes. This is the classic “value numbering” technique. Detection is based on a hashmap where the key is the operator and operands. (See LInsHashSet in LIR.cpp, especially LInsHashSet::_equals.)

Replacement is performed by overwriting the redundant operation with a special LIR_tramp (trampoline) instruction. A LIR_tramp is simply a reference to another instruction in the trace. (In detail, a LIR_tramp has a 24-bit offset operand: if the sequence number of the LIR_tramp is i, and the operand is offset, then the instruction is a reference to the instruction at position i+offset.) My abstract CSE example above might look like this in real LIR with tramps:

15 fadd 5, 6 // 15 is sequence number/destination; 5, 6 are operands

24 fadd 5, 6

becomes:

15 fadd 5, 6

24 tramp -9

LIR_tramp isn’t an executable instruction. Rather, X tramp -Y is a directive: “Whenever you see X as an operand, instead read it as X-Y.” It’s almost like a macro definition inside LIR. “Macro expansion” is performed automatically inside LIns::oprnd1 (the getter for the first operand of a LIR instruction) and related methods. The effect is that any operand that CSE can detect as equal to X will be named X thereafter. This, in turn, makes it easy to tell that all these Xs are equal and exposes more opportunities for CSE.

Here’s an example:

Treating tramps in this way is part of value numbering and helps optimize code like this:

3 fadd 1, 2
4 fadd 1, 2

15 fadd 3, 8
16 fadd 4, 8

3 and 4 are clearly redundant. So are 15 and 16, but they have syntactically different operands, so it looks like CSE will miss them. But it doesn’t. It converts the first two instructions to:

3 fadd 1, 2
4 tramp -1

At this point, the next two instructions are now read as:

15 fadd 3, 8
16 fadd 3, 8

and CSE gets the second redundancy as well. This example also shows why CSE is done as a forward pass: CSE applied at one point may create more opportunities for CSE farther down, so we can do all the CSEs in one pass only if we go forward.

Redundant Store Elimination (RSE). This is a restricted type of dead code elimination (DCE). Informally, RSE replaces this:

x = a + b;
x = y + z;

with this:

x = y + z;

The first value of x is overwritten before it can be used anywhere else, so the first statement can be eliminated if it has no side effects.

Note that there are 2 assignments to x here in my example, so it is not in SSA form. This is correct: this TT pass is only applied to store instructions (LIR_st and LIR_sti), which can write to the same location multiple times.

General DCE can also eliminate instructions whose values are not overwritten but are never read anyway, but we can’t do that with TT stores. The reason is that TT stores write values to the interpreter stack, which may be used later once we exit the trace, by the interpreter or the next trace. While looking at the current trace, we really have no idea which of those values is used later on, so we have to keep them all. But TT does apply general DCE to non-store instructions, and for those instructions DCE is incredibly easy.

TT RSE is performed by nanojit::rmStores as a backward pass. rmStores scans the instructions, keeping track of (a) the depth of the stack, and (b) which stack positions (starting from the bottom) are stored to. rmStores does the tracking for both the data stack (sp) and the return stack (rp). For each store instruction, rmStores determines the stack position stored to, and removes the store if that position is (a) above the top of the stack, or (b) is stored to later on (i.e., earlier in the backward scan). Case (b) is just as in the example above. Case (a) picks up situations like storing a value to the top of the stack and then DROPping it.

rmStores must handle side exits specially. As above, we have to make sure the interpreter stacks are “correct” (i.e., look exactly how they would look if the interpreter had been running) when we exit the trace. This applies to side exits as well. So when the backward scan passes a side exit, it must mark everything on the stack as potentially live (by clearing the “stored-to” bits in its scan record).

This is important: it shows that side exits preclude some optimizations.

Assembly. This converts LIR to ISA (instruction set architecture code-the TLA way to say “machine language”).

The TT assembler performs register allocation (mapping the unlimited virtual registers to the very limited ISA registers) simultaneously. Offline compilers do register allocations by applying approximation algorithms to NP-hard graph-coloring problems, but the compilation time is too long for a JIT like TT, so TT uses an integrated single-pass greedy allocator. Note that nanojit/RegAlloc.h is not the register allocator: it’s just a data structure for tracking register mappings and free registers.

Assembly is platform-specific, so TT needs a mechanism to build different assemblers for different platforms. Here’s one of the key bits (in nanojit.h):

#include “Native.h”
#include “LIR.h”
#include “RegAlloc.h”
#include “Fragmento.h”
#include “Assembler.h”

Native.h includes another file, Native*.h (e.g. Nativei386.h), controlled by preprocessor defines. This imports platform size constants, register set definitions, and macros for code generation (e.g., CALL). Assembler.h defines the Assembler class. Assembler.cpp contains platform-independent assembler logic, including methods of Assembler, customized by referring to variables and macros defined in the platform-specific header. There’s also some stuff controlled by defines like NANOJIT_IA32, generally short code snippets that interact closely with otherwise platform-independent code. Finally, there is a Native*.cpp, which contains other methods of Assembler that are defined in a purely platform-specific manner.

The TT assembler works backward. I think this is because it does a few last optimizations which work best in a backward pass. Keep this in mind reading methods like Assembler::genEpilog-the first instruction generated is the return.

Assembler::assemble is the entry point, and Assemble::gen is where the per-instruction work is done.

Assemble::gen a lot of details-I’ll just look at an examples. Here’s how a LIR_imm (place a constant value “immediate operand” in a virtual register) is assembled:

case LIR_imm:
case LIR_imm32:
{
Register rr = prepResultReg(ins, GpRegs);
int32_t val;
if (op == LIR_imm32)
val = ins->imm32();
else
val = ins->imm16();
if (val == 0)
XOR(rr,rr);
else
LDi(rr, val);
break;
}

The first step is to call prepResultReg to pick a register to store the result in. I’ll look at that later, but for now I assume it just works. The next step is to get the constant value itself from the LIR instruction. Finally, we call the LDi macro to generate the instruction, unless the constant is zero, in which case we just XOR the register with itself (x XOR x = 0 for all x), which is faster (although I don’t know why). The macros aren’t very exciting reading–they just do the bit-bashing to generate ISA opcodes, addressing mode bits, and operand encodings.

Register Allocation. This code is pretty complicated, but I think I can outline what it does. The algorithm is conceptually simple; the complexities come from dealing with platform-specific details and special cases.

The algorithm tracks the set of free registers and the mapping from virtual registers to machine storage. The machine storage is represented by the Reservation class, which can name an ISA register, an activation record (stack frame) location, or both.

The first step for most instructions is to allocate registers to use for the operands.

Assembler::findRegFor finds a register that holds the result of a given LIR instruction (e.g., an operand). If the LIR instruction has already been assigned a register, it returns that register. Otherwise, it searches for a free register and records the mapping.

One of the complexities is the second argument to findRegFor, RegisterMask allow. allow represents a set of allowed registers-the returned register must be in this set. This is needed because some operations can only be used on certain registers. In some cases, the value can’t be allocated directly in the allowed set, e.g., because it is computed by an instruction that cannot output to that register. Then TT issues an extra move instruction.

It is possible that there is are no free registers. In this case, the solution is to spill a register. This means we pick a victim LIR instruction currently in a register, and store and load it around the current instruction. As a general example, we might have two different values that need to be computed into eax:

mov eax, ebx  // first instruction writing eax
add ?, ecx       // want to use eax, but it’s occupied
add esi, ?        // want eax from previous
add edx, eax   // first eax again

We spill the first writer of eax like this:

mov eax, ebx
mov ebp[-8], eax // spill eax to memory
add eax, ecx        // eax now available
add esi, eax
mov eax, ebp[-8] // restore first eax
add edx, eax

That’s simple enough, but it looks kind of tricky in TT, because TT doesn’t see the code all at once to make this transformation, but instead does adds the spill and restore code during its backward pass.

In this example, TT would detect that a spill is required when assembling “add esi, ?”. It needs to use eax, but eax is already in use, so at this point, it knows that it has to emit code to restore the victim value to eax. This is done by Assembler::asm_restore, which finds a free memory location for the victim and emits code to restore from that location. It also records that memory location in the victim’s reservation so it will know to spill the value later on. Note that for constant values, asm_restore knows it doesn’t need to load them from memory, but can just use an LDi (load immediate).

Once the operands are allocated registers, the assembler selects a register for the result using  Assembler::prepResultReg. This again calls findRegFor. But in this case, a register was probably already allocated by an later instruction that uses this result, and that register will be returned directly. Now, if the register we select was previously selected to be spilled (in asm_restore), we need to generate the spill code. This is done by calling Assembler::asm_spill, which checks the reservation to see if a memory location has been allocated to this instruction’s result. If so, a store instruction is generated.

The register allocator is closely related to the DCE (dead code elimination) mechanism. Assemble::gen calls Assemble::ignoreInstruction on each instruction to see if no code should be generated. The basic idea is that if no storage has been reserved for the result of an instruction, then nothing ever reads the result, so it is dead (as long as there are no other side effects). LIR_tramps are always ignored, which fits with my earlier description of their being a special kind of nop.

Note that all of this is done backwards. Even the debug output is generated backwards and then reversed for printing. So if you want to read the assembler output in the order that TT processes things, read it backwards.

Assembly/Register Allocation Example. Here’s a bit of debug output from the assembler, showing a spill/restore:

58 qlo   47
010E26AE  movd ecx,xmm0                esi(8) edi(6) xmm0(47)
spill 58
010E26B2  mov -8(ebp),ecx                ecx(58) esi(8) edi(6)
60 eq    58,#0
69 xt    60 -> 11DED2
010E26B5  test ecx,ecx                         ecx(58) esi(8) edi(6)
010E26B7  je 10E3512                           ecx(58) esi(8) edi(6)
GID 49
71 arg   #4
010E26BD  mov edx,4                            ecx(58) esi(8) edi(6)
72 arg   58
73 call  getslotvalue_box
010E26C2  call 309E0:getslotvalue_box        esi(8) edi(6)
restore 58
010E26C7  mov ecx,-8(ebp)                      esi(8) edi(6)

The lines that begin with decimal numbers are LIR instructions. The indented lines that begin with hex addresses are the generated assembly. The assembly lines also show the current mapping of LIR instruction results to ISA registers. There are also some notes about 58 being spilled and restored. The “GID” line indicates a guard.

Let’s read it bottom to top. The last LIR instruction is “call getslotvalue_box”, which is a native call. Native calls potentially overwrite certain registers, including ecx. The is currently a value in ecx, the result of instruction 58. (This reservation was made earlier, i.e., farther down in the trace.) This value must be spilled. But that will happen further up. For now, TT just selects a spill location, -8(ebp), and emits the code to restore that location to ecx. Now, TT can emit the call instruction.

The previous instruction is “arg 58″, which means to load 58 into an argument storage location to passed to a call. The LIR_arg instruction encodes the storage location, and it’s not shown the debug output, but apparently this one is supposed to go in ecx. Because the value is already in ecx, no code is necessary, and none is generated. This logic is accomplished by the function Assembler::findSpecificRegFor, which is a thin wrapper that just calls findRegFor with a single allowed register. As explained above, if the allowed register can be reserved for that instruction, it is, and if not, a move is emitted.

The previous instruction is now “arg #4″, which means to load literal integer 4 into an argument storage location, this time edx. The argument is a constant, so all this has to do is emit an LDi instruction. There is no need to allocate registers because edx is potentially written by the call, so if an instruction was using edx, we would have selected it for spilling when we processed the call. At this time, edx is automatically available.

The previous instruction is “xt 60 -> 11DED2″, an “exit if true” instruction. Exiting from traces is complicated, so I’m going to leave most of this for later. For now, just note that this instruction generates both the comparison instruction test and the branch instruction je. This is because of a classic compilation issue, which is that relational operators get compiled differently according to whether the result is used by a branch or by an arithmetic operator. TT’s solution is to compile the relational operator as part of compiling the branch, and then ignore it later, as shown on this trace.

Now we reach the first LIR instruction, “qlo 47″, which picks out the “low” half of a quad (64-bit operand). The result has been reserved as ecx, but memory has also been reserved (when we generated restore code earlier), so we know we need to spill the result now. After that, we can generate the move instruction for qlo.

Next time: running compiled traces.

Tamarin Tracing Internals III: LIR

Program Form 3: LIR. I believe LIR stands for low-level intermediate representation (although I’ve also heard linear intermediate representation). Typically, in a compiler or VM LIR is the lowest-level (and last) form of machine-independent compiler representation, and looks much like a machine-independent assembly language. TT’s LIR plays the same role but has some special features because it is designed specifically for efficient trace compilation. The most important feature is that a LIR trace is always straight-line code, with one or more exit points, but no targeted branches. (Actually, there has to be a target, but the target is not part of the LIR, it’s the address of an IL instruction.)

The LIR instruction set is defined in nanojit/LIR.h. LIR is much simpler than IL, with fewer than 64 opcodes. Most of them are familiar, e.g., LIR_add for integer addition. LIR_imm (”immediate”) sets a constant value, contained directly in the instruction word. All the control-flow instructions are exits: e.g., LIR_x to exit unconditionally, or LIR_xt to exit if a condition is true. The exception is LIR_loop, which jumps back to the start of the trace. TT traces always start at loop headers, so this is important. See also Mason Chang’s post on the LIR instruction set.

A LIR instruction is 32-bits, with 8 bits for the opcode and up to 3 operands. As with machine ISAs, there are different forms of instruction words to accommodate multiple operands, 24-bit immediate operands, etc.

LIR operands work differently to IL operands. Recall that IL is stack-based, so IL’s IADD takes its two arguments off the stack and pushes their sum. Because machine languages like x86 are register-based, not stack-based, LIR is also register-based. More precisely, LIR uses virtual registers (my terminology, not TT’s), which just means it can have as many registers as it wants. Mapping those registers onto the finite set available on each processor is the job of a lower-level register allocator.

LIR has an interesting implicit scheme of naming the virtual registers. Each LIR instruction has a sequence number, according to its position in the trace. The sequence number is (implicitly) also the name of the virtual register where the result is stored. Later instructions can use this result by sequence number. For example,

// Instruction 13: test whether result of instruction 12 equals immediate #48
13 eq    12,#48
// Instruction 22: exit if result of instruction 13 false
22 xf    13 -> 10C41CA

This design saves space, because result operands don’t need to be named implicitly. It also automatically represents instruction dependencies. Finally, because of this design, LIR is necessarily in SSA form, which will make optimizing traces easier.

Transformation 2->3: IL->LIR. This is the actual trace generation. So while the earlier steps translated a method at a time, in this step TT operates on a trace (code path) at a time. The basic idea is that while TT executes IL, it will generate a trace of LIR instructions for the code path it follows.

Activating Tracing. TT traces always start at loop headers. A loop header is any target of a backward branch-the IL generator ensures backward branches are loop edges. Recursive methods are also a form of loop, and so TT treats a recursive call the same as a loop edge.

Every time the interpreter follows a loop edge, it checks to see if it should begin tracing. This is encoded into vm_*_interp.h with the INTERP_CHECK_LOOPEDGE macro, which wraps a call to Interpreter::loopedge. Interpreter::loopedge maintains a count of the number of times it has seen each loop header address. If the count exceeds a threshold (HOTLOOP, which is 10 in the current version), it calls Interpreter::sot (”start of trace”) to start tracing. sot initializes data structures used by tracing, emits some boilerplate prolog LIR, and sets the interpreter tracing state flag. Finally, control will exit from the interpreter (Interpreter::do_interp, but the return statement is found in the macro _CHECK_MODESWAP, used in INTERP_CHECK_LOOPEDGE), with a return value that tells the main system loop to enter tracing mode.

Tracing. Actual tracing of IL is performed by VMInterp::do_trace. do_trace continues interpreting IL, just as in the interpeter (do_interp). In addition, before interpreting each IL instruction, do_trace emits LIR for the instruction. For example, here is the do_trace implementation of my favorite IL op, IADD, from vm_fpu_trace.h:

INTERP_FOPCODE_TRACE_BEGIN(IADD)
interp.trace_binop(LIR_add, sp);
INTERP_FOPCODE_TRACE_END(IADD)
INTERP_FOPCODE_INTERP_BEGIN(IADD)
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
INTERP_ADJUSTSP(-1)
sp[0].i = tmp_i_0;
INTERP_INVALBOXTYPE(sp[0])
INTERP_FOPCODE_INTERP_END(IADD)

The second part is an exact copy of the code in vm_fpu_interp.h. I believe this code is produced by the Forth compiler. After preprocessing, the block above looks like this (in VMInterp.ii):

foplabel_TRACE_IADD: { {
interp.trace_binop(LIR_add, sp);
} ip += 1; {
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
sp += (-1);
sp[0].i = tmp_i_0;
} goto _goto_ip; }

The only addition for tracing is the call to interp.trace_binop(LIR_add, sp). In principle, trace_binop has a simple job: emit a LIR opcode for a binary operation. In reality, the tracer does some optimizations along the way and also must do some state bookkeeping.

trace_binop. Here is the method signature for trace_binop:

void Interpreter::trace_binop(_LOpcode op, const Boxp sp, BoxUsage insize, BoxUsage outsize);

op is the LIR opcode to emit in the trace. sp points to the top of the interpreter operand (Forth) stack. The other two arguments give the operand sizes, because operands can be 32 or 64 bytes, but I’m not going to worry about that just yet, and that aspect of the code is well-separated from the main logic.

Here is the body of trace_binop:

1        if (check_const_defc(2, sp, outsize)) return;
2        LirWriter* lirout = tracefrag->lirbuf->writer();
3        LInsp i = lirout->ins(LOpcode(op), use_q_or_lo(insize, sp-1), use_q_or_lo(insize, sp));
4        varset(sp-1, i);

Line 1 tries a constant-folding optimization: informally, if both of the operands are constants, the tracer will compute the result, which is constant, and emit a LIR_imm instead of the binary operation. But what exactly is a constant operand in LIR? Recall that an operand is just the index of an instruction that computes a value. If that instruction is a LIR_imm, then the operand is constant.

Here I should say a bit about the region tracker. (Fortunately, Graydon explained it to me, so I didn’t have to work hard to figure out that part.) Recall that in the interpreter, the operands are the top two stack elements. At the start of trace_binop, that’s all we know. So in order to get the LIR operands for tracing, trace_binop has to map the stack values to their corresponding LIR operands (i.e., the LIR instructions that computed those stack values).

The region tracker, (class RegionTracker in Interpreter.h), maintains this mapping and performs the lookup. Specifically, RegionTracker mains a map from addresses (const void *) to instructions in the LIR trace buffer (LIns *). The addresses are considered to address fixed-width elements in a range starting from a zero position. This is perfect for tracking a stack. Also, the mapping can be implemented as array lookup, which is simple and fast.

Region tracker operations are wrapped for the interepreter by inline methods like varof, which maps a Box* interpreter stack operand to a LIR instruction, and varset, which emits a store operation for a result and updates the region tracker with the new result.

Back to trace_binop. Line 2 just accesses the LirWriter (LIR.h), the class that does the bit-packing to create LIR instructions.

Line 3 emits the binary operation LIR instruction. The only fancy part is in use_q_or_lo, which uses the region tracker to map an interpreter stack operand to the LIR instruction that created it.

Line 4 emits a LIR_sti to store the result (and updates the region tracker accordingly). This surprised me, because if the result of a LIR operation is implicitly stored in its instruction sequence number, why is an explicit store needed? Looking at example traces, I see that there are few or no corresponding LIR_ld instructions, so most of these stores must be dead. I think the LIR_sti is storing to a memory location that also represents the result, and will be used later for spilling by the register allocator.

Example. LIR is so low-level that nontrivial IL tends to create more LIR than I want to put here, but I did find a readable example of adding 1 to a number. This is from the -Dverbose output of avmshell (\ is a line continuation-I added a line break to fit this format):

T 11E296     op_OP_increment_plus_000a \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
325 int   #11E298
327 sti   #11E298, #4(6)
T 11F4E6      LITC1 \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
328 imm   #1
330 sti   #1, #16(8)
T 11F4E8      I2D \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1
333 quad  #3FF00000:0
335 sti   333, #16(8)
T 11F4EA      DADD \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1
336 fadd  321,333
338 sti   336, #8(8)

The lines beginning with “T” show IL instructions. The “T” itself indicates that the line was printed while the interpreter was in tracing mode. The next number is the address of hte IL instruction. After that comes the IL opcode. Finally, the top 8 elements of the interpreter stack are shown, with the topmost at the right. The format of the stack element is type:content, where type is a boxtype constant from Box.h and content is a hex representation of the value.

The indented lines following each “T” line show the LIR generated for that IL instruction. The LIR output format is similar to typical assembly languages, and is:

instruction-sequence-number opcode operands

Operands that are just numbers are references to other instructions by sequence number. Operands like #8 are immediate operands. #16(8) is read “#offset(base pointer)”, i.e., it says to take the value computed by instruction 8 as a pointer, then add 16 to that pointer. Most of the bases seem to be for instructions in the trace prolog that load things like the stack pointer. In particular, at least in this trace, the base pointer 6 is the interpreter return pointer and the base pointer 8 is the interpreter stack pointer.

(Which leads me to ask, what are the return pointer and the stack pointer? They are part of the interpreter state (InterpState in Interpreter.h), which has 4 fields: the frame pointer, Frame* f; the instruction pointer, FOpcodep ip; the stack pointer, Boxp sp; and the return pointer, FOpcodepp rp.

* A Frame (StackTrace.h) represents an activation record (or “stack frame”) of the ActionScript code. Thus, Frames are pushed onto and popped from a stack as ActionScript methods are entered and exited. This stack is used for generating stack traces and security checks that depend on ActionScript execution context.
* The instruction pointer points to the currently executing (Forth) IL instruction.
* The stack pointer points to the top of the Forth data stack.
* The return pointer points to the top of the Forth return stack. This is the stack that implements call and return from Forth subroutines.)

Let’s see if I can understand how the tracer works:

T 11E296     op_OP_increment_plus_000a \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
325 int   #11E298
327 sti   #11E298, #4(6)

This first step is a call to a superinstruction. A just means pushing the return address (note that it is the current address plus 2) onto the return stack: #4(6) means an offset of 4 from the start of the return stack. In the LIR, the first instruction loads a 32-bit immediate value, and the second instruction stores that value on the return stack. Note also the missing sequence number: LIR instructions are 32 bits, so loading a 32-bit value is done with a load instruction followed by the value.

T 11F4E6      LITC1 \
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10
328 imm   #1
330 sti   #1, #16(8)

LITC1 (”literal constant 1″?) pushes the integer value 1. Here we load the immediate value 1, then store the value 1 on the data stack (recall that 8 is the stack pointer). I’m not sure why there is a missing sequence number: I think LIR_imm is just one instruction.

T 11F4E8      I2D \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1
333 quad  #3FF00000:0
335 sti   333, #16(8)

Note that the value “-3:1″ has been pushed onto the data stack at the right. I think in this case, the value is actually an unboxed int, so the -3 is just whatever happened to be in that memory location before, and only the 1 is significant.

I2D converts an integer to a double. Here you see we are using LIR_quad (load an immediate 64-bit “quadword”) to push the IEEE 754 double value “3FF0000000000000″, more commonly known as 1.0. So there is no conversion code at all: TT has constant-folded the I2D operation because its operand is the constant int 1.

T 11F4EA      DADD \
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1
336 fadd  321,333
338 sti   336, #8(8)

Now in the data stack view, the last value is “d:1″, which means double value 1.

Finally, we do a DADD, or add doubles. This uses LIR_fadd, the floating-point addition operator, on operands 321 (not shown here) and 333, the LIR way to refer to the 1.0 we loaded previously. Finally, we store the result on the stack. Note that we store it 8 units down from the result of I2D: #8(8) instead of #16(8). This is because the stack has one fewer 8-byte element, as DADD has popped two operands and pushed only one.

Here’s the state after this instruction:

T 11F4EC      DUP          -8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:11

The top of the stack is now 10+1=11. Way up above, I said that a loop header is considered hot and gets traced after 10 hits. And here we just incremented something by 1, and get 11. So even though there’s a ton of LIR before this, we know this is actually working on the variable “i” in the original program.

Next time: trace optimization.

Tamarin Tracing Interals, Part II: Forth

The Need for Forth Subroutines. I had a really hard time tracking down how TT adds a pair numbers (ActionScript code like “sum += i”) worked until I finally figured out that ECMAScript “+” is not a primitive operation in TT. This makes perfect sense now, as “+” is complicated: it has to do different things depending on the argument types (floating-point addition, string concatenation).

It would still be possible to make “+” an IL primitive implemented in C++, but it’s better not to, and TT doesn’t. The reason is that TT wants to be able to specialize the code for “+” when tracing. For example, if the arguments are doubles on the trace, then TT could emit only the floating-point addition code, which is faster and smaller than the general code. But this is hard to do right if “+” is an IL primitive. In that case, the tracer would need complex C++ code to perform the specialization for every primitive of this kind. And even the slightest difference between the tracer’s logic and the interpreted code would cause weird VM bugs. As Graydon told me, TT tries hard to avoid this kind of redundancy.

Instead, TT implements ECMAScript “+” (IL OP_add) as a subroutine of IL instructions. I’m not sure what the official TT terminology is, but extern is the word used in code identifiers. To execute an extern in interpreted mode, the system simply jumps to that subroutine, executes the IL instructions that carry out the case logic, and returns when done. In tracing mode, the system can trace and optimize the extern’s IL just as it does any other IL.

Basic Forth. Externs are written in Forth. (The Adobe guys explained about Forth in comments to my last post.) Forth is pretty close to IL, so the Forth compiler can be pretty simple (fc.py, 1900 lines of code). I don’t know Forth, but I did once do a bunch of HP-28S programming, which used a Forthlike language.

Mason Chang has some good pointers on Forth. I’ll also give a quick summary of what I found out. In Forth, the state of the system is pretty much just a stack, and Forth code is pretty much just a sequence of stack operations separated by spaces. The stack operations are called Forth words (”word” as in “command”, no relation to machine words). Take this code snippet:

0 i2d

“0″ is a Forth word that means “push 0 onto the stack”. “i2d” is another Forth word, a TT primitive operator that converts the top of the stack from an int to a double. So the total effect of this snippet is to push double 0.0.

We can package this as a Forth subroutine named “float0″ just like this:

: float0 0 i2d ;

“float0″ is now a Forth word, usable like any other. The colon and semicolon start and end a definition. (I think colon and semicolon are “officially” Forth words themselves, although the TT Forth compiler fc.py treats them more like syntax.)

Cases. Another interesting Forth feature is case “statements” (not sure what the correct term is). I think this is a typical feature but the exact way it is defined in TT is specific to TT. In TT, a Forth word can be defined as a case statement, which seems to be used mostly for dynamic dispatch. E.g.:

CASE: toboolvec ( xvalue bt — bool )
( 0 BoxedDouble ) d2b
( 1 BoxedNull ) no-op

( 8 BoxedInt ) i2b ;

This defines a new Forth word toboolvec, which converts a value to a boolean. (I think “vec” is a TT conventional ending for case-defined words.) TT Forth CASE is pretty tricky, and I had to look around for a while to figure out how what it really does.

The first thing to note is that “( … )” is a comment in Forth. The first comment in the CASE above is a stack comment, documenting the effect of the word on the stack. This word pops 2 inputs: xvalue, a boxed value to convert to a boolean, and bt, a boxtype constant that represents the type of xvalue. The word then pushed one value: the boolean representation of xvalue.

Second, unlike a C switch, these CASEs don’t represent the conditions for each case explicitly. Rather, a CASE pops the top value (which should be boxtype, here), which must be an integer k, and then executes the kth word of the body of the case. The comments in the case body document this relationship and make it a lot easier to read.

toboolvec is used by an easier-to-use word, tobool:

: tobool ( xvalue — bool ) boxtype toboolvec ;

boxtype ( x — x i ) pushes the boxtype constant indicating the type of the top of the stack, which must be a boxed value. So tobool just converts the top of the stack to a boolean.

OP_add. Now I actually know enough to understand how OP_add is defined. It’s defined in vm.fs in several pieces, but basically it uses the boxtype word to get the type of both operands, then case words that double dispatch to a type-specific addition operator. When adding two doubles the final addition operator is w_fadd. This is defined:

EXTERN: w_fadd f+ dbox ;

f+ is another name for DADD, which defined:

PRIM: DADD (( f0 f1 — fr ))
interp:{ fr = f0 + f1 ; }
trace:{ interp.trace_binop(LIR_fadd, sp, USE_Q, USE_Q); } ;

(Mason explained how primitives work.)

Sadly, it appears from reading the int+int case that when adding two ints, TT must promote them to doubles because ECMAScript requires numeric addition to work as if the operands were doubles. I wonder how much of a performance penalty this is for array-iteration code and if there are any opportunities to optimize this by figuring out when it’s safe to keep the int representation, perhaps by loop variable interval analysis or speculation…

Tamarin Tracing Internals, Part I

Tamarin (technically, tamarin-tracing, henceforth TT)-related projects keep peeking up at me from the horizon. First, there’s a good chance I’ll have an intern working on TT this summer. And then there’s this “Tracehydra” idea. It’s a way to connect Spidermonkey’s JS parser with the TT execution engine. This is the plan:

Where “profit” means “run Javascript really fast”. Tracehydra would be the fluffy cloud that translates Spidermonkey bytecode to Tamarin IL (or possibly LIR-the details get confusing fast). (In the interest of reducing confusion slightly, I’ll say that IL stands for intermediate language, and is roughly a synonym for bytecode. TT people often refer to their IL as “Forth” because they based the design on Forth or something, but I know nothing more about Forth than that it involves stacks, so that doesn’t help me.)

Specifically, Tracehydra means using Treehydra to translate the Spidermonkey (SM hereafter) C code that interprets each bytecode into C++ that emits equivalent (to the C) Tamarin IL. So I guess it reduces the problem from translating SM bytecode TT IL to translating SM C cases to TT IL-building code. Put that way, it’s not clear this actually helps, but I think SM bytecode is believed to have complex semantics that would be difficult to code in TT IL by hand, and maybe the C in SM has fewer constructs that are easier to translate. Seems possible, anyway.

If I’m gonna make any sense out of this I need to learn something about TT IL and how the TT VM uses it.

Digging in to Tamarin. There doesn’t seem to be a lot of documentation on TT, so I thought I’d write up whatever I managed to figure out, for my own benefit at least. By the way, I’ve probably gotten some things wrong, so any TT experts are highly encouraged to correct me.

I should mention that Chris Double’s Tamarin posts have been invaluable-they are the best source I know of that explains how to actually build and run TT. And Graydon Hoare’s diagrams and comments are what got me started on having any idea where to look for anything in the code.

My first question was, what is the “life cycle” of a program run by TT? The TT shell as it exists today runs ActionScript programs, specifically ABC files (ActionScript bytecode). Once the tracer kicks in, TT is running machine code traces (e.g. x86-64 ISA). (Ugh. This is turning in alphabet soup.) What goes on in between? Here’s what I found.

By the way, this picture gives an overview. (Picture does not exist yet.)

Program Form 0: ActionScript. I followed a sample program, the simplest program I could think of that will get traced (you need a hot loop), through TT. Here’s my program:

  var sum = 0;
  for (var i = 0; i < 1000000; ++i) {
     sum += i;   
  }
  print(sum);

By the way, on my MacBook, this runs in 1.7s in interpreted mode (tracing disabled) and .28s with tracing enabled, a 6x speedup. And that includes VM startup time. For comparison, Java runs the equivalent program in .13s. (But Java gets the answer “wrong”, because ActionScript uses unlimited-precision integers while Java uses int32s. So this test is unfavorable toward TT.) Excluding startup time, Java takes about 4ms, apparently, the same as C. I’ll have to retest TT excluding startup time, once I figure out how.

Program Form 1: ABC. This is the ActionScript bytecode and is the input to the current Tamarin shell. I don’t need to know too much about this, but the real Tamarin stuff is generated from here, so I peeked at the ABC for my program. ABC is apparently a stack-based bytecode language, much like Java bytecode. Here’s a snippet from my sample program, with my annotations after //:

  // var sum = 0;
  // sum was assigned local variable ’slot 0′ in ABC file headers
  // Stack starts as: [stuff]
  pushbyte 0
  // Pushed a 0 value onto the stack. Stack is now: [stuff] 0
  getglobalscope
  // Pushed the global ‘this’ onto the stack. Stack is now: [stuff] 0 this
  swap
  // Swapped top 2 elements of stack. Stack is now: [stuff] this 0
  setslot 1
  // Stored top of stack into slot 1 of next stack element. Stack is now: [stuff]

I’m not entirely sure how the compiler writers manage to get all swap, over, dup, pick3, etc. operators right, but the code itself is understandable enough.

Transformation 0->1: ActionScript->ABC. This can be done externally to TT by the ActionScript Compiler, ASC, which is part of the Flex SDK.

Program Form 2: IL. Tamarin IL is yet another stack-based bytecode. This is the IL that the VM executes directly when in interpreter mode. The basic operations include:

  • Stack manipulators, such as DUP, DROP, OVER,
  • Arithmetic and logical operators, such as IADD,
  • Control flow operators, such as LBRT,
  • Object and variable storage operators, such as SETSLOTVALUE_I,
  • Interpreter internals operators, such as _debugenter, verify_x, and
  • Weird stuff like op_ROTNAME2_SWAP_ROT_SETRT.

I think the weird stuff might be “superinstructions”, which are instructions that just implement a short sequence of basic instructions. Apparently they help interpreters run faster because by reducing decode overhead, like old CISC processors.

These instructions are defined in files named core/vm_*.h, e.g., vm_min_interp.h. These files are heavily macroized, which allows the actual meaning to be controlled by defining those macros in different ways, although I’m not sure exactly how this feature is used yet. Here’s IADD:

  INTERP_FOPCODE_INTERP_BEGIN(IADD)
    /* IADD None {’stktop’: 0} */
    const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
   
    INTERP_ADJUSTSP(-1)
    sp[0].i = tmp_i_0;
    INTERP_INVALBOXTYPE(sp[0])
  INTERP_FOPCODE_INTERP_END(IADD)

As you can see, it adds the top two elements of the stack (sp) as integers, cuts the top element off the stack, and stores the sum in the new stack top. There’s also some junk about invalidating the box type, which seems to be some kind of debugging feature. When this is included in VMInterp.cpp, the begin and end macros will turn it into a switch case. From VMInterp.ii in my build:

  foplabel_IADD: { pre_interp(interp, f, ip+-1, sp, rp);
    const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;
    sp += (-1);
    sp[0].i = tmp_i_0;
    do { } while (0);
  goto *k_foplabels_interp[*ip++]; }

pre_interp just prints out a trace of the instruction if the interpreter is in verbose mode and a bunch of other conditions are true. The “computed goto” at the end is an indirect jump to the case for the next instruction. This is some kind of optimization but I’ve never gotten a really convincing answer as to why it works, or if in fact it works, so I won’t go into it here. Processor experts, feel free to educate me.

The main control flow operators (i.e., used for if) seem to be LBRT and LBRF (local branch if true/false?). LBRT branches to a selected point in the IL sequence if the top of the stack is true, interpreted as a boolean. The target is specified as an offset from the address of the LBRT instruction. The target is given in the IL stream in the 2 16-bit units immediately following the LBRT code.

I also wanted to know what a function call looks like in IL, and it was surprisingly hard to figure out for some reason that I still don’t understand. But it looks like a standard call to a JS function is through an opcode w_callprop_only or w_callprop_argcok. These opcodes don’t seem to be defined in the usual vm_*_interp.h, but somehow they are made to branch to foplabel_TRACE_super_or_extern in VMInterp.ii. That code does the usual saving of a return pointer and setting the instruction pointer. Returning is accomplished with a less mysterious w_returnvalue or w_returnvoid opcode.

The type of the instructions is FOpcode, which is a 16-bit int.

Transformation 1->2: ABC->IL. This is where Tamarin starts. This step is really important because some other system, like Tracehydra, that wants to use Tamarin, should basically do the same thing, except for some other form of input instead of ABC.

Tamarin performs this transformation on one method at a time, which is typical of JITs. The transformation is perfomed by Verifier::verify, which simultaneously verifies the ABC (checks for ill-formed ABC, like the Java bytecode verifier) and outputs Tamarin IL. The entry point to the verifier is apparently  Interpreter::verify_x, which is just the implementation of an IL instruction also called verify_x.

That last part was probably pretty confusing. It surprised me, at least. It means the interpreter is already running IL by the time it actually translates and runs any ABC. I think what this means is that when the shell starts up, it starts the interpreter with a bit of boilerplate IL. The IL itself has code to verify and run the ABC program.

[Warning: compiler-expert-level material.] The verifier works as an abstract interpretation of the ABC. It uses the information gathered both to check for problems with the ABC and to guide IL generation. The state of the abstract interpretation is an abstraction of the ABC stack, just a list of types of objects on the stack, along with a lot of flags describing other conditions.

A standard abstract interpreter works as a fixed-point solver that possibly makes several iterations over the code. But TT’s verifier doesn’t work like that at all. The reason is that it requires that the states be equal at every join point, otherwise the ABC is invalid. So if the ABC is valid, then the interpreter never needs to look at a given instruction twice. And because backward branches in the ABC exist only for loops, this means the verifier can just run in a single forward pass through the ABC bytecode sequence. But it’s still really an abstract interpreter. Pretty neat. [End super-hard stuff.]

So the verifier runs over the ABC in sequence, always tracking the current abstract stack state. For each instruction, it generates IL. Part of generating the IL is generating any IL needed to fetch operands-the verifier can use the stack state to figure out where they are. After generating IL, the verifier simulates the effect of the instruction on the stack. This is all handled by a big switch on the ABC opcode, and each case has separate logic to generate IL and simulate the results.

The actual writing of the bits and bytes of IL is delegated to a class MethodWriter, which in turn delegates to ForthWriter (there’s that Forth stuff again). ForthWriter is a small class that maintains a buffer of IL bytecodes. It has a small API with methods like emit_simple(), which emits a simple IL instruction. MethodWriter is a wrapper used to generate Forth from ABC. The MethodWriter API takes ABC instructions as input and then tells the ForthWriter to emit the corresponding Forth bytecodes. For ABC opcodes with a direct IL equivalent, it looks up the equivalent in k_abc_opcode_map, which ultimately comes from vm_*_codepool.h. Otherwise it just has to work a little harder.

Time for a little example. I’ll use the same snippet I used above, with the IL translation trace straight out of the log file:

                  frame: global * * global
  2:pushbyte 0
+000D LITC0
+000E w_ibox
                  frame: global * * global int
  4:getglobalscope
+000F OVER
                  frame: global * * global int global
  5:swap
+0010 SWAP
                  frame: global * * global global int
  6:setslot 1
+0011 LITC4
+0012 w_ck_setslot_box
                  frame: global * * global

I had no idea what any of this was the first time I looked at it. There are 3 kinds of lines. First, the “frame: ” lines are the abstract interpreter’s stack state (”stack frame”) at each point. Second, the lines that start with numbers are ABC bytecodes. Third, the lines that start with “+” are the IL translation of the ABC code just above them.

The first bytecode:

                  frame: global * * global
  2:pushbyte 0
+000D LITC0
+000E w_ibox
                  frame: global * * global int

‘pushbyte 0′ in ABC pushes the value 0 onto the stack. In IL, this is two steps: first, an opcode that pushes the literal C int value 0 (i.e., a machine word of 0 bits), then an opcode that boxes that C int into a Box. Box is Tamarin’s standard value type that can hold anything. To C, a Box looks like an IEEE-754 64-bit NaN. I guess any float that has an exponent (bits 52-62) of all 1s and a fraction that is not all 0s is a NaN, so there are really 2^53-1 different NaN values. Tamarin cleverly uses only one of those in floating-pointer computations so it can pack other data types, like 32-bit ints, into the 2^53-2 spare NaNs. A Box starting with the bit pattern 1111111111000 is an int, and the rightmost 32 bits contain the int data. See Box.h.

In our example, note how the verifier knows the top of the stack now holds an int.

The second bytecode:

                  frame: global * * global int
  4:getglobalscope
+000F OVER
                  frame: global * * global int global

This is kind of interesting. An object-model-type ABC instruction, getglobalscope (pushes scope to use for unqualified name lookups), got translated into a stack manipulation IL instruction, OVER (pushes the value just under top of the stack). Apparently, the global ‘this’ scope goes on the stack at the start of a method, and the interpreter records its position. If no other scopes have been entered (no with blocks), then the verifier can just emit an instruction to pick it from its current depth in the stack, since of course the interpreter also always knows the stack size. If there are other scopes in play, then the verifier emits a w_getouterscope instruction, which calls the interpreter C method Interpreter::getscopeobj, which goes through a few levels of direction but is ultimately pretty simple, grabbing a scope chain for the current method, and then picking off the first scope. Well, I really don’t know what that means, because I’m not too clear on scope chains yet, but it’s only a few lines of code, anyway.

That’s enough for now.