<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>David Mandelin's blog</title>
	<atom:link href="http://blog.mozilla.com/dmandelin/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/dmandelin</link>
	<description>Just another Blog.mozilla.com weblog</description>
	<pubDate>Thu, 28 Aug 2008 02:57:06 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6</generator>
	<language>en</language>
			<item>
		<title>Inline threading, TraceMonkey, etc.</title>
		<link>http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/#comments</comments>
		<pubDate>Thu, 28 Aug 2008 02:57:06 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[spidermonkey]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=20</guid>
		<description><![CDATA[It&#8217;s been a long time since I&#8217;ve posted here. I wanted to post some interesting results about speeding up SpiderMonkey using inline threading, but it turned out to be really hard and took a long time to get close enough to &#8220;interesting results&#8221;. At last, my patch is good enough to run SunSpider, and runs it [...]]]></description>
			<content:encoded><![CDATA[<p>It&#8217;s been a long time since I&#8217;ve posted here. I wanted to post some interesting results about speeding up SpiderMonkey using inline threading, but it turned out to be really hard and took a long time to get close enough to &#8220;interesting results&#8221;. At last, my patch is good enough to run SunSpider, and runs it 8% faster than baseline (non-tracing trunk SpiderMonkey from a few weeks ago), 10-20% faster on favorable benchmarks, and 48% faster on 3bit-bits-in-byte. So that&#8217;s pretty cool.</p>
<p>Of course, 8% looks pretty puny next to the huge gains of <a href="http://ejohn.org/blog/tracemonkey/">TraceMonkey</a> (congrats to <a href="http://andreasgal.com/2008/08/22/tracing-the-web/">those guys</a>, by the way). But I&#8217;m assured that interpreter speedups still count, so I&#8217;m chugging along. (Side note: inline threading gives a 22% speedup on SunSpider&#8217;s access-fannkuch benchmark, which has proved difficult to optimize with tracing.) (Side note 2: I&#8217;ve been told that TraceMonkey will hugely speed up our <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=430328">static analysis scripts</a>, maybe 10x or so, which is great news.)</p>
<p>Insane, gory detail on inline threading, related optimizations, and detailed performance analyses can be found in <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=442379">bug 442379</a>. I thought I&#8217;d go over the key ideas in this post:</p>
<p><strong>Inline threading. </strong>Basically, this is yet another interpreter opcode dispatch technique. I previously wrote about <a href="http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/">opcode dispatch</a>, concluding that direct threading, in which the opcode is a target address and the code to start the next op is a single indirect jump instruction, is the ultimate efficient dispatch mechanism. It turns out I was wrong.</p>
<p>Inline threading is the &#8220;best&#8221;, because it gets dispatch down to 0 instructions. The idea is to create a buffer, and copy into it the native code for each opcode to be executed. For example, for a function body like &#8220;return a+3&#8221;, the opcodes are: JSOP_GETARG (0), JSOP_INT8 (3), JSOP_ADD, JSOP_RETURN. To inline thread this, we create a buffer and fill it with native code like this, using memcpy:</p>
<p>code for JSOP_GETARG<br />
code for JSOP_INT8<br />
code for JSOP_ADD<br />
code for JSOP_RETURN</p>
<p>It&#8217;s like a really crude form of JIT compilation.</p>
<p>To run the function, we just jump to the start of the buffer, and then it all runs, with no further dispatch code. I&#8217;ve found that an average SpiderMonkey op executes about 35 instructions, so inline threading removes the 4 instructions for indirect threading, reducing this to 31, and should speed up SM by about 11%. Nice!</p>
<p>For more info on inline threading, see <a href="http://www.google.com/url?sa=t&amp;source=web&amp;ct=res&amp;cd=8&amp;url=http%3A%2F%2Fwww.sable.mcgill.ca%2Fpublications%2Fpapers%2F2003-2%2Fsable-paper-2003-2.pdf&amp;ei=vhK2SPn4Eo6usAODrozoBg&amp;usg=AFQjCNH5LmdRjGzfb38MEbwmuGSZwcEt9A&amp;sig2=JIYHlp6oZwG2s_LhumYExA">SableVM</a> and <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.8829">this paper</a>.</p>
<p><strong>Hard Stuff.</strong> The only problem is that what I just described doesn&#8217;t actually work. For one, the compiler doesn&#8217;t necessarily compile each SpiderMonkey op handler into a single block of code. In fact, it usually reorders things a bunch, to help with code locality and reduce the number of jump/branch instructions executed on hot paths. So those are too hard to inline. (It could be done by disassembling parts of SpiderMonkey, analyzing the results, and doing code layout again, but that&#8217;s a bit much.)</p>
<p>Also, because most jump, branch, and call instructions express their targets using a relative address (an offset from the IP of the following instruction), any jump outside our little inline-threaded buffer becomes a jump to hell. So those would all have to be identified and patched, again with a disassembler.</p>
<p>In general, small ops that don&#8217;t call functions, generate errors, or have much internal control flow <em>can</em> be inlined, but everything else can&#8217;t. Fortunately, &#8220;small ops&#8221; include some really common ones like JSOP_GETVAR and JSOP_POP, but the technique can&#8217;t be used without some special treatment for the big ops.</p>
<p><strong>Call Threading/Context Threading. </strong>For big ops, I used something called <em>call threading</em> or <em>context threading</em>. Call threading has been used to produce big speedups on some interpreters, but turns out not to help at all for SpiderMonkey. But it can handle big ops in an inline-threading system, and after a lot of work, I at least got it to not slow down SpiderMonkey.</p>
<p>The idea of call threading is to create a native buffer, but fill it with calls to the opcode handlers instead of copying the whole handlers in. With the previous example, it gives you:</p>
<p>call &amp;JSOP_GETARG<br />
call &amp;JSOP_INT8<br />
call &amp;JSOP_ADD<br />
call &amp;JSOP_RETURN</p>
<p>Those are x86 call instructions. For this to work, the opcode handlers have to end with &#8216;ret&#8217; instructions. This means dispatch is 2 instructions, which is better than indirect threading&#8217;s 4, and I think better even than direct threading because that&#8217;s really 3 instructions if you count getting the op, which I should have before. Also, these call and ret instructions are highly predictable (99.9%+ prediction rate), unlike the indirect jumps used by direct and indirect threading, which are very unpredictable. Since a mispredict is very costly (~16 cycles on Core 2, I think), this gives a big speedup, 30% or so on some interpreters.</p>
<p>Except on SpiderMonkey, where unfortunately it doesn&#8217;t work and also slows things down. This was very frustrating.</p>
<p>The reason it doesn&#8217;t work is that when you execute a &#8216;call&#8217; instruction, you push the return address onto the stack, decrementing the stack pointer (%esp) by 4. The opcode handler will then crash if it calls a function. There are actually several reasons why, but the most important is that on most systems, the stack pointer has to be 16-byte-aligned when you call a function, and that -4 puts it off.</p>
<p>To make it work, I add extra code to unpush the return address right after the call, and then unpop it back right before the return. This works, but now we&#8217;re up to 4 instructions, so we haven&#8217;t saved any instructions over indirect threading. (I would love a better solution, but I couldn&#8217;t think of one.)</p>
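<p>The basic shape of call threading, before any of the stack fixups, can be approximated in portable C. This is only a sketch under loud assumptions: C can&#8217;t emit native call instructions into a buffer, so an array of function pointers stands in for the buffer of calls (real call threading emits direct calls, which is what makes them so predictable). The VM struct, the handler names, and the operand passing are all made up for illustration and are not SpiderMonkey&#8217;s.</p>

```c
#include <assert.h>
#include <stddef.h>

#define STACK_MAX 16

/* Hypothetical interpreter state: operand stack, arguments, result. */
typedef struct {
    int stack[STACK_MAX];
    int sp;            /* number of values currently on the stack */
    const int *args;   /* arguments of the function being run */
    int result;
    int done;
} VM;

typedef void (*Handler)(VM *vm, int operand);

/* One entry of the "buffer": the handler to call and its operand. */
typedef struct {
    Handler fn;
    int operand;
} Call;

static void push(VM *vm, int v) { assert(vm->sp < STACK_MAX); vm->stack[vm->sp++] = v; }
static int pop(VM *vm) { assert(vm->sp > 0); return vm->stack[--vm->sp]; }

/* Each handler ends by returning to the caller, like a 'ret'. */
static void op_getarg(VM *vm, int slot) { push(vm, vm->args[slot]); }
static void op_int8(VM *vm, int value) { push(vm, value); }
static void op_add(VM *vm, int unused) {
    (void)unused;
    int b = pop(vm);
    int a = pop(vm);
    push(vm, a + b);
}
static void op_return(VM *vm, int unused) {
    (void)unused;
    vm->result = pop(vm);
    vm->done = 1;
}

/* Run the buffer: call each handler in order until one sets done. */
int run_calls(const Call *code, size_t n, const int *args) {
    VM vm = { .sp = 0, .args = args };
    for (size_t i = 0; i < n && !vm.done; i++)
        code[i].fn(&vm, code[i].operand);
    return vm.result;
}

/* The running example: "return a+3". */
int a_plus_3(int a) {
    static const Call code[] = {
        { op_getarg, 0 }, { op_int8, 3 }, { op_add, 0 }, { op_return, 0 },
    };
    return run_calls(code, sizeof code / sizeof code[0], &a);
}
```

<p>Calling a_plus_3(4) walks the four entries in order, and there is no opcode fetch or table lookup anywhere: the sequence of calls is the program.</p>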
<p>Next, it turns out that in SpiderMonkey, due to clever optimization by Igor &amp; Co., the branch prediction rate is already excellent in practice: 80-100% on benchmarky-type programs. (The trick is to make sure there is a separate indirect jump going out of the end of each opcode, so the processor can predict each one independently. Also, the Core 2 indirect branch predictor seems very smart.) So branch mispredicts are only costing an average 0-3 cycles per op. SpiderMonkey takes about 28 cycles per op, so eliminating the mispredicts entirely gives an estimated speedup of only 0-12%.</p>
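<p>The estimate above is simple arithmetic, sketched here under the assumption of exactly one indirect dispatch branch per op and a flat mispredict penalty. The function names are made up; the inputs in the note below are the figures quoted in this post (80-100% prediction rate, about 16 cycles per mispredict, about 28 cycles per op).</p>

```c
#include <assert.h>

/* Average cycles lost to dispatch mispredicts per op, assuming one
 * indirect branch per op and a fixed penalty per mispredict. */
double mispredict_cycles_per_op(double prediction_rate, double penalty_cycles) {
    assert(prediction_rate >= 0.0 && prediction_rate <= 1.0);
    return (1.0 - prediction_rate) * penalty_cycles;
}

/* Upper bound on the fraction of per-op time that perfectly
 * predictable dispatch could recover. */
double max_speedup_fraction(double prediction_rate, double penalty_cycles,
                            double cycles_per_op) {
    assert(cycles_per_op > 0.0);
    return mispredict_cycles_per_op(prediction_rate, penalty_cycles) / cycles_per_op;
}
```

<p>At an 80% prediction rate this comes out a little above 3 cycles per op, or roughly 11% of 28 cycles, and at 100% it is zero, which is consistent with the ranges quoted above.</p>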
<p>Last, the really tragic thing is that the changes I made to make SpiderMonkey do call threading make GCC shoot itself in the face. For reasons I don&#8217;t entirely understand, GCC compiles the op handlers &#8220;differently&#8221;, so they run at least 5% slower. It took some doing to figure out how to keep the slowdown down to 5%, which makes call threading about even with indirect threading. That&#8217;s pretty disappointing, but at least it means I can call thread the big ops without penalty (on average), and then inline the small ops, getting some speedup. With the example code, I generate:</p>
<p>code for JSOP_GETARG<br />
code for JSOP_INT8<br />
call &amp;JSOP_ADD<br />
call &amp;JSOP_RETURN</p>
<p>(The optimizer issue is incredibly arcane, but I think I traced the problem to an optimization pass called &#8220;gcse2-post-reload&#8221;, which is really partially-redundant-load elimination after register allocation. Any kind of <a href="http://en.wikipedia.org/wiki/Partial_redundancy_elimination">partial redundancy elimination</a> (PRE) can slow code down. The slowdown effect should be mitigated when using profile-guided optimization (PGO), but I couldn&#8217;t get GCC PGO to work on SpiderMonkey to test that theory.)</p>
<p>For more on call threading, see this <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.1271">paper</a>, keeping in mind that the results would probably differ on Core 2 because of its presumed better indirect branch predictor.</p>
<p><strong>Inlining-enabled optimizations.</strong> Inline threading some stuff and call threading the rest doesn&#8217;t yield exciting gains for SpiderMonkey, maybe 2% overall and 5-20% on the &#8220;good&#8221; benchmarks. (I guess that&#8217;s not too bad, the 2% overall just seems really weak.) But inline threading enables a bunch of other cool optimizations that could speed things up more. I&#8217;ve only done the easy ones so far, and that got the 48% speedup on 3bit.</p>
<p>Easy optimization #1 is to stop updating the SpiderMonkey PC. The call and inline threaded code tells what op handler to run next, so the PC is unnecessary. Actually, some ops do use the PC, and I&#8217;d like to make them not do that someday, but in the meantime I can just generate code to set the PC in my native code buffer:</p>
<p>mov 0, [pc]<br />
code for JSOP_GETARG<br />
mov 2, [pc]<br />
code for JSOP_INT8<br />
call &amp;JSOP_ADD<br />
mov 5, [pc]<br />
call &amp;JSOP_RETURN</p>
<p>It looks like I only saved the PC update for JSOP_ADD, but it&#8217;s better than that because the standard code has to load the PC, increment it, and then store it back, which I&#8217;ve replaced with just one instruction. The PC optimization is relatively easy and saves 0-3 instructions per op, which means a 0-10% speedup by itself. And it actually works.</p>
<p>&#8220;Easy&#8221; optimization #2 is to specialize certain opcodes. Take JSOP_INT8 as an example. This op pushes an integer onto the interpreter stack. That&#8217;s 3 or 4 instructions, to manipulate the stack pointer and store a value to it. But it also has to get the value out of the opcode stream and convert it to a jsval (the tagged value type of the interpreter), so it&#8217;s actually 9 instructions. (And JSOP_INT32 is much worse because it has to fetch 4 bytes from the opcode stream and shift and or them together.) But if we&#8217;re inlining JSOP_INT8, we can just inline the 3 or 4 instructions using the actual value, a version of JSOP_INT8 &#8220;specialized&#8221; for the given value. I do the specialization by writing the op in assembly code with a dummy value (0xdeadbeef), finding the location of the dummy as part of interpreter startup, and then patching that location as I inline. This seems crazy, but all the other design options seem crazy in their own ways, so I went with it for now. With specialization in play, the example looks like this:</p>
<p>code for JSOP_GETARG specialized for slot 0<br />
code for JSOP_INT8 specialized for jsval for 3<br />
call &amp;JSOP_ADD<br />
mov 5, [pc]<br />
call &amp;JSOP_RETURN</p>
<p>Because JSOP_GETARG and JSOP_INT8 use the PC only to get their arguments, which specialization bakes in, we get to remove some more PC updates too.</p>
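<p>The dummy-value trick can be sketched on a plain byte buffer. This is only an illustration under assumptions: the function names are made up, the &#8220;template&#8221; is just an array of bytes rather than real handler code, and nothing here is executed as native code. It shows the two steps described above: find the 0xdeadbeef placeholder once, then patch it each time the template is copied.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PLACEHOLDER 0xdeadbeefu

/* Find the byte offset of the 32-bit placeholder inside a code template,
 * or return -1 if it is not there. Meant to run once, e.g. at startup. */
long find_placeholder(const uint8_t *tmpl, size_t len) {
    uint32_t needle = PLACEHOLDER;
    for (size_t i = 0; i + sizeof needle <= len; i++)
        if (memcmp(tmpl + i, &needle, sizeof needle) == 0)
            return (long)i;
    return -1;
}

/* Copy the template into out and overwrite the placeholder with value.
 * Returns the number of bytes written. */
size_t emit_specialized(uint8_t *out, const uint8_t *tmpl, size_t len,
                        long patch_offset, uint32_t value) {
    assert(patch_offset >= 0);
    assert((size_t)patch_offset + sizeof value <= len);
    memcpy(out, tmpl, len);
    memcpy(out + patch_offset, &value, sizeof value);
    return len;
}
```

<p>One caveat this sketch makes visible: the search assumes the placeholder bytes don&#8217;t also occur by accident elsewhere in the template, which is part of why the approach feels a little crazy.</p>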
<p><strong>Compiler nitpickery.</strong> In itself, all this stuff works, and it&#8217;s been used in various research projects before, and probably some other interpreters by now. But this particular implementation of it depends a fair amount on what the compiler does: at least (a) that the compiler doesn&#8217;t reorder small ops too insanely, (b) that you can take the address of labels (to get the start and end of op handlers), (c) patching in some &#8216;ret&#8217; instructions at runtime (needed to solve an obscure problem I didn&#8217;t feel like discussing here), and (d) that the compiler doesn&#8217;t deoptimize too badly when you do all this. That seems possible, but it&#8217;s not clear yet that this will play well across different versions of GCC and ICC. (Crazy side note: in my tests, GCC 4.2 on Mac regresses SpiderMonkey SunSpider by 5% vs. GCC 4.0.)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/08/27/inline-threading-tracemonkey-etc/feed/</wfw:commentRss>
		</item>
		<item>
		<title>SquirrelFish</title>
		<link>http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/#comments</comments>
		<pubDate>Wed, 04 Jun 2008 01:46:39 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=19</guid>
		<description><![CDATA[If you&#8217;re reading this, chances are that you already know about SquirrelFish, Apple/WebKit&#8217;s new JavaScript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, 46% faster than Spidermonkey and 52% faster than TT (Tamarin Tracing) on SunSpider.
Clearly we have some work to do. The plan is to improve TT so that [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re reading this, chances are that you already know about <a href="http://webkit.org/blog/189/announcing-squirrelfish/">SquirrelFish</a>, Apple/WebKit&#8217;s new JavaScript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, <a href="http://summerofjsc.blogspot.com/2008/06/squirrelfish-has-landed.html">46% faster than Spidermonkey</a> and <a href="http://www.satine.org/archives/2008/06/03/squirrelfish-is-faster-than-tamarin/">52% faster than TT (Tamarin Tracing)</a> on SunSpider.</p>
<p>Clearly we have some work to do. The plan is to improve TT so that hot loops run highly optimized native code; TT&#8217;s optimizer is in the early stages, and we think there&#8217;s a lot of room for further optimization. For example, as explained in my previous posts, when TT jumps from one trace to another, it has to save the interpreter state to a standard format and then reload the state on the new trace, and no cross-trace optimization is possible. Ideas like <a href="http://andreasgal.com/2008/02/28/tree-folding/">tree folding</a> have potential for a big improvement here.</p>
<p>But I still wanted to take a quick peek into why SquirrelFish&#8217;s interpreter is so fast. <a href="http://webkit.org/blog/189/announcing-squirrelfish/">The announcement</a> touts 3 key design features:</p>
<ul>
<li>Using a <strong>bytecode</strong> interpreter.</li>
<li>Using <strong>direct-threaded</strong> interpreter dispatch.</li>
<li>Using a <strong>register-based</strong> bytecode interpreter.</li>
</ul>
<p><strong>Bytecode interpreter. </strong>The old WebKit JS was based on an AST walk, which is explained in some detail in the announcement. It&#8217;s well-known that bytecode interpreters are faster, so it&#8217;s no surprise Apple made this change. Spidermonkey and TT&#8217;s interpreter have always been bytecode interpreters.</p>
<p><strong>Direct-threaded dispatch.</strong> Interpreters tend to spend a lot of time dispatching bytecode operations. Direct-threaded dispatch is a technique for efficient dispatch.</p>
<p>The obvious way to write a bytecode interpreter is with a switch statement inside a loop:</p>
<p>void run(Bytecode *ip) {<br />
for (;;) {<br />
switch (*ip++) {<br />
case OP_ADD: &#8230;<br />
case OP_JUMP: &#8230;</p>
<p>Each iteration runs one bytecode instruction. Each case of the switch handles one instruction type. It really doesn&#8217;t look like there&#8217;s any room for improvement here unless you look at the assembly code generated for the switch dispatch (from a tiny test interpreter I wrote today, comments added by me):</p>
<p># ip is in %edx<br />
# Check that switch expression (%edx) is in table range<br />
cmpl    $9, (%edx)<br />
leal    4(%edx), %ecx<br />
ja    L26<br />
# Look up case address offset for (%edx) in table (L37)<br />
movl    (%edx), %eax<br />
movl    L37-"L00000000006$pb"(%ebx,%eax,4), %eax<br />
# Add base address to offset<br />
addl    %ebx, %eax<br />
# Indirect jump to computed address<br />
jmp    *%eax</p>
<p>The basic idea is to keep a table of relative offsets to the cases, and then jump using that offset. Because the switch expression could evaluate to anything, the compiler must first generate a range check, so that if the switch expression doesn&#8217;t map to any case, the program leaves the switch instead of crashing unpredictably.</p>
<p>But in reality the range check is useless, because the interpreter can control what bytecodes actually appear, so this is 3 wasted instructions. Also, the lookup and base+offset address computation seems kind of clunky. I&#8217;d prefer dispatch code something like this:</p>
<p>jmp    *%edx</p>
<p>This is direct threading. In principle, the idea is simple: the instruction code (e.g., OP_ADD) is the address of the case target code. (In the basic design, instruction codes are integers.) Then, this jump is all you need for dispatch.</p>
<p>Coding up direct threading is weird; normal C compilers don&#8217;t know how to do it. But it can be done with GCC&#8217;s computed goto extension (see the paper on direct threading from the SquirrelFish announcement). See also my tiny interpreter.</p>
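<p>Here is a minimal sketch of direct threading with the computed goto extension (GCC, and also Clang). It is not the tiny interpreter mentioned above: the three-opcode instruction set, the names, and the fixed-size buffers are assumptions made for illustration. Because label addresses are only visible inside the function, the sketch first translates integer opcodes into handler addresses, then runs the translated code.</p>

```c
#include <assert.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* hypothetical opcode set */
#define MAX_CODE 64

/* Runs n words of integer bytecode; OP_PUSH is followed by its operand.
 * No stack overflow checks: this is a sketch, not a robust interpreter. */
intptr_t run_direct(const intptr_t *bytecode, int n) {
    static void *const labels[] = { &&op_push, &&op_add, &&op_halt };
    void *threaded[MAX_CODE];
    intptr_t stack[16];
    intptr_t *sp = stack;
    void **ip = threaded;

    /* Translate: replace each opcode with the address of its handler. */
    assert(n <= MAX_CODE);
    for (int i = 0; i < n; i++) {
        assert(bytecode[i] >= 0 && bytecode[i] <= OP_HALT);
        threaded[i] = labels[bytecode[i]];
        if (bytecode[i] == OP_PUSH) {
            assert(i + 1 < n);
            i++;
            threaded[i] = (void *)bytecode[i];   /* operand stays inline */
        }
    }

    goto **ip++;   /* dispatch: fetch the next handler address and jump */

op_push:
    *sp++ = (intptr_t)*ip++;
    goto **ip++;
op_add:
    sp--;
    sp[-1] += sp[0];
    goto **ip++;
op_halt:
    return sp[-1];
}
```

<p>Note that every handler ends with its own copy of the dispatch jump, and there is no range check and no table lookup in the run phase.</p>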
<p>I believe TT and Spidermonkey use an intermediate design called indirect threading, which gets most of the speedup of direct threading, but allows integer opcodes. The opcode is an index into a table of case target addresses. So the dispatch code has to look up the case target, then jump, something like:</p>
<p>movl   TABLE(,%edx,4), %eax<br />
jmp   *%eax</p>
<p>Not too bad. I have no idea how significant the difference between direct and indirect threading is in practice, but even a few percent speedup would be great.</p>
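<p>For comparison, a sketch of indirect threading using the same GCC/Clang extension. The opcodes stay small integers and each dispatch does a table lookup before the jump. Again, the opcode set and names are assumptions for illustration, not TT&#8217;s or Spidermonkey&#8217;s code.</p>

```c
#include <assert.h>
#include <stdint.h>

enum { OP_PUSH, OP_ADD, OP_HALT };   /* hypothetical opcode set */

/* Runs byte-sized opcodes; OP_PUSH is followed by a one-byte operand.
 * The program is trusted: opcodes must be valid and end with OP_HALT. */
intptr_t run_indirect(const unsigned char *ip) {
    static void *const table[] = { &&op_push, &&op_add, &&op_halt };
    intptr_t stack[16];
    intptr_t *sp = stack;

    assert(ip != 0);

/* Dispatch: look up the handler for the next opcode, then jump to it. */
#define DISPATCH() goto *table[*ip++]

    DISPATCH();

op_push:
    *sp++ = *ip++;
    DISPATCH();
op_add:
    sp--;
    sp[-1] += sp[0];
    DISPATCH();
op_halt:
    return sp[-1];

#undef DISPATCH
}
```

<p>The bytecode can be stored as is, with no translation pass, at the cost of one extra memory load per dispatch.</p>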
<p><strong>Register-based interpreter.</strong> &#8220;Traditional&#8221; interpreter design keeps all operands and intermediate values of expressions on a data stack. (This is completely different from the call stack.) I can think of a few design reasons why this might be, but I really don&#8217;t know why it&#8217;s been done this way. Anyway, with a stack, bytecode for a line of code like &#8220;x = a + b + c&#8221; looks like this:</p>
<p>LOAD a  # push a onto stack<br />
LOAD b<br />
ADD      # pop 2 elements, add them, push the result<br />
LOAD c<br />
ADD<br />
STORE x</p>
<p>One nice thing about this is that the algorithm for generating this bytecode is very simple, so it&#8217;s not a lot of work to code and runs fast. But there&#8217;s a lot of code just to manipulate the stack here, and it might be nice to avoid it. (Real stack-based programs have even more stack manipulation, with DUP, SWAP, ROT, etc. operations.)</p>
<p>A register-based design avoids the stack by using a fixed array of &#8220;slots&#8221; or &#8220;registers&#8221; for operands. (These registers are completely different from assembly code registers.) The register-based version of the above line of code would be something like:</p>
<p>ADD temp, a, b   # temp = a + b<br />
ADD x, temp, c</p>
<p>Keep in mind that in the register-based bytecode, where I have &#8220;x&#8221; it would actually have something like &#8220;0&#8221;, if x had been assigned slot 0 in the register table. Note also that the bytecode generator has to analyze the code a bit to decide how many registers are needed and assign them to the variables and temporaries.</p>
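<p>To make the comparison concrete, here is a sketch of both designs running &#8220;x = a + b + c&#8221; with plain switch dispatch. The opcode names, slot numbers, and the int-only values are assumptions for illustration. Each function returns how many instructions it dispatched, not counting the final halt.</p>

```c
#include <assert.h>

/* Stack design: operands are implicit, values live on a data stack. */
enum { S_LOAD, S_ADD, S_STORE, S_HALT };

int run_stack(const int *code, int *vars) {
    int stack[16];
    int sp = 0;
    int dispatched = 0;
    for (;;) {
        switch (*code++) {
        case S_LOAD:
            assert(sp < 16);
            stack[sp++] = vars[*code++];
            break;
        case S_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case S_STORE:
            assert(sp >= 1);
            vars[*code++] = stack[--sp];
            break;
        default:               /* S_HALT */
            return dispatched;
        }
        dispatched++;
    }
}

/* Register design: each instruction names its destination and sources. */
enum { R_ADD, R_HALT };

int run_register(const int *code, int *regs) {
    int dispatched = 0;
    for (;;) {
        switch (*code++) {
        case R_ADD:
            regs[code[0]] = regs[code[1]] + regs[code[2]];
            code += 3;
            break;
        default:               /* R_HALT */
            return dispatched;
        }
        dispatched++;
    }
}
```

<p>With slots a=0, b=1, c=2, x=3, temp=4, the stack program is LOAD 0, LOAD 1, ADD, LOAD 2, ADD, STORE 3 (6 dispatches) and the register program is ADD 4,0,1 then ADD 3,4,2 (2 dispatches), but each register instruction does three indexed accesses.</p>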
<p>Each design has advantages and disadvantages:</p>
<ul>
<li><strong>Bytecode code size.</strong> Stack-based bytecode programs tend to be smaller, because the operands are implicit. But the register-based program gets to omit the stack manipulation instructions, so the stack-based bytecode should be only a little bit smaller. The memory savings shouldn&#8217;t matter in most environments, but reading less bytecode will save the interpreter a bit of time.</li>
<li><strong>Bytecode instruction count.</strong> As shown above, a register-based program will have fewer bytecodes. Fewer bytecodes to execute means faster run times, if it takes the same length of time to execute a bytecode in each design. Which it doesn&#8217;t:</li>
<li><strong>Operand access.</strong> To access operands, the stack-based interpreter just reads and writes from the top of the stack or very near it. A register-based interpreter needs a two-step process: first it has to get the address of that register, then read or write it. But the register-based version doesn&#8217;t need to adjust the stack. Still, it seems like more work per instruction for the register version: for ADD, the register version needs to compute 3 operand addresses, while the stack version just one (the new stack pointer).</li>
</ul>
<p>Overall, when you look across instructions, I think the total amount of operand addressing computation is about the same. The stack version distributes it across more instructions. The stack version might be able to save a few by having special instructions like LOAD0 (to load slot 0, needing no computation). But the register version still has fewer instructions, so fewer dispatches. With direct threading, the dispatch is fairly efficient, so this is less important, but an indirect jump is still usually pretty expensive compared to a normal instruction because of the high branch mispredict probability.</p>
<p><strong>Microbenchmarking.</strong> I wrote a tiny interpreter that runs a bytecode program for an empty for loop using doubles as numbers (as in JS) to test out these things for myself. I tried all 4 combinations of two design choices: direct threading vs. switch dispatch, and stack vs. register operands. The times for 10M iterations in milliseconds:</p>
<table border="0">
<tbody>
<tr>
<td></td>
<td>Stack</td>
<td>Register</td>
</tr>
<tr>
<td>Switch</td>
<td>230</td>
<td>90</td>
</tr>
<tr>
<td>Direct</td>
<td>105</td>
<td>55</td>
</tr>
</tbody>
</table>
<p>Here you see huge differences. The reason the differences are so big (far bigger than WebKit got) is that my tiny interpreter&#8217;s opcodes are very simple. In a real JS interpreter, the code to run for the average operation is a lot longer, so dispatch overhead is a much smaller fraction of total time, and dispatch overhead is the main thing saved by direct-threading and registers.</p>
<p>Note that the speedup of register vs. stack is smaller in the direct-threaded case, as I predicted above. But not that much smaller, because the dispatch really is expensive compared to everything else in this simple interpreter.</p>
<p>You can get my microbenchmark code <a href="http://people.mozilla.com/~dmandelin/stackreg.cpp">here</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/06/03/squirrelfish/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tamarin Tracing Internals V: Running Compiled Traces</title>
		<link>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-v-running-compiled-traces/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-v-running-compiled-traces/#comments</comments>
		<pubDate>Thu, 29 May 2008 01:12:39 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=18</guid>
		<description><![CDATA[Whew. Reading all this TT code is fascinating, but also tiring, hard work. Anyway, I&#8217;ve hit almost all the high points by now, and I&#8217;ve traced out the JITting process all the way from ABC bytecode to native compiled traces. The questions I have left are about how traces actually get run, plus some related [...]]]></description>
			<content:encoded><![CDATA[<p>Whew. Reading all this TT code is fascinating, but also tiring, hard work. Anyway, I&#8217;ve hit almost all the high points by now, and I&#8217;ve traced out the JITting process all the way from ABC bytecode to native compiled traces. The questions I have left are about how traces actually get run, plus some related questions I&#8217;ve avoided about what side exits really are and how they work.</p>
<p><strong>Running Traces.</strong> The initial entry point into compiled code is back in <strong>Interpreter::loopedge</strong>, the same method that initiates tracing (see Part III). loopedge always checks to see if there is a compiled trace for this loop header. If so, it executes the compiled trace. (Look for the label <strong>callfrag</strong>.) Here&#8217;s the call:</p>
<p style="padding-left: 30px">lr = (*u.func)(&amp;state, 0);</p>
<p>The first argument is a pointer to the interpreter state. I think the second is something used only in debug modes. The result is a pointer to <strong>GuardRecord</strong>, which is defined in Assembler.h. The comment reads: &#8220;These objects lie directly in the native code pages and are used to house state information across the edge of side exits from a fragment.&#8221;</p>
<p>The key member of GuardRecord is <strong>Fragment* target</strong>, which gives the destination fragment (loop header) of the exit. If the destination is not a loop header (target == 0), the destination will be made into a fragment so that it can be traced if it becomes hot. The destination fragment will then get its count incremented, and if it is now hot, tracing starts immediately.</p>
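<p>The bookkeeping after a trace returns can be sketched roughly like this. It is not TT&#8217;s code: the structs keep only the fields mentioned above, and the hits counter, the HOT_THRESHOLD value, and the function name are made up for illustration.</p>

```c
#include <assert.h>

/* Heavily simplified stand-ins for TT's Fragment and GuardRecord. */
typedef struct Fragment {
    int hits;                 /* hypothetical hotness counter */
} Fragment;

typedef struct GuardRecord {
    Fragment *target;         /* destination of the exit, or 0 if none yet */
} GuardRecord;

enum { HOT_THRESHOLD = 2 };   /* made-up value for the sketch */

/* Handle the GuardRecord returned by a compiled trace. fresh is a new
 * fragment to use if the exit has no target yet. Returns 1 when the
 * destination has become hot and tracing should start there. */
int after_trace_exit(GuardRecord *lr, Fragment *fresh) {
    assert(lr != 0 && fresh != 0);
    if (lr->target == 0)
        lr->target = fresh;   /* make the destination a fragment */
    lr->target->hits++;
    return lr->target->hits >= HOT_THRESHOLD;
}
```

<p>The first exit through a guard with no target attaches a fragment and counts it; later exits through the same guard keep counting until the destination is hot.</p>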
<p><strong>Trace Exits: LIR.</strong> I need to back up a bit in order to fully understand how trace exits work.</p>
<p>During trace recording, branch instructions (e.g., IL LBRT) require special handling. The trace is linear, so we just generate straight-line LIR according to the branch that was actually taken. This is fundamental: we are guessing that since we took a certain branch now on a hot trace, we&#8217;ll probably take the same branch many times more, so the program will run fast if we generate straight-line code for this case. But of course, on any future execution, we&#8217;re not guaranteed to take the same branch again, so when we pass this point, we have to do the check again and exit the trace if we get the opposite result. The check is called a <strong>guard</strong>, and the exit is called a <strong>side exit</strong>. Here is an example from IL-&gt;LIR trace generation debug output:</p>
<p>T 11D6BE  BRF   -8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10<br />
35 imm   #0<br />
36 eq    33,#0<br />
GG: ip 11D6C2 sp 100E0B4 rp 100616C<br />
45 xf    36 -&gt; 11D6C2</p>
<p>The IL instruction is BRF, &#8220;branch if top-of-stack is false&#8221;. In this case, the top of the stack is d:10, i.e., 10.0, so the interpreter doesn&#8217;t take the branch. But we&#8217;re more interested in tracing. Tracing of branch instructions is implemented by <strong>Interpreter::jump_if</strong>. First, jump_if emits LIR for the test, specifically to test if the top of the stack is zero. This is the &#8220;imm #0&#8221; and &#8220;eq 33,#0&#8221;.</p>
<p>Now comes the scary part, calling <strong>Interpreter::guard</strong>. I would tend to consider the effect of this function more to be generation of a trace exit, but it&#8217;s called <strong>guard</strong>, probably because it generates the branch instruction for side exits. But it is also used for LIR_loop instructions, which don&#8217;t even really have guards.</p>
<p>Naming questions aside, for side exits, as in our example, the first thing guard does is print out the &#8220;GG&#8221; line (if in debug mode). The rest of the line shows some interpreter states, and is probably helpful if debugging TT. Next, guard generates a <strong>SideExit</strong> structure (Fragmento.h) inline with the LIR to describe the exit. The SideExit records:</p>
<ul>
<li>The interpreter state (frame, stack, return, and instruction pointers) as offsets from the interpreter state at the start of the trace.</li>
<li>The trace.</li>
<li>The target of the exit as a fragment, i.e., the (potential) start of another trace.</li>
<li>The current ActionScript call depth.</li>
</ul>
<p>This records interpreter state that is not otherwise encoded. When I went over LIR generation and optimization, I realized that the LIR contains all the store instructions needed to maintain the current interpreter stack data. (Some are optimized away in the dead store elimination pass.) But the LIR doesn&#8217;t update the interpreter state&#8217;s fp, sp, rp, or ip. At every exit we might be going back to the interpreter, so we need to recreate the full interpreter state. The SideExit contains the necessary information.</p>
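<p>Rebuilding the interpreter state from a side exit amounts to adding the recorded offsets to the state the trace was entered with. A sketch, with pointers modeled as plain integers and with struct and field names that are assumptions rather than TT&#8217;s actual SideExit layout:</p>

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the four interpreter pointers, as integers. */
typedef struct {
    intptr_t fp, sp, rp, ip;
} InterpState;

/* Hypothetical subset of a side exit: offsets relative to trace entry. */
typedef struct {
    intptr_t fp_adj, sp_adj, rp_adj, ip_adj;
    int calldepth;
} ExitInfo;

/* Compute the interpreter state to resume with after taking this exit. */
InterpState state_at_exit(InterpState entry, const ExitInfo *se) {
    assert(se != 0);
    InterpState s = entry;
    s.fp += se->fp_adj;
    s.sp += se->sp_adj;
    s.rp += se->rp_adj;
    s.ip += se->ip_adj;
    return s;
}
```

<p>Storing offsets instead of absolute values is what lets the same compiled trace be entered with different starting states and still hand a consistent state back to the interpreter.</p>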
<p>After writing the SideExit, guard generates an LIR branch instruction. In our example, we should exit if the test is false, so we generate an LIR_xf. Note the gap in instruction sequence numbers: this is because of the space taken up by the SideExit.</p>
<p>guard handles LIR_loop exits (jumps to the trace header) a little differently. Instead of writing a SideExit, guard emits LIR instructions that directly update the interpreter state. I&#8217;m not entirely sure why this is. I also think that in most cases, no adjustments are required, because the interpreter stack size and types should be the same every time control passes a given point. It may have something to do with recursion.</p>
<p><strong>Trace Exits: Native Code.</strong> A trace exit in LIR is a LIR_xf, LIR_xt, LIR_x, or LIR_loop. These all have cases inside <strong>Assembler::gen</strong>. For xf, xt, and x, the assembler calls <strong>asm_exit</strong> to generate exit target code, then generates native JMP/JE/JNE/Jx instructions that branch to the target. For loop, the assembler just generates a JMP instruction.</p>
<p>asm_exit is hard to understand, but I think I have the gist of it. The key action is calling <strong>nFragExit</strong>, which generates the exit target code. This code is generated on a separate page that is allocated for trace exits at the beginning of assembly (_nExitIns is the current position). nFragExit takes the <strong>SideExit</strong> struct as its argument. The SideExit gives the target of the exit as a <strong>Fragment</strong>, which is a loop/trace header that may or may not have a compiled trace. Reading backwards, nFragExit generates code to:</p>
<ul>
<li>Update the interpreter state using the offsets recorded in the SideExit.</li>
<li>Ensure that param 0 of the trace is stored in the standard param 0 argument passing register. This is needed if the exit code is ever set up to jump directly to another trace-that trace will expect param 0 in the usual place. (Param 0 is a pointer to the interpreter state.)</li>
<li>Return a newly created <strong>GuardRecord</strong> (Assembler.h). The GuardRecord is the native code equivalent of a SideExit. Like SideExit, it is stored inline with the code (the native exit code). The GuardRecord is created by <strong>placeGuardRecord</strong> and holds the current fragment, target fragment, and call depth.</li>
<li>Restore the ISA stack pointer (x86 esp).</li>
<li>Jump to the trace epilog.</li>
</ul>
<p>The trace epilog, by the way, is the same for every trace, and on x86 it pops the ISA frame pointer (ebp; twice, because it is pushed twice for some alignment reason) and returns. This is just the &#8220;second half&#8221; of the standard C return-from-function sequence.</p>
<p>The exit code can be summarized as updating the interpreter state and then both doing the &#8220;first half&#8221; of return-from-function and preparing a function call to another trace. That way, the ending JMP can be pointed at either the main exit to the interpreter, or made to jump directly to another trace, and either works fine.</p>
<p>Another detail is that if the target of the exit has already been compiled to native code, instead of generating a jump to the trace epilog, nFragExit generates a jump directly to the target trace. (It also skips creating the GuardRecord). This is nice because then the code doesn&#8217;t have to return to the interpreter at all, it just keeps executing native code.</p>
<p>asm_exit wraps the call to nFragExit with a pair of calls to <strong>swapptrs</strong>. This is a macro defined in Native*.h that swaps the pointer to the current position in the native trace code buffer (_nIns) and the current position in the native exit code buffer (_nExitIns). This is just so the macros that generate code can always refer to _nIns as the place to store native code.</p>
<p>Finally, asm_exit does a bunch of fancy register allocation stuff. I don&#8217;t completely understand it, but I think it&#8217;s just needed because the register allocation algorithm is a greedy algorithm for straight line code, and it needs a little tweak when there is a branch. It looks like asm_exit first saves a copy of the allocations and then clears them out so the exit code area has a clear set to work with, as it should (the only data passed out of the exit via registers are the return value and param 0, which the exit code does set up). Once nFragExit returns, the register tracker now has some allocations for values that are needed in the exit code, if any. At this point, <strong>mergeRegisterState</strong> is called with the current register tracker and the saved tracking data to fix everything up. The fixing is basically that if the exit code expects, say, ecx, to contain a certain value, and the main trace has a different value in ecx, a move needs to be generated at the start of the exit code to get the exit code&#8217;s value into ecx.</p>
<p><strong>Reentrancy.</strong> One last thing I want to think about is the issue of reentrancy. We&#8217;ve been told that TT isn&#8217;t reentrant. Specifically, a native method (implemented in C++) can&#8217;t call back into ActionScript. But I never clearly understood why this is. I&#8217;ll probably be wrong about half of this: experts, please jump in and correct.</p>
<p>The problem could exist at multiple levels, but I think the simplest issue is that the interpreter isn&#8217;t reentrant, for the usual reasons of having interpreter-global data structures. For example, a reentrant interpreter would need to have a mechanism for recording the reentry on some sort of stack. Also, if the native method interacts with the Forth stack, the system would need to be very careful about managing that. None of this seems fundamental, just tricky and not done yet.</p>
<p>The other question is what happens to tracing with reentry. One possibility is to stop tracing when entering a possibly reentrant native method, and then possibly start tracing when a native method calls back into ActionScript (i.e., consider a reentry to be a fragment header). This seems like it would work. Another possibility is to allow some declarations on native methods to describe their effects on the interpreter state, so that tracing could actually continue through the reentrant calls. Such a mechanism sounds hard to use, though, and would probably be used only on really important methods in a few places, if at all.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-v-running-compiled-traces/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tamarin Tracing Internals IV: Trace Optimization</title>
		<link>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-iv-trace-optimization/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-iv-trace-optimization/#comments</comments>
		<pubDate>Wed, 28 May 2008 17:30:05 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=17</guid>
		<description><![CDATA[In part III, I went over how TT generates LIR traces. Now, I&#8217;m going to look into the trace optimization and machine code generation process. The code for this is mostly in the nanojit/ directory.
Keep in mind that a trace is always straight-line code in SSA form. This makes optimizations easier to implement, so it [...]]]></description>
			<content:encoded><![CDATA[<p>In part III, I went over how TT generates LIR traces. Now, I&#8217;m going to look into the trace optimization and machine code generation process. The code for this is mostly in the nanojit/ directory.</p>
<p>Keep in mind that a trace is always <em>straight-line code</em> in <em>SSA </em>form. This makes optimizations easier to implement, so it has a big effect on the design. By the way, a lot of the material on SSA is confusing, and also goes straight into a lot of complexity that&#8217;s not needed for TT, so I&#8217;ll give a quick explanation here.</p>
<p><strong>SSA.</strong> SSA stands for <em>static single assignment</em>, but don&#8217;t bother trying to parse that. It just means that each virtual register in the trace appears on the left-hand side of exactly one assignment statement. This is automatically true for a TT trace, because the virtual register assigned to is implicitly the sequence number of the instruction. Note: locations that are not virtual registers, such as a slot on the data stack, may be assigned to multiple times; the SSA property in TT holds only for virtual registers.</p>
<p>The advantage of SSA is that any time you see a virtual register, you can just look at its one assignment statement, and you immediately know what it was assigned from, what kind of operator was used, whether the operands were constant, etc. Without SSA, there are multiple possible assignments, and an optimizer has to try to discover which assignments can actually reach the current point and then summarize their effects, which is slower and less precise.</p>
<p><strong>Constant Folding.</strong> TT performs <em>constant folding</em> as part of LIR generation. I discussed that in part III along with LIR generation, but I mention it again just to have the full list of optimizations here. Constant folding means transforming code like &#8220;a = 3 + 4&#8221; to &#8220;a = 7&#8221;, i.e., replacing a constant expression with the result of evaluating that expression. I should note that constant folding is also used with branches: a branch instruction with a constant conditional expression is dropped entirely, because the next instruction on the trace is simply the actual target of the branch.</p>
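<p>Here&#8217;s a minimal sketch of folding at emit time, written in Python rather than TT&#8217;s C++. The tuple-based trace and the <strong>emit</strong> helper are made up for illustration and are not nanojit&#8217;s actual LIR writer; the sketch only shows the two cases described above, folding an add of two immediates and dropping a guard whose condition is constant.</p>
<pre>
```python
# Hypothetical mini-IR for illustration only; this is not nanojit's encoding.
# Each instruction is a tuple (op, operand, ...). Its index in the trace is
# its sequence number, which is also the name of its result.

def emit(trace, op, *operands):
    """Append an instruction, folding constants where possible.

    Returns the sequence number of the result, or None if nothing was emitted.
    """
    def is_imm(seq):
        return trace[seq][0] == "imm"

    if op == "add" and all(is_imm(s) for s in operands):
        # Both operands are known constants: emit the sum as a new immediate.
        a, b = (trace[s][1] for s in operands)
        return emit(trace, "imm", a + b)

    if op in ("xt", "xf") and is_imm(operands[0]):
        # Guard on a constant condition. The recorder is already following
        # the path actually taken, so the guard is dropped entirely.
        return None

    trace.append((op,) + operands)
    return len(trace) - 1


trace = []
three = emit(trace, "imm", 3)
four = emit(trace, "imm", 4)
total = emit(trace, "add", three, four)   # folded into an immediate 7
guard = emit(trace, "xf", total)          # constant condition: not emitted
```
</pre>
<p>After this runs, the trace holds three immediates (3, 4, and 7) and no add or guard. The two original immediates are left behind as dead values for a later pass to ignore.</p>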
<p><strong>Ending Tracing.</strong> Trace optimization starts when a &#8220;complete&#8221; LIR trace has been generated. In principle, the tracer could stop tracing whenever it wanted to, so there&#8217;s no particular completeness property. I just mean that optimization doesn&#8217;t start until tracing stops. This is necessary because some optimization passes go backward over the trace.</p>
<p>Tracing is stopped by <strong>Interpreter::eot</strong> (end of trace). (<strong>eot</strong> is often invoked by the macro <strong>EOT</strong> defined in Interpreter.h, which records the reason for ending the trace in debug builds only.) Most of eot is just debugging and error checking. The key call is:</p>
<p>compile(strk.location, rtrk.location, assm, tracefrag);</p>
<p>The <strong>compile</strong> function is defined in LIR.cpp and performs common subexpression elimination, dead store elimination, and assembly. There is code to perform lifetime splitting just before assembly, but it is guarded by <strong>if (false)</strong> in the current code.</p>
<p><strong>Interpreter::eot</strong> is called if any of these conditions occur:</p>
<ul>
<li>(Interpreter::loopedge) The trace follows more than MAX_XJUMPS &#8220;cross-jumps&#8221;. A cross-jump is a backward branch that does not go to the loop header (i.e., the start of the trace). A cross-jump indicates the presence of nested loops. In the current standard configuration, MAX_XJUMPS is zero.</li>
<li>(Interpreter::pre_trace) The trace contains at least MAX_BLKS guards (i.e. side exits). This is checked before tracing each instruction. In the current standard configuration, MAX_BLKS is 10,000. There is a comment next to this check, &#8220;count # of guards to minimize heisenbugs&#8221;. I&#8217;m not sure what this means, but it might be something to make the point at which traces end more deterministic.</li>
<li>(Interpreter::eot_untraceable_prim) The current instruction cannot be traced, e.g., a general native method. This is called by the trace implementation for all untraceable primitives.</li>
<li>(Interpreter::eot_if_max_exits) The trace has returned from over MAX_EXITS ActionScript functions. Recall that tracing can go right through both function calls and returns. Tracing can start in the middle of a function, and the trace could go through many returns. This is called when EXITABC is traced. In the current standard configuration, MAX_EXITS is 2.</li>
<li>(Interpreter::eot_if_max_copies) This one&#8217;s a little tricky. The purpose is to avoid the trace explosion problem: code with k sequential if statements can generate 2^k traces, and if k is too big, this will use up all memory and crash the program. So TT counts the number of copies of an instruction that exist in different traces. If the count exceeds MAX_COPIES, TT ends the trace. Thus, the total memory used can be no more than (#instructions) * MAX_COPIES. Note that TT saves work by counting the number of copies only for CFGMERGE instructions, which are special no-ops that Verifier generates at all control-flow join points. An instruction is copied only if the CFGMERGE above it is copied, so counting CFGMERGEs is good enough. In the current standard configuration, MAX_COPIES is 2.</li>
</ul>
<p>I understand the purpose of <strong>eot_untraceable_prim</strong> and <strong>eot_if_max_copies</strong>, but not the MAX_XJUMPS, MAX_BLKS, and MAX_EXITS conditions, so please comment if you know.</p>
<p><strong>Common Subexpression Elimination (CSE).</strong> This is a standard compiler optimization. CSE replaces code like this:</p>
<p>x = y + z;<br />
w = y + z;</p>
<p>with code like this:</p>
<p>x = y + z;<br />
w = x;</p>
<p>This saves an operation, and may enable additional optimizations now that w is an exact copy of x.</p>
<p>Given the SSA property, two expressions that apply the same <em>pure</em> operation to the same virtual registers, e.g., &#8220;v1 + v2&#8221; in &#8220;v17 = v1 + v2&#8221; and &#8220;v20 = v1 + v2&#8221;, are always equivalent. A pure operation is one that has no side effects and depends only on its arguments. In TT, this includes basic arithmetic operators and also functions that are marked <strong>PURE-FUNCTION</strong> in Forth, e.g., stringlength.</p>
<p>In TT, CSE is performed by <strong>nanojit::cse</strong> in LIR.cpp (nanojit:: is a namespace qualifier). <strong>cse</strong> performs a single forward scan of the trace, detecting CSEs and replacing them as it goes. This is the classic <em>&#8220;value numbering&#8221;</em> technique. Detection is based on a hashmap where the key is the operator and operands. (See <strong>LInsHashSet</strong> in LIR.cpp, especially <strong>LInsHashSet::_equals</strong>.)</p>
<p>Replacement is performed by overwriting the redundant operation with a special <strong>LIR_tramp</strong> (trampoline) instruction. A LIR_tramp is simply a reference to another instruction in the trace. (In detail, a LIR_tramp has a 24-bit offset operand: if the sequence number of the LIR_tramp is <em>i</em>, and the operand is <em>offset</em>, then the instruction is a reference to the instruction at position <em>i+offset</em>.) My abstract CSE example above might look like this in real LIR with tramps:</p>
<p>15 fadd 5, 6 // 15 is sequence number/destination; 5, 6 are operands<br />
&#8230;<br />
24 fadd 5, 6</p>
<p>becomes:</p>
<p>15 fadd 5, 6<br />
&#8230;<br />
24 tramp -9</p>
<p>LIR_tramp isn&#8217;t an executable instruction. Rather, <strong>X tramp -Y</strong> is a directive: &#8220;Whenever you see <strong>X</strong> as an operand, instead read it as <strong>X-Y</strong>.&#8221; It&#8217;s almost like a macro definition inside LIR. &#8220;Macro expansion&#8221; is performed automatically inside <strong>LIns::oprnd1</strong> (the getter for the first operand of a LIR instruction) and related methods. The effect is that any operand that CSE can detect as equal to X will be <em>named</em> X thereafter. This, in turn, makes it easy to tell that all these Xs are equal and exposes more opportunities for CSE.</p>
<p>Treating tramps in this way is part of value numbering. Here&#8217;s an example of code it helps optimize:</p>
<p>3 fadd 1, 2<br />
4 fadd 1, 2<br />
&#8230;<br />
15 fadd 3, 8<br />
16 fadd 4, 8</p>
<p>3 and 4 are clearly redundant. So are 15 and 16, but they have syntactically different operands, so it looks like CSE will miss them. But it doesn&#8217;t. It converts the first two instructions to:</p>
<p>3 fadd 1, 2<br />
4 tramp -1</p>
<p>At this point, the next two instructions are now read as:</p>
<p>15 fadd 3, 8<br />
16 fadd 3, 8</p>
<p>and CSE gets the second redundancy as well. This example also shows why CSE is done as a forward pass: CSE applied at one point may create more opportunities for CSE farther down, so we can do all the CSEs in one pass only if we go forward.</p>
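<p>The same forward pass can be sketched in a few lines of Python. This is only an illustration under assumptions: the dict-based trace and the <strong>resolve</strong> and <strong>cse</strong> helpers are hypothetical stand-ins, not the real <strong>LInsHashSet</strong> code, and every instruction is assumed to be pure.</p>
<pre>
```python
# Hypothetical mini-IR: a dict mapping sequence number to (op, operand, ...).
# Operands are sequence numbers of earlier instructions. Only pure operations
# appear here, so every instruction is a CSE candidate.

def resolve(trace, seq):
    """Follow tramps so that equal values are always named the same way."""
    while seq in trace and trace[seq][0] == "tramp":
        seq += trace[seq][1]         # the tramp operand is a relative offset
    return seq

def cse(trace):
    """Single forward pass; redundant instructions are overwritten by tramps."""
    seen = {}                        # (op, resolved operands) to first seq
    for seq in sorted(trace):
        op, *operands = trace[seq]
        key = (op, *(resolve(trace, o) for o in operands))
        if key in seen:
            trace[seq] = ("tramp", seen[key] - seq)
        else:
            seen[key] = seq
    return trace


trace = {
    3: ("fadd", 1, 2),
    4: ("fadd", 1, 2),
    15: ("fadd", 3, 8),
    16: ("fadd", 4, 8),
}
cse(trace)
```
</pre>
<p>On the example above, instruction 4 becomes a tramp to 3. When the pass reaches 16, its operand 4 resolves to 3, so 16 hashes the same as 15 and becomes a tramp as well, all within the one forward scan.</p>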
<p><strong>Redundant Store Elimination (RSE).</strong> This is a restricted type of dead code elimination (DCE). Informally, RSE replaces this:</p>
<p>x = a + b;<br />
x = y + z;</p>
<p>with this:</p>
<p>x = y + z;</p>
<p>The first value of x is overwritten before it can be used anywhere else, so the first statement can be eliminated if it has no side effects.</p>
<p>Note that there are 2 assignments to x here in my example, so it is not in SSA form. This is correct: this TT pass is only applied to store instructions (<strong>LIR_st</strong> and <strong>LIR_sti</strong>), which can write to the same location multiple times.</p>
<p>General DCE can also eliminate instructions whose values are not overwritten but are never read anyway, but we can&#8217;t do that with TT stores. The reason is that TT stores write values to the interpreter stack, which may be used later once we exit the trace, by the interpreter or the next trace. While looking at the current trace, we really have no idea which of those values is used later on, so we have to keep them all. But TT does apply general DCE to non-store instructions, and for those instructions DCE is incredibly easy.</p>
<p>TT RSE is performed by <strong>nanojit::rmStores</strong> as a backward pass. rmStores scans the instructions, keeping track of (a) the depth of the stack, and (b) which stack positions (starting from the bottom) are stored to. rmStores does the tracking for both the data stack (sp) and the return stack (rp). For each store instruction, rmStores determines the stack position stored to, and removes the store if that position is (a) above the top of the stack, or (b) is stored to later on (i.e., earlier in the backward scan). Case (b) is just as in the example above. Case (a) picks up situations like storing a value to the top of the stack and then DROPping it.</p>
<p>rmStores must handle side exits specially. As above, we have to make sure the interpreter stacks are &#8220;correct&#8221; (i.e., look exactly how they would look if the interpreter had been running) when we exit the trace. This applies to side exits as well. So when the backward scan passes a side exit, it must mark everything on the stack as potentially live (by clearing the &#8220;stored-to&#8221; bits in its scan record).</p>
<p>This is important: it shows that side exits preclude some optimizations.</p>
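<p>To make the backward scan concrete, here&#8217;s a rough Python sketch. It is not rmStores itself: the tuple format is invented for illustration, it handles a single stack, and it covers only case (b) plus the side-exit rule, leaving out the stack-depth tracking needed for case (a).</p>
<pre>
```python
# Hypothetical trace format for illustration: ("st", slot, value) stores to
# an interpreter stack slot and ("exit",) is a side exit. Stack depth, and
# therefore case (a), is not modeled here.

def remove_redundant_stores(trace):
    overwritten = set()              # slots stored to later in the trace
    kept = []
    for ins in reversed(trace):      # backward scan
        if ins[0] == "exit":
            # The interpreter may read any slot after a side exit, so all
            # earlier stores are potentially live again.
            overwritten.clear()
        elif ins[0] == "st":
            if ins[1] in overwritten:
                continue             # a later store wins: drop this one
            overwritten.add(ins[1])
        kept.append(ins)
    kept.reverse()
    return kept


trace = [
    ("st", 0, "a"),    # dead: slot 0 is stored again before any exit
    ("st", 0, "b"),    # live: the side exit below may expose slot 0
    ("exit",),
    ("st", 0, "c"),    # dead: overwritten by the final store
    ("st", 1, "d"),
    ("st", 0, "e"),
]
optimized = remove_redundant_stores(trace)
```
</pre>
<p>Without the side exit, the store of &#8220;b&#8221; would be removed too. That is the cost of a side exit in this pass.</p>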
<p><strong>Assembly.</strong> This converts LIR to ISA (<strong>instruction set architecture</strong> code-the TLA way to say &#8220;machine language&#8221;).</p>
<p>The TT assembler performs register allocation (mapping the unlimited virtual registers to the very limited ISA registers) simultaneously. Offline compilers do register allocations by applying approximation algorithms to NP-hard graph-coloring problems, but the compilation time is too long for a JIT like TT, so TT uses an integrated single-pass greedy allocator. Note that nanojit/RegAlloc.h is not the register allocator: it&#8217;s just a data structure for tracking register mappings and free registers.</p>
<p>Assembly is platform-specific, so TT needs a mechanism to build different assemblers for different platforms. Here&#8217;s one of the key bits (in nanojit.h):</p>
<p>#include &#8220;Native.h&#8221;<br />
#include &#8220;LIR.h&#8221;<br />
#include &#8220;RegAlloc.h&#8221;<br />
#include &#8220;Fragmento.h&#8221;<br />
#include &#8220;Assembler.h&#8221;</p>
<p>Native.h includes another file, Native*.h (e.g. Nativei386.h), controlled by preprocessor defines. This imports platform size constants, register set definitions, and macros for code generation (e.g., CALL). Assembler.h defines the <strong>Assembler</strong> class. Assembler.cpp contains platform-independent assembler logic, including methods of Assembler, customized by referring to variables and macros defined in the platform-specific header. There&#8217;s also some stuff controlled by defines like NANOJIT_IA32, generally short code snippets that interact closely with otherwise platform-independent code. Finally, there is a Native*.cpp, which contains other methods of Assembler that are defined in a purely platform-specific manner.</p>
<p>The TT assembler works backward. I think this is because it does a few last optimizations which work best in a backward pass. Keep this in mind reading methods like Assembler::genEpilog-the first instruction generated is the return.</p>
<p>Assembler::assemble is the entry point, and Assembler::gen is where the per-instruction work is done.</p>
<p>Assembler::gen has a lot of details, so I&#8217;ll just look at one example. Here&#8217;s how a LIR_imm (place a constant value &#8220;immediate operand&#8221; in a virtual register) is assembled:</p>
<p>case LIR_imm:<br />
case LIR_imm32:<br />
{<br />
Register rr = prepResultReg(ins, GpRegs);<br />
int32_t val;<br />
if (op == LIR_imm32)<br />
val = ins-&gt;imm32();<br />
else<br />
val = ins-&gt;imm16();<br />
if (val == 0)<br />
XOR(rr,rr);<br />
else<br />
LDi(rr, val);<br />
break;<br />
}</p>
<p>The first step is to call prepResultReg to pick a register to store the result in. I&#8217;ll look at that later, but for now I assume it just works. The next step is to get the constant value itself from the LIR instruction. Finally, we call the LDi macro to generate the instruction, unless the constant is zero, in which case we just XOR the register with itself (x XOR x = 0 for all x), which is faster (although I don&#8217;t know why). The macros aren&#8217;t very exciting reading&#8211;they just do the bit-bashing to generate ISA opcodes, addressing mode bits, and operand encodings.</p>
<p><strong>Register Allocation.</strong> This code is pretty complicated, but I think I can outline what it does. The algorithm is conceptually simple; the complexities come from dealing with platform-specific details and special cases.</p>
<p>The algorithm tracks the set of free registers and the mapping from virtual registers to machine storage. The machine storage is represented by the Reservation class, which can name an ISA register, an activation record (stack frame) location, or both.</p>
<p>The first step for most instructions is to allocate registers to use for the operands.</p>
<p><strong>Assembler::findRegFor</strong> finds a register that holds the result of a given LIR instruction (e.g., an operand). If the LIR instruction has already been assigned a register, it returns that register. Otherwise, it searches for a free register and records the mapping.</p>
<p>One of the complexities is the second argument to findRegFor, <strong>RegisterMask allow</strong>. allow represents a set of allowed registers-the returned register must be in this set. This is needed because some operations can only be used on certain registers. In some cases, the value can&#8217;t be allocated directly in the allowed set, e.g., because it is computed by an instruction that cannot output to that register. Then TT issues an extra move instruction.</p>
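<p>A stripped-down version of that lookup might look like the Python below. Treat it as a sketch under assumptions: <strong>assigned</strong> and <strong>free</strong> are hypothetical stand-ins for the register tracker, register sets are plain Python sets instead of a RegisterMask, and two cases are left out on purpose (spilling, and the extra move when an existing register is outside the allowed set).</p>
<pre>
```python
# Hypothetical register tracker: `assigned` maps LIR sequence numbers to
# register names and `free` is the set of unused registers.

def find_reg_for(seq, allow, assigned, free):
    if seq in assigned:
        # Already has a register: reuse it. (This sketch ignores the case
        # where that register is outside `allow`.)
        return assigned[seq]
    candidates = sorted(free.intersection(allow))
    if not candidates:
        raise RuntimeError("no free register: a real allocator would spill")
    reg = candidates[0]
    free.remove(reg)
    assigned[seq] = reg              # record the mapping for later lookups
    return reg


assigned = {}
free = {"eax", "ecx", "edx"}
first = find_reg_for(58, {"ecx"}, assigned, free)           # only ecx allowed
again = find_reg_for(58, {"ecx", "edx"}, assigned, free)    # mapping reused
other = find_reg_for(60, {"eax", "ecx"}, assigned, free)    # ecx is taken
```
</pre>
<p>Here instruction 58 gets ecx and keeps it on the second lookup, while instruction 60 gets eax because ecx is no longer free.</p>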
<p>It is possible that there are no free registers. In this case, the solution is to spill a register. This means we pick a victim LIR instruction currently in a register, and store and load it around the current instruction. As a general example, we might have two different values that need to be computed into eax:</p>
<p>mov eax, ebx  // first instruction writing eax<br />
add ?, ecx       // want to use eax, but it&#8217;s occupied<br />
add esi, ?        // want eax from previous<br />
add edx, eax   // first eax again</p>
<p>We spill the first writer of eax like this:</p>
<p>mov eax, ebx<br />
mov ebp[-8], eax // spill eax to memory<br />
add eax, ecx        // eax now available<br />
add esi, eax<br />
mov eax, ebp[-8] // restore first eax<br />
add edx, eax</p>
<p>That&#8217;s simple enough, but it looks kind of tricky in TT, because TT doesn&#8217;t see the code all at once to make this transformation, but instead adds the spill and restore code during its backward pass.</p>
<p>In this example, TT would detect that a spill is required when assembling &#8220;add esi, ?&#8221;. It needs to use eax, but eax is already in use, so at this point, it knows that it has to emit code to restore the victim value to eax. This is done by <strong>Assembler::</strong><strong>asm_restore</strong>, which finds a free memory location for the victim and emits code to restore from that location. It also records that memory location in the victim&#8217;s reservation so it will know to spill the value later on. Note that for constant values, asm_restore knows it doesn&#8217;t need to load them from memory, but can just use an LDi (load immediate).</p>
<p>Once the operands are allocated registers, the assembler selects a register for the result using <strong>Assembler::prepResultReg</strong>. This again calls findRegFor. But in this case, a register was probably already allocated by a later instruction that uses this result, and that register will be returned directly. Now, if the register we select was previously selected to be spilled (in asm_restore), we need to generate the spill code. This is done by calling <strong>Assembler::asm_spill</strong>, which checks the reservation to see if a memory location has been allocated to this instruction&#8217;s result. If so, a store instruction is generated.</p>
<p>The register allocator is closely related to the DCE (dead code elimination) mechanism. Assembler::gen calls <strong>Assembler::ignoreInstruction</strong> on each instruction to see if no code should be generated. The basic idea is that if no storage has been reserved for the result of an instruction, then nothing ever reads the result, so it is dead (as long as there are no other side effects). LIR_tramps are always ignored, which fits with my earlier description of their being a special kind of nop.</p>
<p>Note that all of this is done backwards. Even the debug output is generated backwards and then reversed for printing. So if you want to read the assembler output in the order that TT processes things, read it backwards.</p>
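<p>The link between backward assembly and DCE is easier to see in a toy version. The Python below is a sketch only: the trace format, the <strong>SIDE_EFFECTS</strong> set, and <strong>assemble_backward</strong> are invented for illustration, and &#8220;reserving&#8221; a result is reduced to adding its sequence number to a set.</p>
<pre>
```python
# Hypothetical mini-IR: a list of (seq, op, operands) in program order.
# Operations listed in SIDE_EFFECTS must be kept even if nothing reads them.
SIDE_EFFECTS = {"st", "xt", "xf"}

def assemble_backward(trace):
    reserved = set()                 # results that a later instruction reads
    emitted = []
    for seq, op, operands in reversed(trace):
        if op not in SIDE_EFFECTS and seq not in reserved:
            continue                 # nothing reads the result: dead code
        reserved.update(operands)    # operands now need storage of their own
        emitted.append(seq)          # stand-in for generating machine code
    emitted.reverse()                # generated backward, flipped for reading
    return emitted


trace = [
    (1, "imm", ()),
    (2, "imm", ()),       # only read by 3, so it turns out dead as well
    (3, "add", (1, 2)),   # never read: dead
    (4, "mul", (1, 1)),
    (5, "st", (4,)),
]
live = assemble_backward(trace)
```
</pre>
<p>Because the scan runs backward, every reader of a value has been seen before the value&#8217;s own instruction is reached. That is why a single &#8220;is anything reserved?&#8221; check is enough, and why the dead add also takes instruction 2 down with it.</p>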
<p><strong>Assembly/Register Allocation Example. </strong>Here&#8217;s a bit of debug output from the assembler, showing a spill/restore:</p>
<p>58 qlo   47<br />
010E26AE  movd ecx,xmm0                esi(8) edi(6) xmm0(47)<br />
spill 58<br />
010E26B2  mov -8(ebp),ecx                ecx(58) esi(8) edi(6)<br />
60 eq    58,#0<br />
69 xt    60 -&gt; 11DED2<br />
010E26B5  test ecx,ecx                         ecx(58) esi(8) edi(6)<br />
010E26B7  je 10E3512                           ecx(58) esi(8) edi(6)<br />
GID 49<br />
71 arg   #4<br />
010E26BD  mov edx,4                            ecx(58) esi(8) edi(6)<br />
72 arg   58<br />
73 call  getslotvalue_box<br />
010E26C2  call 309E0:getslotvalue_box        esi(8) edi(6)<br />
restore 58<br />
010E26C7  mov ecx,-8(ebp)                      esi(8) edi(6)</p>
<p>The lines that begin with decimal numbers are LIR instructions. The indented lines that begin with hex addresses are the generated assembly. The assembly lines also show the current mapping of LIR instruction results to ISA registers. There are also some notes about 58 being spilled and restored. The &#8220;GID&#8221; line indicates a guard.</p>
<p>Let&#8217;s read it bottom to top. The last LIR instruction is &#8220;call getslotvalue_box&#8221;, which is a native call. Native calls potentially overwrite certain registers, including ecx. There is currently a value in ecx, the result of instruction 58. (This reservation was made earlier, i.e., farther down in the trace.) This value must be spilled. But that will happen further up. For now, TT just selects a spill location, -8(ebp), and emits the code to restore that location to ecx. Now, TT can emit the call instruction.</p>
<p>The previous instruction is &#8220;arg 58&#8221;, which means to load 58 into an argument storage location to be passed to a call. The LIR_arg instruction encodes the storage location, and it&#8217;s not shown in the debug output, but apparently this one is supposed to go in ecx. Because the value is already in ecx, no code is necessary, and none is generated. This logic is accomplished by the function <strong>Assembler::findSpecificRegFor</strong>, which is a thin wrapper that just calls <strong>findRegFor</strong> with a single allowed register. As explained above, if the allowed register can be reserved for that instruction, it is, and if not, a move is emitted.</p>
<p>The previous instruction is now &#8220;arg #4&#8221;, which means to load literal integer 4 into an argument storage location, this time edx. The argument is a constant, so all this has to do is emit an LDi instruction. There is no need to allocate registers because edx is potentially written by the call, so if an instruction was using edx, we would have selected it for spilling when we processed the call. At this time, edx is automatically available.</p>
<p>The previous instruction is &#8220;xt 60 -&gt; 11DED2&#8221;, an &#8220;exit if true&#8221; instruction. Exiting from traces is complicated, so I&#8217;m going to leave most of this for later. For now, just note that this instruction generates both the comparison instruction <strong>test</strong> and the branch instruction <strong>je</strong>. This is because of a classic compilation issue, which is that relational operators get compiled differently according to whether the result is used by a branch or by an arithmetic operator. TT&#8217;s solution is to compile the relational operator as part of compiling the branch, and then ignore it later, as shown on this trace.</p>
<p>Now we reach the first LIR instruction, &#8220;qlo 47&#8221;, which picks out the &#8220;low&#8221; half of a quad (64-bit operand). The result has been reserved as ecx, but memory has also been reserved (when we generated restore code earlier), so we know we need to spill the result now. After that, we can generate the move instruction for qlo.</p>
<p><strong>Next time:</strong> running compiled traces.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/05/28/tamarin-tracing-internals-iv-trace-optimization/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tamarin Tracing Internals III: LIR</title>
		<link>http://blog.mozilla.com/dmandelin/2008/05/23/tamarin-tracing-internals-iii-lir/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/05/23/tamarin-tracing-internals-iii-lir/#comments</comments>
		<pubDate>Sat, 24 May 2008 01:38:07 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[tamarin]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=16</guid>
		<description><![CDATA[Program Form 3: LIR. I believe LIR stands for low-level intermediate representation (although I&#8217;ve also heard linear intermediate representation). Typically, in a compiler or VM LIR is the lowest-level (and last) form of machine-independent compiler representation, and looks much like a machine-independent assembly language. TT&#8217;s LIR plays the same role but has some special features [...]]]></description>
			<content:encoded><![CDATA[<p><strong>Program Form 3: LIR.</strong> I believe LIR stands for <em>low-level intermediate representation</em> (although I&#8217;ve also heard <em>linear intermediate representation</em>). Typically, in a compiler or VM LIR is the lowest-level (and last) form of machine-independent compiler representation, and looks much like a machine-independent assembly language. TT&#8217;s LIR plays the same role but has some special features because it is designed specifically for efficient trace compilation. The most important feature is that a LIR trace is always straight-line code, with one or more exit points, but no targeted branches. (Actually, there has to be a target, but the target is not part of the LIR, it&#8217;s the address of an IL instruction.)</p>
<p>The LIR instruction set is defined in nanojit/LIR.h. LIR is much simpler than IL, with fewer than 64 opcodes. Most of them are familiar, e.g., LIR_add for integer addition. LIR_imm (&#8220;immediate&#8221;) sets a constant value, contained directly in the instruction word. All the control-flow instructions are exits: e.g., LIR_x to exit unconditionally, or LIR_xt to exit if a condition is true. The exception is LIR_loop, which jumps back to the start of the trace. TT traces always start at loop headers, so this is important. See also Mason Chang&#8217;s post on the LIR instruction set.</p>
<p>A LIR instruction is 32 bits, with 8 bits for the opcode and up to 3 operands. As with machine ISAs, there are different forms of instruction words to accommodate multiple operands, 24-bit immediate operands, etc.</p>
<p>LIR operands work differently to IL operands. Recall that IL is stack-based, so IL&#8217;s IADD takes its two arguments off the stack and pushes their sum. Because machine languages like x86 are register-based, not stack-based, LIR is also register-based. More precisely, LIR uses <em>virtual registers </em>(my terminology, not TT&#8217;s), which just means it can have as many registers as it wants. Mapping those registers onto the finite set available on each processor is the job of a lower-level register allocator.</p>
<p>LIR has an interesting implicit scheme of naming the virtual registers. Each LIR instruction has a sequence number, according to its position in the trace. The sequence number is (implicitly) also the name of the virtual register where the result is stored. Later instructions can use this result by sequence number. For example,</p>
<p>// Instruction 13: test whether result of instruction 12 equals immediate #48<br />
13 eq    12,#48<br />
// Instruction 22: exit if result of instruction 13 false<br />
22 xf    13 -&gt; 10C41CA</p>
<p>This design saves space, because result operands don&#8217;t need to be named explicitly. It also automatically represents instruction dependencies. Finally, because of this design, LIR is necessarily in SSA form, which will make optimizing traces easier.</p>
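<p>To make the implicit naming concrete, here is a minimal Python sketch. It is a toy model, not TT&#8217;s real instruction encoding: the tuple layout and the helper names emit and evaluate are my own assumptions. The point is only that an instruction&#8217;s position in the buffer doubles as the name of its result, so every &#8220;register&#8221; is written exactly once.</p>

```python
# Toy model of LIR-style implicit result naming (not TT's real encoding).
# Each instruction's position in the buffer is also the name of its result.

def emit(trace, opcode, *operands):
    """Append an instruction and return its sequence number."""
    trace.append((opcode, operands))
    return len(trace) - 1

def evaluate(trace):
    """Evaluate every instruction once, in order; results[i] is register i."""
    results = []
    for opcode, operands in trace:
        if opcode == "imm":
            results.append(operands[0])          # operand is a literal value
        elif opcode == "add":
            a, b = operands                      # operands are sequence numbers
            results.append(results[a] + results[b])
        elif opcode == "eq":
            a, b = operands
            results.append(results[a] == results[b])
        else:
            raise ValueError("unknown opcode: " + opcode)
    return results

trace = []
n40 = emit(trace, "imm", 40)
n8 = emit(trace, "imm", 8)
total = emit(trace, "add", n40, n8)
n48 = emit(trace, "imm", 48)
test = emit(trace, "eq", total, n48)
results = evaluate(trace)
```

<p>Because operands can only refer to earlier sequence numbers, the dependency graph falls out of the operand fields for free.</p>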
<p><strong>Transformation 2-&gt;3: IL-&gt;LIR. </strong>This is the actual trace generation. So while the earlier steps translated a method at a time, in this step TT operates on a trace (code path) at a time. The basic idea is that while TT executes IL, it will generate a trace of LIR instructions for the code path it follows.</p>
<p><strong>Activating Tracing. </strong>TT traces always start at loop headers. A loop header is any target of a backward branch; the IL generator ensures backward branches are loop edges. Recursive methods are also a form of loop, and so TT treats a recursive call the same as a loop edge.</p>
<p>Every time the interpreter follows a loop edge, it checks to see if it should begin tracing. This is encoded into vm_*_interp.h with the INTERP_CHECK_LOOPEDGE macro, which wraps a call to Interpreter::loopedge. Interpreter::loopedge maintains a count of the number of times it has seen each loop header address. If the count exceeds a threshold (HOTLOOP, which is 10 in the current version), it calls Interpreter::sot (&#8220;start of trace&#8221;) to start tracing. sot initializes data structures used by tracing, emits some boilerplate prolog LIR, and sets the interpreter tracing state flag. Finally, control will exit from the interpreter (Interpreter::do_interp, but the return statement is found in the macro _CHECK_MODESWAP, used in INTERP_CHECK_LOOPEDGE), with a return value that tells the main system loop to enter tracing mode.</p>
<p><strong>Tracing. </strong>Actual tracing of IL is performed by VMInterp::do_trace. do_trace continues interpreting IL, just as in the interpreter (do_interp). In addition, before interpreting each IL instruction, do_trace emits LIR for the instruction. For example, here is the do_trace implementation of my favorite IL op, IADD, from vm_fpu_trace.h:</p>
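<p>The hot-loop check is simple enough to sketch in a few lines of Python. This is a toy model under my own assumptions: the class and method names are hypothetical, and only the threshold value of 10 comes from the TT source. I read &#8220;exceeds&#8221; as a strict comparison, which is also an assumption.</p>

```python
# Toy loop-edge counter (hypothetical names; only HOTLOOP = 10 is from the post).
HOTLOOP = 10

class LoopEdgeCounter:
    def __init__(self, threshold=HOTLOOP):
        self.threshold = threshold
        self.hits = {}            # loop header address: times seen
        self.tracing = False

    def loopedge(self, header_addr):
        """Called on every backward branch; returns True when tracing should start."""
        count = self.hits.get(header_addr, 0) + 1
        self.hits[header_addr] = count
        if not self.tracing and count > self.threshold:
            self.tracing = True   # stands in for the call to Interpreter::sot
            return True
        return False

counter = LoopEdgeCounter()
decisions = [counter.loopedge(0x11E296) for _ in range(12)]
```

<p>With a strict comparison, the first ten visits to the header stay in the interpreter and the eleventh starts the trace.</p>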
<p>INTERP_FOPCODE_TRACE_BEGIN(IADD)<br />
interp.trace_binop(LIR_add, sp);<br />
INTERP_FOPCODE_TRACE_END(IADD)<br />
INTERP_FOPCODE_INTERP_BEGIN(IADD)<br />
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;<br />
INTERP_ADJUSTSP(-1)<br />
sp[0].i = tmp_i_0;<br />
INTERP_INVALBOXTYPE(sp[0])<br />
INTERP_FOPCODE_INTERP_END(IADD)</p>
<p>The second part is an exact copy of the code in vm_fpu_interp.h. I believe this code is produced by the Forth compiler. After preprocessing, the block above looks like this (in VMInterp.ii):</p>
<p>foplabel_TRACE_IADD: { {<br />
interp.trace_binop(LIR_add, sp);<br />
} ip += 1; {<br />
const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;<br />
sp += (-1);<br />
sp[0].i = tmp_i_0;<br />
} goto _goto_ip; }</p>
<p>The only addition for tracing is the call to interp.trace_binop(LIR_add, sp). In principle, trace_binop has a simple job: emit a LIR opcode for a binary operation. In reality, the tracer does some optimizations along the way and also must do some state bookkeeping.</p>
<p><strong>trace_binop. </strong>Here is the method signature for trace_binop:</p>
<p>void Interpreter::trace_binop(_LOpcode op, const Boxp sp, BoxUsage insize, BoxUsage outsize);</p>
<p>op is the LIR opcode to emit in the trace. sp points to the top of the interpreter operand (Forth) stack. The other two arguments give the operand sizes, because operands can be 32 or 64 bits, but I&#8217;m not going to worry about that just yet, and that aspect of the code is well-separated from the main logic.</p>
<p>Here is the body of trace_binop:</p>
<p>1        if (check_const_defc(2, sp, outsize)) return;<br />
2        LirWriter* lirout = tracefrag-&gt;lirbuf-&gt;writer();<br />
3        LInsp i = lirout-&gt;ins(LOpcode(op), use_q_or_lo(insize, sp-1), use_q_or_lo(insize, sp));<br />
4        varset(sp-1, i);</p>
<p>Line 1 tries a constant-folding optimization: informally, if both of the operands are constants, the tracer will compute the result, which is constant, and emit a LIR_imm instead of the binary operation. But what exactly is a constant operand in LIR? Recall that an operand is just the index of an instruction that computes a value. If that instruction is a LIR_imm, then the operand is constant.</p>
<p>Here I should say a bit about the region tracker. (Fortunately, Graydon explained it to me, so I didn&#8217;t have to work hard to figure out that part.) Recall that in the interpreter, the operands are the top two stack elements. At the start of trace_binop, that&#8217;s all we know. So in order to get the LIR operands for tracing, trace_binop has to map the stack values to their corresponding LIR operands (i.e., the LIR instructions that computed those stack values).</p>
<p>The region tracker (class RegionTracker in Interpreter.h) maintains this mapping and performs the lookup. Specifically, RegionTracker maintains a map from addresses (const void *) to instructions in the LIR trace buffer (LIns *). The addresses are considered to address fixed-width elements in a range starting from a zero position. This is perfect for tracking a stack. Also, the mapping can be implemented as array lookup, which is simple and fast.</p>
<p>Region tracker operations are wrapped for the interpreter by inline methods like varof, which maps a Box* interpreter stack operand to a LIR instruction, and varset, which emits a store operation for a result and updates the region tracker with the new result.</p>
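<p>Here is a rough Python sketch of the address-to-instruction map as an array lookup. The class name, constructor parameters, and method names are hypothetical; the real RegionTracker stores LIns pointers, while this toy stores plain sequence numbers.</p>

```python
# Toy region tracker: maps fixed-width slots in an address range to the
# trace instruction that last produced the value there. Names are hypothetical.

class ToyRegionTracker:
    def __init__(self, base_addr, slot_size, num_slots):
        self.base_addr = base_addr
        self.slot_size = slot_size
        self.slots = [None] * num_slots   # slot index: instruction number

    def _index(self, addr):
        offset = addr - self.base_addr
        if offset % self.slot_size != 0:
            raise ValueError("address is not slot-aligned")
        index = offset // self.slot_size
        if index not in range(len(self.slots)):
            raise IndexError("address outside tracked region")
        return index

    def set(self, addr, ins):
        self.slots[self._index(addr)] = ins

    def get(self, addr):
        return self.slots[self._index(addr)]

tracker = ToyRegionTracker(base_addr=0x1000, slot_size=8, num_slots=16)
tracker.set(0x1010, 333)    # the slot at offset 16 was produced by instruction 333
tracker.set(0x1008, 321)
```

<p>Lookup and update are both a subtraction, a division, and an array access, which is why this layout suits a stack so well.</p>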
<p>Back to trace_binop. Line 2 just accesses the LirWriter (LIR.h), the class that does the bit-packing to create LIR instructions.</p>
<p>Line 3 emits the binary operation LIR instruction. The only fancy part is in use_q_or_lo, which uses the region tracker to map an interpreter stack operand to the LIR instruction that created it.</p>
<p>Line 4 emits a LIR_sti to store the result (and updates the region tracker accordingly). This surprised me, because if the result of a LIR operation is implicitly stored in its instruction sequence number, why is an explicit store needed? Looking at example traces, I see that there are few or no corresponding LIR_ld instructions, so most of these stores must be dead. I think the LIR_sti is storing to a memory location that also represents the result, and will be used later for spilling by the register allocator.</p>
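<p>Putting the pieces together, the logic of trace_binop can be approximated as below. This is a sketch under stated assumptions, not the real method: a plain dict stands in for the region tracker, instructions are tuples, the FOLDERS table is my own invention, and the LIR_sti store from line 4 is left out.</p>

```python
# Toy trace_binop with constant folding. The stack-slot map stands in for
# the region tracker; all names here are hypothetical, not TT's.
import operator

FOLDERS = {"add": operator.add, "mul": operator.mul}

def trace_binop(trace, slot_to_ins, op, sp):
    """Emit LIR for a binary op on stack slots sp-1 and sp; result lands in sp-1."""
    a = slot_to_ins[sp - 1]
    b = slot_to_ins[sp]
    op_a, val_a = trace[a][0], trace[a][1]
    op_b, val_b = trace[b][0], trace[b][1]
    if op_a == "imm" and op_b == "imm":
        # Both operands are constants: emit the folded constant instead.
        trace.append(("imm", FOLDERS[op](val_a, val_b)))
    else:
        trace.append((op, a, b))
    slot_to_ins[sp - 1] = len(trace) - 1

# Instruction 0 is a value only known at run time; 1 and 2 are constants.
trace = [("param", "x"), ("imm", 2), ("imm", 3)]
slots = {0: 0, 1: 1, 2: 2}
trace_binop(trace, slots, "add", 2)   # 2 + 3 folds to imm 5
trace_binop(trace, slots, "add", 1)   # x + 5 cannot fold
```

<p>The first call never emits an add at all, which is the same effect as check_const_defc returning early on line 1.</p>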
<p><strong>Example. </strong>LIR is so low-level that nontrivial IL tends to create more LIR than I want to put here, but I did find a readable example of adding 1 to a number. This is from the -Dverbose output of avmshell (\ is a line continuation; I added a line break to fit this format):</p>
<p>T 11E296     op_OP_increment_plus_000a \<br />
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10<br />
325 int   #11E298<br />
327 sti   #11E298, #4(6)<br />
T 11F4E6      LITC1 \<br />
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10<br />
328 imm   #1<br />
330 sti   #1, #16(8)<br />
T 11F4E8      I2D \<br />
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1<br />
333 quad  #3FF00000:0<br />
335 sti   333, #16(8)<br />
T 11F4EA      DADD \<br />
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1<br />
336 fadd  321,333<br />
338 sti   336, #8(8)</p>
<p>The lines beginning with &#8220;T&#8221; show IL instructions. The &#8220;T&#8221; itself indicates that the line was printed while the interpreter was in tracing mode. The next number is the address of the IL instruction. After that comes the IL opcode. Finally, the top 8 elements of the interpreter stack are shown, with the topmost at the right. The format of the stack element is type:content, where type is a boxtype constant from Box.h and content is a hex representation of the value.</p>
<p>The indented lines following each &#8220;T&#8221; line show the LIR generated for that IL instruction. The LIR output format is similar to typical assembly languages, and is:</p>
<p>instruction-sequence-number opcode operands</p>
<p>Operands that are just numbers are references to other instructions by sequence number. Operands like #8 are immediate operands. #16(8) is read &#8220;#offset(base pointer)&#8221;, i.e., it says to take the value computed by instruction 8 as a pointer, then add 16 to that pointer. Most of the bases seem to be for instructions in the trace prolog that load things like the stack pointer. In particular, at least in this trace, the base pointer 6 is the interpreter return pointer and the base pointer 8 is the interpreter stack pointer.</p>
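<p>As a reading aid, the three operand forms can be captured by a small Python parser. The grammar is my own guess from the sample output (for instance, that offsets are decimal), and the function names are hypothetical. It does not handle exit instructions that print a target address.</p>

```python
# Toy parser for the LIR text format described above. The operand categories
# follow the post; the regular expression is my own guess at the grammar.
import re

def parse_operand(text):
    based = re.fullmatch(r"#(\d+)\((\d+)\)", text)
    if based:
        # "#offset(base)": offset added to the pointer computed by instruction base
        return ("offset", int(based.group(1)), int(based.group(2)))
    if text.startswith("#"):
        return ("imm", text[1:])          # immediate, kept as text
    return ("ref", int(text))             # reference to an earlier instruction

def parse_lir(line):
    seq, opcode, rest = line.split(None, 2)
    operands = [parse_operand(part.strip()) for part in rest.split(",")]
    return int(seq), opcode, operands

fadd = parse_lir("336 fadd  321,333")
store = parse_lir("338 sti   336, #8(8)")
```

<p>Here fadd refers to two earlier instructions, while store combines an instruction reference with an offset from base pointer 8.</p>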
<p>(Which leads me to ask, what are the return pointer and the stack pointer? They are part of the interpreter state (InterpState in Interpreter.h), which has 4 fields: the frame pointer, Frame* f; the instruction pointer, FOpcodep ip; the stack pointer, Boxp sp; and the return pointer, FOpcodepp rp.</p>
<p>* A Frame (StackTrace.h) represents an activation record (or &#8220;stack frame&#8221;) of the ActionScript code. Thus, Frames are pushed onto and popped from a stack as ActionScript methods are entered and exited. This stack is used for generating stack traces and security checks that depend on ActionScript execution context.<br />
* The instruction pointer points to the currently executing (Forth) IL instruction.<br />
* The stack pointer points to the top of the Forth data stack.<br />
* The return pointer points to the top of the Forth return stack. This is the stack that implements call and return from Forth subroutines.)</p>
<p>Let&#8217;s see if I can understand how the tracer works:</p>
<p>T 11E296     op_OP_increment_plus_000a \<br />
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10<br />
325 int   #11E298<br />
327 sti   #11E298, #4(6)</p>
<p>This first step is a call to a superinstruction. A call just means pushing the return address (note that it is the current address plus 2) onto the return stack: #4(6) means an offset of 4 from the start of the return stack. In the LIR, the first instruction loads a 32-bit immediate value, and the second instruction stores that value on the return stack. Note also the missing sequence number: LIR instructions are 32 bits, so loading a 32-bit value is done with a load instruction followed by the value.</p>
<p>T 11F4E6      LITC1 \<br />
-8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10<br />
328 imm   #1<br />
330 sti   #1, #16(8)</p>
<p>LITC1 (&#8220;literal constant 1&#8221;?) pushes the integer value 1. Here we load the immediate value 1, then store the value 1 on the data stack (recall that 8 is the stack pointer). I&#8217;m not sure why there is a missing sequence number: I think LIR_imm is just one instruction.</p>
<p>T 11F4E8      I2D \<br />
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 -3:1<br />
333 quad  #3FF00000:0<br />
335 sti   333, #16(8)</p>
<p>Note that the value &#8220;-3:1&#8221; has been pushed onto the data stack at the right. I think in this case, the value is actually an unboxed int, so the -3 is just whatever happened to be in that memory location before, and only the 1 is significant.</p>
<p>I2D converts an integer to a double. Here you see we are using LIR_quad (load an immediate 64-bit &#8220;quadword&#8221;) to push the IEEE 754 double value &#8220;3FF0000000000000&#8221;, more commonly known as 1.0. So there is no conversion code at all: TT has constant-folded the I2D operation because its operand is the constant int 1.</p>
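<p>The bit pattern is easy to double-check with Python&#8217;s struct module. The helper names below are mine; the only claim being tested is that the quad immediate in the trace is the IEEE 754 encoding of 1.0.</p>

```python
# Check that the quad immediate in the trace is the IEEE 754 encoding of 1.0.
import struct

def double_bits_hex(value):
    """Return the 64-bit IEEE 754 pattern of a double as 16 hex digits."""
    return struct.pack("!d", value).hex().upper()

def fold_i2d(int_constant):
    """What a constant-folded I2D would emit: the double with the same value."""
    return double_bits_hex(float(int_constant))

bits = fold_i2d(1)
high, low = bits[:8], bits[8:]
```

<p>The high word matches the 3FF00000 before the colon in the trace output, and the low word is zero.</p>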
<p>T 11F4EA      DADD \<br />
-3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:10 d:1<br />
336 fadd  321,333<br />
338 sti   336, #8(8)</p>
<p>Now in the data stack view, the last value is &#8220;d:1&#8221;, which means double value 1.</p>
<p>Finally, we do a DADD, or add doubles. This uses LIR_fadd, the floating-point addition operator, on operands 321 (not shown here) and 333, the LIR way to refer to the 1.0 we loaded previously. Then we store the result on the stack. Note that we store it 8 units down from the result of I2D: #8(8) instead of #16(8). This is because the stack has one fewer 8-byte element, as DADD has popped two operands and pushed only one.</p>
<p>Here&#8217;s the state after this instruction:</p>
<p>T 11F4EC      DUP          -8:0 -3:10AF520 -3:10AD150 -3:10AD240 -2:0 -2:0 -3:10AD240 d:11</p>
<p>The top of the stack is now 10+1=11. Way up above, I said that a loop header is considered hot and gets traced after 10 hits. And here we just incremented something by 1, and get 11. So even though there&#8217;s a ton of LIR before this, we know this is actually working on the variable &#8220;i&#8221; in the original program.</p>
<p>Next time: trace optimization.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/05/23/tamarin-tracing-internals-iii-lir/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tamarin Tracing Internals, Part II: Forth</title>
		<link>http://blog.mozilla.com/dmandelin/2008/05/21/tamarin-tracing-interals-part-ii-forth/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/05/21/tamarin-tracing-interals-part-ii-forth/#comments</comments>
		<pubDate>Thu, 22 May 2008 02:16:51 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=15</guid>
		<description><![CDATA[The Need for Forth Subroutines. I had a really hard time tracking down how TT adds a pair of numbers (ActionScript code like &#8220;sum += i&#8221;) until I finally figured out that ECMAScript &#8220;+&#8221; is not a primitive operation in TT. This makes perfect sense now, as &#8220;+&#8221; is complicated: it has to do different [...]]]></description>
			<content:encoded><![CDATA[<p><strong>The Need for Forth Subroutines.</strong> I had a really hard time tracking down how TT adds a pair of numbers (ActionScript code like &#8220;sum += i&#8221;) until I finally figured out that ECMAScript &#8220;+&#8221; is not a primitive operation in TT. This makes perfect sense now, as &#8220;+&#8221; is complicated: it has to do different things depending on the argument types (floating-point addition, string concatenation).</p>
<p>It would still be possible to make &#8220;+&#8221; an IL primitive implemented in C++, but it&#8217;s better not to, and TT doesn&#8217;t. The reason is that TT wants to be able to specialize the code for &#8220;+&#8221; when tracing. For example, if the arguments are doubles on the trace, then TT could emit only the floating-point addition code, which is faster and smaller than the general code. But this is hard to do right if &#8220;+&#8221; is an IL primitive. In that case, the tracer would need complex C++ code to perform the specialization for every primitive of this kind. And even the slightest difference between the tracer&#8217;s logic and the interpreted code would cause weird VM bugs. As Graydon told me, TT tries hard to avoid this kind of redundancy.</p>
<p>Instead, TT implements ECMAScript &#8220;+&#8221; (IL OP_add) as a subroutine of IL instructions. I&#8217;m not sure what the official TT terminology is, but <em>extern</em> is the word used in code identifiers. To execute an extern in interpreted mode, the system simply jumps to that subroutine, executes the IL instructions that carry out the case logic, and returns when done. In tracing mode, the system can trace and optimize the extern&#8217;s IL just as it does any other IL.</p>
<p><strong>Basic Forth. </strong>Externs are written in Forth. (The Adobe guys explained about Forth in comments to my last post.) Forth is pretty close to IL, so the Forth compiler can be pretty simple (fc.py, 1900 lines of code). I don&#8217;t know Forth, but I did once do a bunch of HP-28S programming, which used a Forthlike language.</p>
<p><a href="http://www.masonchang.com/search/label/Tamarin">Mason Chang</a> has some good <a href="http://www.masonchang.com/2008/03/tamarin-linking-forth-and-c.html">pointers on Forth</a>. I&#8217;ll also give a quick summary of what I found out. In Forth, the state of the system is pretty much just a stack, and Forth code is pretty much just a sequence of stack operations separated by spaces. The stack operations are called <em>Forth words</em> (&#8220;word&#8221; as in &#8220;command&#8221;, no relation to machine words). Take this code snippet:</p>
<p>0 i2d</p>
<p>&#8220;0&#8221; is a Forth word that means &#8220;push 0 onto the stack&#8221;. &#8220;i2d&#8221; is another Forth word, a TT primitive operator that converts the top of the stack from an int to a double. So the total effect of this snippet is to push double 0.0.</p>
<p>We can package this as a Forth subroutine named &#8220;float0&#8221; just like this:</p>
<p>: float0 0 i2d ;</p>
<p>&#8220;float0&#8221; is now a Forth word, usable like any other. The colon and semicolon start and end a definition. (I think colon and semicolon are &#8220;officially&#8221; Forth words themselves, although the TT Forth compiler fc.py treats them more like syntax.)</p>
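<p>For readers who, like me, don&#8217;t know Forth, the evaluation model fits in a short Python sketch. This is a toy with only the handful of words used here; it shares no code or design with fc.py, and like fc.py it treats colon and semicolon as syntax rather than as words.</p>

```python
# Toy Forth-like evaluator: a data stack, a dictionary of words, and
# colon definitions. It is far simpler than fc.py and shares no code with it.

def run(source, words=None, stack=None):
    words = {} if words is None else words
    stack = [] if stack is None else stack
    tokens = source.split()
    i = 0
    while i != len(tokens):
        tok = tokens[i]
        if tok == ":":
            end = tokens.index(";", i)            # find the closing semicolon
            name, body = tokens[i + 1], tokens[i + 2:end]
            words[name] = " ".join(body)
            i = end
        elif tok in words:
            run(words[tok], words, stack)         # user-defined word
        elif tok == "i2d":
            stack.append(float(stack.pop()))      # int to double
        elif tok == "f+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        else:
            stack.append(int(tok))                # a number pushes itself
        i += 1
    return stack

result = run(": float0 0 i2d ; float0 1 i2d f+")
```

<p>The program defines float0, calls it to push 0.0, pushes 1.0, and adds the two doubles.</p>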
<p><strong>Cases. </strong>Another interesting Forth feature is case &#8220;statements&#8221; (not sure what the correct term is). I think this is a typical feature but the exact way it is defined in TT is specific to TT. In TT, a Forth word can be defined as a case statement, which seems to be used mostly for dynamic dispatch. E.g.:</p>
<p>CASE: toboolvec ( xvalue bt &#8212; bool )<br />
( 0 BoxedDouble ) d2b<br />
( 1 BoxedNull ) no-op<br />
&#8230;<br />
( 8 BoxedInt ) i2b ;</p>
<p>This defines a new Forth word toboolvec, which converts a value to a boolean. (I think &#8220;vec&#8221; is a TT conventional ending for case-defined words.) TT Forth CASE is pretty tricky, and I had to look around for a while to figure out what it really does.</p>
<p>The first thing to note is that &#8220;( &#8230; )&#8221; is a comment in Forth. The first comment in the CASE above is a stack comment, documenting the effect of the word on the stack. This word pops 2 inputs: xvalue, a boxed value to convert to a boolean, and bt, a boxtype constant that represents the type of xvalue. The word then pushes one value: the boolean representation of xvalue.</p>
<p>Second, unlike a C switch, these CASEs don&#8217;t represent the conditions for each case explicitly. Rather, a CASE pops the top value (which should be a boxtype here), which must be an integer k, and then executes the kth word of the body of the case. The comments in the case body document this relationship and make it a lot easier to read.</p>
<p>toboolvec is used by an easier-to-use word, tobool:</p>
<p>: tobool ( xvalue &#8212; bool ) boxtype toboolvec ;</p>
<p>boxtype ( x &#8212; x i ) pushes the boxtype constant indicating the type of the top of the stack, which must be a boxed value. So tobool just converts the top of the stack to a boolean.</p>
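<p>The positional dispatch can be sketched in Python as follows. Everything here is illustrative: the box type numbering, the converter bodies, and the make_case helper are my own assumptions, and only the shape (pop k, run the kth word) follows the description above.</p>

```python
# Toy model of a TT-style CASE word: pop an integer k, run the kth body word.
# The box type numbers and converters here are made up for illustration.

def d2b(stack):
    stack.append(stack.pop() != 0.0)

def i2b(stack):
    stack.append(stack.pop() != 0)

def make_case(body):
    def case_word(stack):
        k = stack.pop()          # the selector is consumed
        body[k](stack)           # position in the body picks the word
    return case_word

BOXTYPE = {float: 0, int: 1}     # hypothetical numbering

def boxtype(stack):
    """( x -- x i ): push the type index of the top of stack, keeping x."""
    stack.append(BOXTYPE[type(stack[-1])])

toboolvec = make_case([d2b, i2b])

def tobool(stack):
    boxtype(stack)
    toboolvec(stack)

doubles = [0.0]
tobool(doubles)
ints = [7]
tobool(ints)
```

<p>No condition is ever tested explicitly; the order of the words in the body is the whole dispatch table.</p>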
<p><strong>OP_add.</strong> Now I actually know enough to understand how OP_add is defined. It&#8217;s defined in vm.fs in several pieces, but basically it uses the boxtype word to get the type of both operands, then case words that double dispatch to a type-specific addition operator. When adding two doubles the final addition operator is w_fadd. This is defined:</p>
<p>EXTERN: w_fadd f+ dbox ;</p>
<p>f+ is another name for DADD, which is defined:</p>
<p>PRIM: DADD (( f0 f1 &#8212; fr ))<br />
interp:{ fr = f0 + f1 ; }<br />
trace:{ interp.trace_binop(LIR_fadd, sp, USE_Q, USE_Q); } ;</p>
<p>(Mason explained <a href="http://www.masonchang.com/2008/04/forth-language-interpreter-in-c.html">how primitives work</a>.)</p>
<p>Sadly, it appears from reading the int+int case that when adding two ints, TT must promote them to doubles because ECMAScript requires numeric addition to work as if the operands were doubles. I wonder how much of a performance penalty this is for array-iteration code and if there are any opportunities to optimize this by figuring out when it&#8217;s safe to keep the int representation, perhaps by loop variable interval analysis or speculation&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/05/21/tamarin-tracing-interals-part-ii-forth/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Tamarin Tracing Internals, Part I</title>
		<link>http://blog.mozilla.com/dmandelin/2008/05/16/tamarin-tracing-internals-part-i/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/05/16/tamarin-tracing-internals-part-i/#comments</comments>
		<pubDate>Sat, 17 May 2008 01:53:18 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[tamarin]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=13</guid>
		<description><![CDATA[Tamarin (technically, tamarin-tracing, henceforth TT)-related projects keep peeking up at me from the horizon. First, there&#8217;s a good chance I&#8217;ll have an intern working on TT this summer. And then there&#8217;s this &#8220;Tracehydra&#8221; idea. It&#8217;s a way to connect Spidermonkey&#8217;s JS parser with the TT execution engine. This is the plan:

Where &#8220;profit&#8221; means &#8220;run Javascript [...]]]></description>
			<content:encoded><![CDATA[<p>Tamarin (technically, tamarin-tracing, henceforth <i>TT</i>)-related projects keep peeking up at me from the horizon. First, there&#8217;s a good chance I&#8217;ll have an intern working on TT this summer. And then there&#8217;s this &#8220;Tracehydra&#8221; idea. It&#8217;s a way to connect Spidermonkey&#8217;s JS parser with the TT execution engine. This is the plan:</p>
<p><img src="http://blog.mozilla.com/dmandelin/files/2008/05/tracehydra.png" alt=""></p>
<p>Where &#8220;profit&#8221; means &#8220;run Javascript really fast&#8221;. Tracehydra would be the fluffy cloud that translates Spidermonkey bytecode to Tamarin IL (or possibly LIR; the details get confusing fast). (In the interest of reducing confusion slightly, I&#8217;ll say that <b>IL</b> stands for <i>intermediate language</i>, and is roughly a synonym for <i>bytecode</i>. TT people often refer to their IL as &#8220;Forth&#8221; because they based the design on Forth or something, but I know nothing more about Forth than that it involves stacks, so that doesn&#8217;t help me.) </p>
<p>Specifically, Tracehydra means using Treehydra to translate the Spidermonkey (SM hereafter) C code that interprets each bytecode into C++ that emits equivalent (to the C) Tamarin IL. So I guess it reduces the problem from translating SM bytecode to TT IL to translating SM C cases to TT IL-building code. Put that way, it&#8217;s not clear this actually helps, but I think SM bytecode is believed to have complex semantics that would be difficult to code in TT IL by hand, and maybe the C in SM has fewer constructs that are easier to translate. Seems possible, anyway.</p>
<p>If I&#8217;m gonna make any sense out of this I need to learn something about TT IL and how the TT VM uses it.</p>
<p><b>Digging in to Tamarin.</b> There doesn&#8217;t seem to be a lot of documentation on TT, so I thought I&#8217;d write up whatever I managed to figure out, for my own benefit at least. By the way, I&#8217;ve probably gotten some things wrong, so any TT experts are highly encouraged to correct me.</p>
<p>I should mention that Chris Double&#8217;s Tamarin posts have been invaluable: they are the best source I know of that explains how to actually build and run TT. And Graydon Hoare&#8217;s diagrams and comments are what got me started on having any idea where to look for anything in the code.</p>
<p>My first question was, what is the &#8220;life cycle&#8221; of a program run by TT? The TT shell as it exists today runs ActionScript programs, specifically ABC files (ActionScript bytecode). Once the tracer kicks in, TT is running machine code traces (e.g. x86-64 ISA). (Ugh. This is turning into alphabet soup.) What goes on in between? Here&#8217;s what I found.</p>
<p>By the way, this picture gives an overview. (Picture does not exist yet.)</p>
<p><b>Program Form 0: ActionScript. </b>I followed a sample program, the simplest program I could think of that will get traced (you need a hot loop), through TT. Here&#8217;s my program:</p>
<p>&nbsp; var sum = 0;<br />&nbsp; for (var i = 0; i &lt; 1000000; ++i) {<br />&nbsp;&nbsp;&nbsp;&nbsp; sum += i;&nbsp;&nbsp;&nbsp; <br />&nbsp; }<br />&nbsp; print(sum);</p>
<p>By the way, on my MacBook, this runs in 1.7s in interpreted mode (tracing disabled) and .28s with tracing enabled, a 6x speedup. And that includes VM startup time. For comparison, Java runs the equivalent program in .13s. (But Java gets the answer &#8220;wrong&#8221;, because the sum overflows Java&#8217;s 32-bit int, while ActionScript falls back to doubles, which hold this sum exactly. So this test is unfavorable toward TT.) Excluding startup time, Java takes about 4ms, apparently, the same as C. I&#8217;ll have to retest TT excluding startup time, once I figure out how.</p>
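<p>A quick check of the arithmetic behind the &#8220;wrong&#8221; answer, in Python. The wrap_int32 helper is my own; it mimics two&#8217;s-complement 32-bit wraparound, which is how Java int addition behaves.</p>

```python
# Sum of 0..999999 computed exactly, as a double, and with 32-bit wraparound.

def wrap_int32(value):
    """Reduce an integer to the signed 32-bit range, as Java int arithmetic does."""
    value %= 2 ** 32
    return value - 2 ** 32 if value >= 2 ** 31 else value

exact = sum(range(1000000))

as_double = 0.0
as_int32 = 0
for i in range(1000000):
    as_double += float(i)
    as_int32 = wrap_int32(as_int32 + i)
```

<p>The exact sum is 499999500000, which needs more than 32 bits. The double accumulator reproduces it exactly, while the wrapped accumulator ends up at 1783293664.</p>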
<p><b>Program Form 1: ABC. </b>This is the ActionScript bytecode and is the input to the current Tamarin shell. I don&#8217;t need to know too much about this, but the real Tamarin stuff is generated from here, so I peeked at the ABC for my program. ABC is apparently a stack-based bytecode language, much like Java bytecode. Here&#8217;s a snippet from my sample program, with my annotations after //:</p>
<p>&nbsp; // var sum = 0;<br />&nbsp; // sum was assigned local variable &#8216;slot 0&#8217; in ABC file headers<br />&nbsp; // Stack starts as: [stuff]<br />&nbsp; pushbyte 0<br />&nbsp; // Pushed a 0 value onto the stack. Stack is now: [stuff] 0<br />&nbsp; getglobalscope<br />&nbsp; // Pushed the global &#8216;this&#8217; onto the stack. Stack is now: [stuff] 0 this<br />&nbsp; swap<br />&nbsp; // Swapped top 2 elements of stack. Stack is now: [stuff] this 0<br />&nbsp; setslot 1<br />&nbsp; // Stored top of stack into slot 1 of next stack element. Stack is now: [stuff]</p>
<p>I&#8217;m not entirely sure how the compiler writers manage to get all the swap, over, dup, pick3, etc. operators right, but the code itself is understandable enough.</p>
<p><b>Transformation 0-&gt;1: ActionScript-&gt;ABC.</b> This can be done externally to TT by the ActionScript Compiler, ASC, which is part of the Flex SDK.</p>
<p><b>Program Form 2: IL.</b> Tamarin IL is yet another stack-based bytecode. This is the IL that the VM executes directly when in interpreter mode. The basic operations include:</p>
<ul>
<li>Stack manipulators, such as DUP, DROP, OVER,</li>
<li>Arithmetic and logical operators, such as IADD,</li>
<li>Control flow operators, such as LBRT,</li>
<li>Object and variable storage operators, such as SETSLOTVALUE_I,</li>
<li>Interpreter internals operators, such as _debugenter, verify_x, and</li>
<li>Weird stuff like op_ROTNAME2_SWAP_ROT_SETRT.</li>
</ul>
<p>I think the weird stuff might be &#8220;superinstructions&#8221;, which are instructions that just implement a short sequence of basic instructions. Apparently they help interpreters run faster by reducing decode overhead, like old CISC processors. </p>
<p>These instructions are defined in files named core/vm_*.h, e.g., vm_min_interp.h. These files are heavily macroized, which allows the actual meaning to be controlled by defining those macros in different ways, although I&#8217;m not sure exactly how this feature is used yet. Here&#8217;s IADD:</p>
<p>&nbsp; INTERP_FOPCODE_INTERP_BEGIN(IADD)<br />&nbsp;&nbsp;&nbsp; /* IADD None {&#8217;stktop&#8217;: 0} */<br />&nbsp;&nbsp;&nbsp; const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;<br />&nbsp;&nbsp;&nbsp; <br />&nbsp;&nbsp;&nbsp; INTERP_ADJUSTSP(-1)<br />&nbsp;&nbsp;&nbsp; sp[0].i = tmp_i_0;<br />&nbsp;&nbsp;&nbsp; INTERP_INVALBOXTYPE(sp[0])<br />&nbsp; INTERP_FOPCODE_INTERP_END(IADD)</p>
<p>As you can see, it adds the top two elements of the stack (sp) as integers, cuts the top element off the stack, and stores the sum in the new stack top. There&#8217;s also some junk about invalidating the box type, which seems to be some kind of debugging feature. When this is included in VMInterp.cpp, the begin and end macros will turn it into a switch case. From VMInterp.ii in my build:</p>
<p>&nbsp; foplabel_IADD: { pre_interp(interp, f, ip+-1, sp, rp);<br />&nbsp;&nbsp;&nbsp; const int32_t tmp_i_0 = int32_t(sp[-1].i) + int32_t(sp[0].i) ;<br />&nbsp;&nbsp;&nbsp; sp += (-1);<br />&nbsp;&nbsp;&nbsp; sp[0].i = tmp_i_0;<br />&nbsp;&nbsp;&nbsp; do { } while (0);<br />&nbsp; goto *k_foplabels_interp[*ip++]; }</p>
<p>pre_interp just prints out a trace of the instruction if the interpreter is in verbose mode and a bunch of other conditions are true. The &#8220;computed goto&#8221; at the end is an indirect jump to the case for the next instruction. This is some kind of optimization but I&#8217;ve never gotten a really convincing answer as to why it works, or if in fact it works, so I won&#8217;t go into it here. Processor experts, feel free to educate me.</p>
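<p>The stack effect of IADD and the dispatch step can be mimicked in Python. This is only an approximation under my own assumptions: the opcode numbers and the LIT and HALT opcodes are made up, and Python has no computed goto, so a table lookup inside a loop stands in for the indirect jump at the end of each handler.</p>

```python
# Toy stack interpreter with table-driven dispatch. Opcode numbers are
# made up; only the IADD stack effect mirrors the C++ above.
LIT, IADD, HALT = 0, 1, 2

def op_lit(code, ip, stack):
    stack.append(code[ip])        # operand follows the opcode
    return ip + 1

def op_iadd(code, ip, stack):
    total = stack[-2] + stack[-1] # add the top two elements
    stack.pop()                   # sp += -1
    stack[-1] = total             # store the sum in the new top
    return ip

HANDLERS = {LIT: op_lit, IADD: op_iadd}

def interpret(code):
    stack, ip = [], 0
    while code[ip] != HALT:
        handler = HANDLERS[code[ip]]   # look up the next handler by opcode
        ip = handler(code, ip + 1, stack)
    return stack

stack = interpret([LIT, 10, LIT, 1, IADD, HALT])
```

<p>In the C++ version each handler performs the lookup and jump itself, so there is no shared loop header to return to between instructions.</p>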
<p>The main control flow operators (i.e., used for if) seem to be LBRT and LBRF (local branch if true/false?). LBRT branches to a selected point in the IL sequence if the top of the stack is true, interpreted as a boolean. The target is specified as an offset from the address of the LBRT instruction. The target is given in the IL stream in the two 16-bit units immediately following the LBRT code.</p>
<p>I also wanted to know what a function call looks like in IL, and it was surprisingly hard to figure out for some reason that I still don&#8217;t understand. But it looks like a standard call to a JS function is through an opcode w_callprop_only or w_callprop_argcok. These opcodes don&#8217;t seem to be defined in the usual vm_*_interp.h, but somehow they are made to branch to foplabel_TRACE_super_or_extern in VMInterp.ii. That code does the usual saving of a return pointer and setting the instruction pointer. Returning is accomplished with a less mysterious w_returnvalue or w_returnvoid opcode.</p>
<p>The type of the instructions is FOpcode, which is a 16-bit int. </p>
<p><b>Transformation 1-&gt;2: ABC-&gt;IL. </b>This is where Tamarin starts. This step is really important because some other system, like Tracehydra, that wants to use Tamarin, should basically do the same thing, except for some other form of input instead of ABC.</p>
<p>Tamarin performs this transformation on one method at a time, which is typical of JITs. The transformation is performed by Verifier::verify, which simultaneously verifies the ABC (checks for ill-formed ABC, like the Java bytecode verifier) and outputs Tamarin IL. The entry point to the verifier is apparently Interpreter::verify_x, which is just the implementation of an IL instruction also called verify_x.</p>
<p>That last part was probably pretty confusing. It surprised me, at least. It means the interpreter is already running IL by the time it actually translates and runs any ABC. I think what this means is that when the shell starts up, it starts the interpreter with a bit of boilerplate IL. The IL itself has code to verify and run the ABC program. </p>
<p>[Warning: compiler-expert-level material.] The verifier works as an abstract interpretation of the ABC. It uses the information gathered both to check for problems with the ABC and to guide IL generation. The state of the abstract interpretation is an abstraction of the ABC stack, just a list of types of objects on the stack, along with a lot of flags describing other conditions. </p>
<p>A standard abstract interpreter works as a fixed-point solver that possibly makes several iterations over the code. But TT&#8217;s verifier doesn&#8217;t work like that at all. The reason is that it requires that the states be equal at every join point, otherwise the ABC is invalid. So if the ABC is valid, then the interpreter never needs to look at a given instruction twice. And because backward branches in the ABC exist only for loops, this means the verifier can just run in a single forward pass through the ABC bytecode sequence. But it&#8217;s still really an abstract interpreter. Pretty neat. [End super-hard stuff.]</p>
<p>So the verifier runs over the ABC in sequence, always tracking the current abstract stack state. For each instruction, it generates IL. Part of generating the IL is generating any IL needed to fetch operands-the verifier can use the stack state to figure out where they are. After generating IL, the verifier simulates the effect of the instruction on the stack. This is all handled by a big switch on the ABC opcode, and each case has separate logic to generate IL and simulate the results.</p>
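<p>To make the single-pass idea concrete, here is a toy sketch in Python. It is not Tamarin code: the function name <code>verify</code>, the tuple encoding of ABC instructions, and the handling of only three opcodes are all assumptions for illustration, and the opcode-to-IL choices simply mirror the log trace shown in this post.</p>

```python
# Toy single-pass "verifier": walks ABC-like opcodes once, tracks an abstract
# stack of type names (no values), and emits IL-like opcode names.
# Hypothetical sketch; the IL choices just mirror the trace in this post.

def verify(abc, frame):
    """Return (il, final_frame) for a list of (opcode, arg) pairs."""
    stack = list(frame)          # abstract state: types only
    il = []
    for op, arg in abc:
        if op == "pushbyte":
            if arg != 0:
                raise ValueError("toy only handles pushbyte 0")
            il += ["LITC0", "w_ibox"]      # push a C int, then box it
            stack.append("int")
        elif op == "getglobalscope":
            # Toy assumption: the global scope sits just under the top.
            if len(stack) < 2 or stack[-2] != "global":
                raise ValueError("global scope not where expected")
            il.append("OVER")
            stack.append("global")
        elif op == "swap":
            if len(stack) < 2:
                raise ValueError("stack underflow")   # ill-formed input
            il.append("SWAP")
            stack[-1], stack[-2] = stack[-2], stack[-1]
        else:
            raise ValueError("unknown opcode: " + op)
    return il, stack

il, frame = verify(
    [("pushbyte", 0), ("getglobalscope", None), ("swap", None)],
    ["global", "*", "*", "global"],
)
```

<p>Each branch of the <code>if</code> chain does the two jobs described above: emit IL, then simulate the instruction&#8217;s effect on the abstract stack.</p>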
<p>The actual writing of the bits and bytes of IL is delegated to a class MethodWriter, which in turn delegates to ForthWriter (there&#8217;s that Forth stuff again). ForthWriter is a small class that maintains a buffer of IL bytecodes. It has a small API with methods like emit_simple(), which emits a simple IL instruction. MethodWriter is a wrapper used to generate Forth from ABC. The MethodWriter API takes ABC instructions as input and then tells the ForthWriter to emit the corresponding Forth bytecodes. For ABC opcodes with a direct IL equivalent, it looks up the equivalent in k_abc_opcode_map, which ultimately comes from vm_*_codepool.h. Otherwise it just has to work a little harder.</p>
<p>Time for a little example. I&#8217;ll use the same snippet I used above, with the IL translation trace straight out of the log file:</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global <br />&nbsp; 2:pushbyte 0<br />+000D LITC0<br />+000E w_ibox<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global int <br />&nbsp; 4:getglobalscope<br />+000F OVER<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global int global <br />&nbsp; 5:swap<br />+0010 SWAP<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global global int <br />&nbsp; 6:setslot 1<br />+0011 LITC4<br />+0012 w_ck_setslot_box<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global </p>
<p>I had no idea what any of this was the first time I looked at it. There are 3 kinds of lines. First, the &#8220;frame: &#8221; lines are the abstract interpreter&#8217;s stack state (&#8220;stack frame&#8221;) at each point. Second, the lines that start with numbers are ABC bytecodes. Third, the lines that start with &#8220;+&#8221; are the IL translation of the ABC code just above them.</p>
<p>The first bytecode:</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global <br />
&nbsp; 2:pushbyte 0<br />
+000D LITC0<br />
+000E w_ibox<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global int </p>
<p>&#8216;pushbyte 0&#8217; in ABC pushes the value 0 onto the stack. In IL, this is two steps: first, an opcode that pushes the literal C int value 0 (i.e., a machine word of 0 bits), then an opcode that boxes that C int into a Box. Box is Tamarin&#8217;s standard value type that can hold anything. To C, a Box looks like an IEEE-754 64-bit NaN. I guess any float that has an exponent (bits 52-62) of all 1s and a fraction that is not all 0s is a NaN, so there are really 2^53-2 different NaN values (2^52-1 nonzero fractions, times 2 for the sign bit). Tamarin cleverly uses only one of those in floating-point computations so it can pack other data types, like 32-bit ints, into the 2^53-3 spare NaNs. A Box starting with the bit pattern 1111111111000 is an int, and the rightmost 32 bits contain the int data. See Box.h.</p>
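<p>The NaN-packing trick is easy to try out in Python. The sketch below uses a made-up layout (top 16 bits <code>0x7FF9</code> as the &#8220;boxed int&#8221; tag, low 32 bits as the payload); that tag and the function names are assumptions for illustration and are not Tamarin&#8217;s actual encoding from Box.h.</p>

```python
import math
import struct

# Hypothetical layout for illustration (NOT Tamarin's actual Box encoding):
# top 16 bits 0x7FF9 mark "boxed int"; the low 32 bits hold the int.
INT_TAG = 0x7FF9 << 48

def _bits(x):
    """Reinterpret a Python float as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def box_int(n):
    """Pack a signed 32-bit int into the payload of a quiet NaN."""
    bits = INT_TAG | (n & 0xFFFFFFFF)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

def is_boxed_int(x):
    return (_bits(x) >> 48) == 0x7FF9

def unbox_int(x):
    low = _bits(x) & 0xFFFFFFFF
    return low - (1 << 32) if low >= (1 << 31) else low   # restore the sign

b = box_int(-7)
```

<p>Here <code>b</code> is an ordinary float as far as the language is concerned (<code>math.isnan(b)</code> holds), yet the int can be recovered from its payload bits.</p>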
<p>In our example, note how the verifier knows the top of the stack now holds an int.</p>
<p>The second bytecode:</p>
<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global int <br />
&nbsp; 4:getglobalscope<br />
+000F OVER<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; frame: global * * global int global </p>
<p>This is kind of interesting. An object-model-type ABC instruction, getglobalscope (pushes scope to use for unqualified name lookups), got translated into a stack manipulation IL instruction, OVER (pushes the value just under top of the stack). Apparently, the global &#8216;this&#8217; scope goes on the stack at the start of a method, and the interpreter records its position. If no other scopes have been entered (no with blocks), then the verifier can just emit an instruction to pick it from its current depth in the stack, since of course the interpreter also always knows the stack size. If there are other scopes in play, then the verifier emits a w_getouterscope instruction, which calls the interpreter C method Interpreter::getscopeobj, which goes through a few levels of indirection but is ultimately pretty simple, grabbing a scope chain for the current method, and then picking off the first scope. Well, I really don&#8217;t know what that means, because I&#8217;m not too clear on scope chains yet, but it&#8217;s only a few lines of code, anyway.</p>
<p>That&#8217;s enough for now.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/05/16/tamarin-tracing-internals-part-i/feed/</wfw:commentRss>
		</item>
		<item>
		<title>ESP: MSR&#8217;s little helper</title>
		<link>http://blog.mozilla.com/dmandelin/2008/04/18/esp-msrs-little-helper/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/04/18/esp-msrs-little-helper/#comments</comments>
		<pubDate>Fri, 18 Apr 2008 22:45:12 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[esp]]></category>

		<category><![CDATA[outparams]]></category>

		<category><![CDATA[treehydra]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/2008/04/18/esp-msrs-little-helper/</guid>
		<description><![CDATA[The Javascript/Treehydra version of the outparam usage checker is finally nearing completion: all that&#8217;s left is packaging it as a patch that can go into mozilla-central (plus the inevitable future debugging). In my last post, I mentioned that the checker is based on ESP, a program analysis technique invented at Microsoft Research. A few people [...]]]></description>
			<content:encoded><![CDATA[<p>The Javascript/<a href="http://wiki.mozilla.org/Treehydra">Treehydra</a> version of the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=420933">outparam usage checker</a> is finally nearing completion: all that&#8217;s left is packaging it as a patch that can go into mozilla-central (plus the inevitable future debugging). In my last post, I mentioned that the checker is based on <a href="http://www.google.com/search?q=ESP%3A+path-sensitive+program+verification+in+polynomial+time&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a">ESP</a>, a program analysis technique invented at Microsoft Research. A few people have asked for a post about ESP (the paper is good, but very dense if you don&#8217;t have a PL research background), so here it is.</p>
<p><strong>Why ESP? </strong><br />
First I should explain why I bothered implementing a new outparam checker design given that I had a working version based on theorem proving. The problem was that the theorem-proving version worked by analyzing &#8220;every&#8221; path in each method. Or it would have worked if it could analyze every path. But a method with N <code>if</code> statements can have 2^N paths, and N gets big enough that Mozilla has a method with 8 million paths. Worse, methods with loops have an infinite number of paths. In practice, path-based analyses have to give up after about 1000 paths, leaving the rest unanalyzed.</p>
<p>In short, path-based analysis is very precise, but lacks coverage of all the code paths. Conversely, the abstract interpretation approach I showed in my previous post does cover all code paths, but it mixes them up so much that it ends up being too imprecise to work at all.</p>
<p>When I saw this problem, I remembered ESP right away, because the whole point of ESP is to get the precision of path-based analysis with the speed and coverage of abstract interpretation. But after reviewing the paper, I couldn&#8217;t really see how to make ESP solve the problems I described before, so I went the theorem proving route. But once I got stuck on the path explosion problem, I went back to it, and eventually it hit me. Now it seems kind of obvious. So, it seems like I should be able to explain ESP and its application to outparams in a way that makes it sound simple, but that turned out to be hard. Hopefully it&#8217;s at least comprehensible.</p>
<p><strong>Abstract Interpretation Redux.</strong><br />
Previously, I tried out abstract interpretation with pen and paper and found that it didn&#8217;t even come close to working for outparams. (Reminder: abstract interpretation means running the code in a special interpreter that (a) tracks finite(-ish) <em>abstract states</em> instead of the standard program state, (b) goes both ways at branches and (c) merges state when control rejoins. This has the effect of running the method on every possible input value and every path in finite time. The price is that the output is abstract states instead of full detail.) Here are the results again (the table on the right shows the abstract state after abstractly interpreting each statement):</p>
<pre>
 1   nsresult SomeMethod(nsIX **out) {      out       rv   tmp   if.temp
 2     nsresult rv = doSomething();      not-written   ?
 3     tmp = rv;                         not-written   ?    ?
 4     if.temp = NS_SUCCEEDED(tmp)       not-written   ?    ?      ?
 5     if (if.temp) {                    not-written   ?    ?    true
 6       out = mValue;                       written   ?    ?    true
 7       return NS_OK;                       written   ?    ?    true
 8     } else {                          not-written   ?    ?    false
 9       return rv;                      not-written   ?    ?    false
10     }
11   }</pre>
<p>These analysis results are too imprecise to check the return on line 9: <code>rv</code> is unknown, so the analysis has to assume that the return value could be success, which is an error because <code>out</code> has not been written at this point. Note that the abstract interpretation <em>never</em> had any information about <code>rv</code>. Clearly, total ignorance about <code>rv</code> just won&#8217;t work, and any algorithm that works <em>must</em> track the relationship between <code>out</code> and <code>rv</code> that is created by line 2.</p>
<p><strong>A Smarter Abstract State Space.</strong><br />
Abstract interpretation can track that relationship, but it needs to use a more complicated abstract state than the one I implicitly used above. The abstract state in my table above is a mapping of variables to abstract values. (Compare with the real program state, which is a mapping of variables to C++ values.) That&#8217;s the simplest and most common abstract state, but there&#8217;s really nothing special about it. An abstract state can be any representation of a set of program states: the game is to choose an abstract state space that is &#8220;fine&#8221; enough to represent the information we need, but no finer, so the abstract states stay small and simple.</p>
<p>We need a state space that can represent facts like &#8220;<code>if.temp</code> is true iff <code>tmp</code> is a success code&#8221;. I can write that more explicitly as, &#8220;<code>if.temp</code> is true and <code>tmp</code> is a success code, <em>or</em> <code>if.temp</code> is false and <code>tmp</code> is a failure code.&#8221; And that looks just like the &#8220;or&#8221; of two mappings of variables to abstract values. So, it looks like we can use an abstract state that&#8217;s just like our original state, except allowing <strong>multiple &#8220;table rows&#8221;</strong>. If we code the abstract interpreter to use multiple rows when it can, the results of abstract interpretation will come out like this (showing the states between the statements so it&#8217;s easier to separate the rows):</p>
<pre>
 1   nsresult SomeMethod(nsIX **out) {      out         rv    tmp   if.temp
                                         not-written
 2     nsresult rv = doSomething();
                                         not-written   succ
                                         not-written   fail
 3     tmp = rv;
                                         not-written   succ  succ
                                         not-written   fail  fail
 4     if.temp = NS_SUCCEEDED(tmp)
                                         not-written   succ  succ    true
                                         not-written   fail  fail    false
 5     if (if.temp) {
                                         not-written   succ  succ    true
 6       out = mValue;
                                             written   succ  succ    true
 7       return NS_OK;
 8     } else {
                                         not-written   fail  fail    false
 9       return rv;
10     }
11   }</pre>
<p>These results are detailed enough to check outparams perfectly!</p>
<p>A few things to note: In abstractly interpreting line 2, we don&#8217;t know the results exactly, but instead of generating a lot of &#8220;unknown&#8221; abstract values, we generate multiple rows, establishing the correlation among results. Now on lines 3 and 4, we have a multiple-row state, so we abstractly interpret the statements on each row independently. Finally, line 5 is a conditional guard, so at that point, we filter out all the rows that don&#8217;t match the guard (because the program wouldn&#8217;t execute this path in those states). Each of these features is another detail that has to be noticed and coded up in the analysis, but they all fit naturally into the framework of interpreting statements on abstract states.</p>
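<p>The three features just listed (row splitting, per-row interpretation, guard filtering) can be sketched in a few lines of Python. This is a toy with names of my own choosing (<code>call_unknown</code>, <code>guard</code>, and so on), not the Treehydra checker; it represents an abstract state as a list of rows, each row a dict from variable name to abstract value.</p>

```python
# Toy multi-row abstract interpreter. All function names are hypothetical.

def call_unknown(state, var):
    """var = doSomething(): split each row into a success and a failure row."""
    return [{**row, var: v} for row in state for v in ("succ", "fail")]

def copy(state, dst, src):
    """dst = src, interpreted independently on each row."""
    return [{**row, dst: row[src]} for row in state]

def ns_succeeded(state, dst, src):
    """dst = NS_SUCCEEDED(src), interpreted independently on each row."""
    return [{**row, dst: row[src] == "succ"} for row in state]

def guard(state, var, taken):
    """Keep only the rows consistent with taking this branch."""
    return [row for row in state if row[var] == taken]

def write(state, var):
    """Record that an outparam has been assigned."""
    return [{**row, var: "written"} for row in state]

s = [{"out": "not-written"}]
s = call_unknown(s, "rv")                  # line 2
s = copy(s, "tmp", "rv")                   # line 3
s = ns_succeeded(s, "if.temp", "tmp")      # line 4
then_state = write(guard(s, "if.temp", True), "out")   # lines 5-6
else_state = guard(s, "if.temp", False)                # line 8
```

<p>In <code>else_state</code> the only surviving row has <code>rv</code> equal to <code>fail</code> and <code>out</code> not written, which is exactly what the check at line 9 needs.</p>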
<p><strong>Path Sensitivity.</strong><br />
This version of the analysis is actually path-sensitive, because if different paths generate different states, those states will be kept as separate rows. Here&#8217;s an example:</p>
<pre>
nsresult OtherMethod(nsIX **out1, nsIX **out2) {
                                        out1          out2         rv    if.temp
                                    not-written   not-written
  nsresult rv = doSomething();
                                    not-written   not-written   success
                                    not-written   not-written   failure
  if.temp = NS_SUCCEEDED(rv);
                                    not-written   not-written   success   true
                                    not-written   not-written   failure   false
  if (if.temp) {
                                    not-written   not-written   success   true
    out1 = mFoo;
                                A:      written   not-written   success   true
  } else {
                                B:  not-written   not-written   failure   false
  }
                                C:  // Join point -- state is union of A and B.
                                        written   not-written   success   true
                                    not-written   not-written   failure   false
  doMoreStuff();
                                        written   not-written   success   true
                                    not-written   not-written   failure   false
  if (if.temp) {
                                        written   not-written   success   true
    out2 = mBar;
                                        written       written   success   true
  } else {
                                    not-written   not-written   failure   false
  }
                                     // Join point
                                        written       written   success   true
                                    not-written   not-written   failure   false
  return rv;
}</pre>
<p>It&#8217;s kind of hard to read, but the key point is that there are two <code>ifs</code> with the same guard, and to analyze the method correctly, we need to know that of the 4 possible paths, only 2 can actually be taken. State C is the important one: after finishing the first <code>if</code>, at the join point we merge the states by simply collecting all the rows. Each path has a different row, and the rows stay separate, so on the second <code>if</code>, the analysis executes the then branch only in the states generated by the first then branch.</p>
<p>This is actually the kind of thing the ESP authors were most concerned with in their paper. It&#8217;s pretty neat but the problems I had look very different, which is why it took me so long to see the connection.</p>
<p>A nice thing about this kind of path sensitivity is that if the state is the same along two branches, the rows will &#8220;rejoin&#8221; at the join point, essentially forgetting that there was a branch (because it didn&#8217;t really matter anyway). It also works with loops.</p>
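<p>The join operation itself is tiny. Here is a minimal Python sketch, assuming the same rows-as-dicts representation (the name <code>join</code> and the sample rows are mine, for illustration only):</p>

```python
def join(a, b):
    """Merge two multi-row states: union of rows, dropping exact duplicates."""
    merged = []
    for row in a + b:
        if row not in merged:
            merged.append(row)
    return merged

# States A and B from the example: the paths differ, so both rows survive.
A = [{"out1": "written", "rv": "success", "if.temp": True}]
B = [{"out1": "not-written", "rv": "failure", "if.temp": False}]
C = join(A, B)

# If both branches produce the same row, the rows "rejoin" into one.
same = [{"x": "known"}]
rejoined = join(same, same)
```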
<p>The problem is that although we don&#8217;t exactly get path explosion anymore, we can get &#8220;row explosion&#8221;: if there are M variables, and each has 2 possible abstract values, we can get 2^M rows in the state. And M can easily get big enough in Mozilla to run out of memory.</p>
<p><strong>ESP.</strong><br />
This is where ESP comes into play. The insight of ESP is that there are some variables you care about a lot (which the ESP authors call <em>property variables</em>), and others you care about only as far as they relate to the property variables (which the ESP authors call <em>execution variables</em>). (For example, in outparams, the property variables are the outparams and any variables whose values can reach a return statement.) So, if there are only a few property variables, and we have a way to track only the property variables path-sensitively, we can be precise on the things we care about without row explosion.</p>
<p>ESP does this very simply: it just takes our multiple-row states and adds a  <strong>primary key</strong>, namely the set of property variables. Thus, property value combinations and relations are always tracked precisely. Execution variables are tracked as one mapping per property value combination, just as in the basic abstract interpretation. Because of primary key uniqueness, if there are K property variables, there can be no more than 2^K rows in a state, so if K is smaller than 10 or so, the states are small enough to analyze in reasonable time.</p>
<p>An ESP analysis looks a lot like our path-sensitive abstract interpretation, except that after each operation, it &#8220;collects&#8221; rows together to maintain the primary key uniqueness property. For example, if P is a property variable and E is an execution variable, and we need to merge this state:</p>
<pre>
    P = true,    E = false
    P = false,   E = false</pre>
<p>with this state:</p>
<pre>
    P = true,    E = true</pre>
<p>we take the union of rows as before to get this:</p>
<pre>
    P = true,    E = false
    P = false,   E = false
    P = true,    E = true</pre>
<p>but then we merge together rows with the same primary key, yielding:</p>
<pre>
    P = true,    E = anything
    P = false,   E = false</pre>
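<p>That merge step can be sketched in Python as follows. This is a toy under my own assumptions (rows as dicts, a function I am calling <code>esp_merge</code>, and the string <code>"anything"</code> standing for the unknown abstract value), not the framework from the paper or from my checker:</p>

```python
ANY = "anything"   # hypothetical marker for "no information"

def esp_merge(rows, property_vars):
    """Collapse rows that agree on every property variable (the primary key)."""
    by_key = {}
    for row in rows:
        key = tuple(row[v] for v in property_vars)
        if key not in by_key:
            by_key[key] = dict(row)
            continue
        kept = by_key[key]
        for var, value in row.items():
            if kept[var] != value:
                kept[var] = ANY    # execution variable disagrees: lose precision
    return list(by_key.values())

state = esp_merge(
    [
        {"P": True, "E": False},
        {"P": False, "E": False},
        {"P": True, "E": True},
    ],
    property_vars=["P"],
)
```

<p>With one boolean property variable the result can never have more than two rows, however many rows go in.</p>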
<p>The significance of ESP for outparams is that all Mozilla methods have only a few outparams and return value variables, so the analysis runs fast no matter how many other &#8220;unimportant&#8221; variables are in the method.</p>
<p><strong>A small tweak.</strong><br />
Actually, that&#8217;s not quite true. GCC generates a temporary variable for each return statement, so if there are 30 return statements, there are 30 temporary variables, and the state can grow to 2^30 rows. That does happen, and it does make the analysis run out of memory.<br />
Fortunately, I was able to fix this with just a small tweak to ESP. The temporary variables are only &#8220;live&#8221; between the point where they are created and where they are copied to another return variable, and their values don&#8217;t matter at all outside that live range. At any given point in the method, only a few temporaries are live. So I can keep the number of property values small by &#8220;demoting&#8221; return values to execution values once they are dead. And demotion is trivial to implement: just set the abstract value to any one value, because we&#8217;ll never read it anyway.</p>
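<p>As a rough illustration of why demotion shrinks the state, here is a toy Python sketch. The function name <code>demote</code>, the placeholder value, and the variable name <code>ret.tmp1</code> are all made up for this example:</p>

```python
def demote(rows, dead_var, placeholder="dead"):
    """Pin a dead variable to one fixed value, then drop duplicate rows."""
    out = []
    for row in rows:
        pinned = {**row, dead_var: placeholder}
        if pinned not in out:
            out.append(pinned)
    return out

rows = [
    {"out": "written", "ret.tmp1": "succ"},
    {"out": "written", "ret.tmp1": "fail"},
    {"out": "not-written", "ret.tmp1": "fail"},
]
after = demote(rows, "ret.tmp1")   # three rows shrink to two
```

<p>Rows that differed only in the dead temporary become identical and collapse, so each dead temporary stops contributing a factor of 2 to the row count.</p>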
<p>The whole outparam analysis came out to about 2500 lines of Javascript, but a lot of that was adapter code to simplify the Treehydra API, plus subsidiary analyses to find return value variables and their live ranges. The ESP framework was 450 lines, and the outparam abstract interpreter was another 800 lines. It runs in reasonable time too, without any effort optimizing it yet. I haven&#8217;t measured it exactly, but I think it&#8217;s less than 20 minutes on 1970 C++ files of Mozilla on a 4-processor machine. I guess you wouldn&#8217;t want to run it on every build, but if you&#8217;re only changing a few .cpp files, it shouldn&#8217;t be too bad.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/04/18/esp-msrs-little-helper/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Making Treehydra do useful tricks</title>
		<link>http://blog.mozilla.com/dmandelin/2008/04/01/making-treehydra-do-useful-tricks/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/04/01/making-treehydra-do-useful-tricks/#comments</comments>
		<pubDate>Wed, 02 Apr 2008 02:41:51 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[outparams]]></category>

		<category><![CDATA[treehydra]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/2008/04/01/making-treehydra-do-useful-tricks/</guid>
		<description><![CDATA[Taras&#8217; last blog post ended with a comment about &#8220;making [Treehydra] do useful tricks&#8221;, which oddly enough, is exactly what I&#8217;ve been working on, and I&#8217;ve finally made enough progress to blog about it. I&#8217;ve been alternating between implementing a Treehydra Javascript analysis library and adding needed features to Treehydra.
Just today, I managed to do [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://blog.mozilla.com/tglek/2008/03/17/dehydra-world-tour/">Taras&#8217; last blog post</a> ended with a comment about &#8220;making [Treehydra] do useful tricks&#8221;, which oddly enough, is exactly what I&#8217;ve been working on, and I&#8217;ve finally made enough progress to blog about it. I&#8217;ve been alternating between implementing a <a href="http://wiki.mozilla.org/Treehydra">Treehydra</a> Javascript analysis library and <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=423896">adding</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=425034">needed</a> <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=425794">feat</a><a href="https://bugzilla.mozilla.org/show_bug.cgi?id=425846">ures</a> to Treehydra.</p>
<p>Just today, I managed to do an intraprocedural live variable analysis, which is one of the simplest program analyses, on every file in mozilla-central. (Live variable analysis determines the set of variables that may be read in the future at every point in a function. It&#8217;s commonly used in optimization to save storage for unused variables, but I use it to make checkers &#8220;forget&#8221; information about unused variables.) <a href="http://people.mozilla.com/~dmandelin/live_main.svg">Here&#8217;s a visualization</a> of the results for Firefox&#8217;s main() function in a Linux build: the set of live variables is listed at the bottom of each basic block.</p>
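<p>For readers who haven&#8217;t seen it, the textbook version of liveness analysis fits in a short Python function. This is a generic sketch of the standard backward data-flow algorithm, not my Treehydra library; the block names and the tiny control-flow graph are invented for the example, and it reports the variables live at the <em>entry</em> of each block.</p>

```python
def liveness(blocks, succs):
    """Backward data-flow: return {block: set of variables live at block entry}.

    blocks maps a block name to (use, define) sets, where use holds variables
    read before any write in the block; succs maps a block to its successors.
    """
    live_in = {b: set() for b in blocks}
    changed = True
    while changed:                       # iterate until nothing changes
        changed = False
        for b, (use, define) in blocks.items():
            live_out = set()
            for s in succs.get(b, []):
                live_out |= live_in[s]
            new_in = use | (live_out - define)
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in

# entry: x = 1; y = 2    loop: x = x + y; maybe repeat    exit: return x
blocks = {
    "entry": (set(), {"x", "y"}),
    "loop": ({"x", "y"}, {"x"}),
    "exit": ({"x"}, set()),
}
succs = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
live = liveness(blocks, succs)
```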
<p>It took 25-30 minutes to run on all of Mozilla (as preprocessed C++), but I know a lot of that is simply GCC compile time, and I think a fair fraction of the rest was spent generating the visualizations, which most analyses won&#8217;t do. I guess I need to investigate how to time JS execution internally.</p>
<p>My Javascript analysis library is about 900 lines of code, with modules for Treehydra utilities, GCC data access, GCC value printing, data structures needed for analysis, and backward data-flow analysis. I hope these will be reused for other analyses&#8211;there are fewer than 100 lines of code specific to liveness analysis. <a href="http://hg.mozilla.org/users/dmandelin_mozilla.com/treehydra-analysis/">The code is available here.</a></p>
<p>The next step will be to finish the <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=420933">outparam analysis</a>. Hopefully, it won&#8217;t be too hard. The big pieces are:</p>
<ul>
<li>An analysis to determine which variables may reach the return statement of the function (the technique is similar to the liveness analysis).</li>
<li>Port over my <a href="http://www.google.com/search?q=ESP%3A+path-sensitive+program+verification+in+polynomial+time&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=org.mozilla:en-US:official&amp;client=firefox-a">ESP</a> analysis framework from Python.</li>
<li>Implement the outparam checker in the ESP framework.</li>
</ul>
<p>I prototyped all of it in Python, so I know the algorithms work, and I&#8217;ve ported much of it over to Treehydra/JS for the liveness demo, so I know it codes up nicely there as well. I&#8217;m sure there will be glitches to fix, and I&#8217;m sure I made some mistakes in designing my Javascript framework, but I&#8217;ll just have to see how it goes.</p>
<p>Finally, I have to mention that I&#8217;ve upgraded my Javascript skills quite a bit in the process of doing this (it&#8217;s the most complex JS program I&#8217;ve written, and I&#8217;ve also been using <a href="http://developer.mozilla.org/en/docs/JSAPI_Reference">JSAPI</a>), and it&#8217;s all thanks to the <a href="http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide">MDC Javascript Guide</a>, which has been an excellent resource.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/04/01/making-treehydra-do-useful-tricks/feed/</wfw:commentRss>
		</item>
		<item>
		<title>I need a theorem prover!</title>
		<link>http://blog.mozilla.com/dmandelin/2008/03/10/i-need-a-theorem-prover/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/03/10/i-need-a-theorem-prover/#comments</comments>
		<pubDate>Tue, 11 Mar 2008 00:05:41 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/2008/03/10/i-need-a-theorem-prover/</guid>
		<description><![CDATA[Time for a hardcore static analysis post. By the way, if anyone reading this knows about theorem provers, I could really use your help: what&#8217;s a solid off-the-shelf prover that will solve my formulas (see below)?
I&#8217;m working on bug 420933, which is a request for a static checker for XPCOM outparam usage. In short, XPCOM [...]]]></description>
			<content:encoded><![CDATA[<p>Time for a hardcore static analysis post. By the way, if anyone reading this knows about theorem provers, I could really use your help: what&#8217;s a solid off-the-shelf prover that will solve my formulas (see below)?</p>
<p>I&#8217;m working on <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=420933">bug 420933</a>, which is a request for a static checker for XPCOM outparam usage. In short, XPCOM methods with outparams must (a) not set outparams when they return a failure code, and (b) set all outparams when they return a success code. At first, checking this sounded easy, because I was thinking of code like this:</p>
<pre>
nsresult SomeMethod(nsIWhatever **out) {
  nsresult rv = doSomething();
  if (NS_SUCCEEDED(rv)) {
    out = mValue;
    return NS_OK;
  } else {
    return rv;
  }
}
</pre>
<p><strong>Attempt #1.</strong> We can check this method with a simple static analysis algorithm. The first step is to trace through all the code paths, recording (i) any outparam that gets assigned to, and (ii) (an abstract approximation of) values of variables that can be returned from the method. In this case, the results are:</p>
<pre>
nsresult SomeMethod(nsIWhatever **out) {
  nsresult rv = doSomething();   // rv = ?,       out = NOT_WRITTEN
  if (NS_SUCCEEDED(rv)) {
                                 // rv = SUCCESS, out = NOT_WRITTEN
    out = mValue;                // rv = SUCCESS, out = WRITTEN
    return NS_OK;                // rv = SUCCESS, out = WRITTEN
  } else {
                                 // rv = FAILURE, out = NOT_WRITTEN
    return rv;                   // rv = FAILURE, out = NOT_WRITTEN
  }
}
</pre>
<p>Note that on each branch of the <code>if</code> statement the analysis sets a property according to whether the true or false branch was taken. Some analyses don&#8217;t do this, but in our case we know that <code>return rv</code> returns a failure code only by taking account of the branch condition.</p>
<p>The second step is to check the requirement at each return statement using the results from step 1. So, for <code>return rv</code>, first we read off <code>rv = FAILURE</code>, which tells us to check property (a), that no outparams are set. Then we see <code>out = NOT_WRITTEN</code>, so the check succeeds.</p>
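<p>The check at a return statement can be written down directly from rules (a) and (b). Here is a minimal Python sketch, assuming the abstract values are the strings used in the comments above; the function name <code>check_return</code> and the message texts are mine:</p>

```python
def check_return(ret_value, outparams):
    """Return a list of protocol violations at one return statement.

    ret_value is "SUCCESS", "FAILURE", or "?" (unknown); outparams maps each
    outparam name to "WRITTEN" or "NOT_WRITTEN".
    """
    errors = []
    for name, status in outparams.items():
        # Rule (a): no outparam may be set when a failure code is returned.
        if ret_value in ("FAILURE", "?") and status == "WRITTEN":
            errors.append(name + " written on possible failure")
        # Rule (b): every outparam must be set when a success code is returned.
        if ret_value in ("SUCCESS", "?") and status == "NOT_WRITTEN":
            errors.append(name + " not written on possible success")
    return errors

ok_failure = check_return("FAILURE", {"out": "NOT_WRITTEN"})
ok_success = check_return("SUCCESS", {"out": "WRITTEN"})
unknown = check_return("?", {"out": "NOT_WRITTEN"})
```

<p>Note the conservative treatment of an unknown return value: the sketch has to assume both success and failure are possible, so it reports a warning whenever either rule could be violated.</p>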
<p>I liked this design because it was fairly simple, and also because it can be implemented as a standard flow-sensitive abstract interpretation, which is pretty efficient. (When I say &#8220;standard flow-sensitive abstract interpretation&#8221; I mean the design kind of looks like a combination of two standard compiler passes: the constant propagation optimization (for the return values) and the unassigned variables check (for the outparams).)</p>
<p><strong>&#8220;For any complex problem, there is always a solution that is<br />
simple, clear, and wrong.&#8221;</strong> Unfortunately, attempt #1 doesn&#8217;t work, for two reasons. First, GCC adds stuff to the code before the analysis sees it, so the method above looks more like this:</p>
<pre>
nsresult SomeMethod(nsIWhatever **out) {
  nsresult rv;
  rv = doSomething();            // rv = ?
  tmp = rv;                      // rv = ?, tmp = ?
  if.temp = NS_SUCCEEDED(tmp);   // rv = ?, tmp = ?, if.temp = ?
  if (if.temp) {
                                 // rv = ?, tmp = ?, if.temp = SUCCESS
    out = mValue;                // rv = ?, tmp = ?, if.temp = SUCCESS
    return NS_OK;                // rv = ?, tmp = ?, if.temp = SUCCESS
  } else {
                                 // rv = ?, tmp = ?, if.temp = FAILURE
    return rv;                   // rv = ?, tmp = ?, if.temp = FAILURE
  }
}
</pre>
<p>The analysis tracks <code>if.temp</code> through the if branches, but fails to pick up any information about <code>rv</code>. So when we reach <code>return rv</code>, we don&#8217;t know whether to check (a) or (b), and we can&#8217;t check the method; we can only issue a (spurious) warning. To make this work, we need to track not so much the values of return variables as their relationships. We can represent the relationships as logical formulas:</p>
<pre>
nsresult SomeMethod(nsIWhatever **out) {
  nsresult rv;
  rv = doSomething();            // (empty formula)
  tmp = rv;                      // tmp == rv
  if.temp = NS_SUCCEEDED(tmp);   // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp)
  if (if.temp) {
                      // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp), if.temp
    out = mValue;     // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp), if.temp
    return NS_OK;     // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp), if.temp
  } else {
                      // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp), not if.temp
    return rv;        // tmp == rv, if.temp &lt;=&gt; SUCCEEDED(tmp), not if.temp
  }
}
</pre>
<p>Now, when we reach <code>return rv</code>, we have enough information to figure out that <code>rv</code> is a failure code, assuming we have a theorem prover that can reason like this:</p>
<pre>
if.temp &lt;=&gt; SUCCEEDED(tmp)               ===&gt;  not if.temp &lt;=&gt; FAILED(tmp)
not if.temp, not if.temp =&gt; FAILED(tmp)  ===&gt;  FAILED(tmp)
FAILED(tmp), tmp == rv                   ===&gt;  FAILED(rv)
</pre>
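<p>To make that chain of reasoning concrete, here is a minimal brute-force sketch in C++. It is only one possible way to answer the question, not necessarily what the analysis does: it tries every truth assignment, throws away the ones that contradict the known facts, and checks whether <code>rv</code> fails in all that remain. The function name <code>rvMustBeFailure</code> and its parameter are hypothetical, and <code>FAILED(x)</code> is modeled simply as the negation of <code>SUCCEEDED(x)</code>.</p>
<pre>
// Illustrative sketch: brute-force check of whether rv must be a failure code.
// FAILED(x) is modeled as "not SUCCEEDED(x)"; ifTempKnown is the branch we are in.
bool rvMustBeFailure(bool ifTempKnown) {
  for (int ifTemp = 0; ifTemp != 2; ++ifTemp)
    for (int succTmp = 0; succTmp != 2; ++succTmp)
      for (int succRv = 0; succRv != 2; ++succRv) {
        if ((ifTemp != 0) != ifTempKnown) continue;  // branch condition
        if (ifTemp != succTmp) continue;             // if.temp iff SUCCEEDED(tmp)
        if (succTmp != succRv) continue;             // tmp == rv
        if (succRv != 0) return false;  // consistent world where rv succeeded
      }
  return true;  // every consistent world has FAILED(rv)
}
</pre>
<p>In the else branch, <code>rvMustBeFailure(false)</code> returns <code>true</code>, matching the derivation above. In the then branch, <code>rvMustBeFailure(true)</code> returns <code>false</code>.</p>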
<p>I said there was a second reason the simple analysis doesn&#8217;t work, which is code like this:</p>
<pre>
nsresult AnotherMethod(nsIFoo **out) {
  nsresult rv = mBar.DelegateMethod(out); // rv = ?, out = ?
  return rv;
}
</pre>
<p>The analysis should not consider this an error: it should assume that <code>DelegateMethod</code> follows the outparam protocol correctly, which means that <code>rv</code> is a success code iff <code>out</code> was written, and the requirement is satisfied. But the basic analysis gets no information about either <code>rv</code> or <code>out</code>. Again, we need to track relationships. This time, we want to model the method call using the formula <code>SUCCEEDED(rv) &lt;=&gt; WRITTEN(out)</code>. When we reach <code>return rv</code>, we will see that we don&#8217;t have a definite value for <code>rv</code>. So we check both properties (a) and (b). First, we ask the theorem prover which outparams are written under the assumption <code>SUCCEEDED(rv)</code> and check (a). Then we do it again with <code>FAILED(rv)</code> and check (b).</p>
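<p>The two-assumption check can be sketched in C++ along these lines. This is a hypothetical illustration with invented names (<code>Written</code>, <code>outUnderAssumption</code>, <code>delegateReturnIsOk</code>). It stands in for the theorem prover with plain enumeration of truth values, and it assumes (a) requires the outparam to be written on success and (b) requires it to be unwritten on failure.</p>
<pre>
// Illustrative sketch only; names and structure are assumptions.
enum class Written { No, Yes, Maybe };

// What is known about out if we assume rv succeeded (or failed)?
// callFollowsProtocol adds the fact SUCCEEDED(rv) iff WRITTEN(out).
Written outUnderAssumption(bool assumeSucceeded, bool callFollowsProtocol) {
  bool sawWritten = false;
  bool sawUnwritten = false;
  for (int succRv = 0; succRv != 2; ++succRv)
    for (int written = 0; written != 2; ++written) {
      if ((succRv != 0) != assumeSucceeded) continue;  // the assumption
      if (callFollowsProtocol) {
        if (succRv != written) continue;               // the call formula
      }
      if (written != 0) sawWritten = true; else sawUnwritten = true;
    }
  if (sawWritten) {
    if (sawUnwritten) return Written::Maybe;
    return Written::Yes;
  }
  return Written::No;
}

// Check both properties at return rv when rv has no definite value.
bool delegateReturnIsOk(bool callFollowsProtocol) {
  if (outUnderAssumption(true, callFollowsProtocol) != Written::Yes)
    return false;                                                   // (a)
  return outUnderAssumption(false, callFollowsProtocol) == Written::No;  // (b)
}
</pre>
<p>With the call formula in place, <code>delegateReturnIsOk(true)</code> returns <code>true</code>. Without it, which corresponds to the basic analysis, <code>delegateReturnIsOk(false)</code> returns <code>false</code>, because <code>out</code> is only <code>Written::Maybe</code> under either assumption.</p>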
<p><strong>Dumbest Theorem Prover Evar.</strong> At this point, our problem is pretty much solved (minus inordinate effort parsing XPIDL du