<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>David Mandelin&#039;s blog &#187; SquirrelFish</title>
	<atom:link href="http://blog.mozilla.com/dmandelin/category/squirrelfish/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.mozilla.com/dmandelin</link>
	<description>Just another Blog.mozilla.com weblog</description>
	<lastBuildDate>Wed, 25 Jan 2012 18:02:04 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Squirrelfishing regexp-dna.js</title>
		<link>http://blog.mozilla.com/dmandelin/2008/10/06/squirrelfishing-in-regexp-dnajs/</link>
		<comments>http://blog.mozilla.com/dmandelin/2008/10/06/squirrelfishing-in-regexp-dnajs/#comments</comments>
		<pubDate>Mon, 06 Oct 2008 21:56:38 +0000</pubDate>
		<dc:creator>dmandelin</dc:creator>
				<category><![CDATA[SquirrelFish]]></category>
		<category><![CDATA[js]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[sfx]]></category>
		<category><![CDATA[spidermonkey]]></category>
		<category><![CDATA[string]]></category>

		<guid isPermaLink="false">http://blog.mozilla.com/dmandelin/?p=29</guid>
		<description><![CDATA[Recently I&#8217;ve been trying to figure out exactly how SquirrelFish Extreme (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, WREC (WebKit Regular Expression Compiler) provides most of that 5x, but there are some weird twists to the story. My main conclusions: WREC indeed makes for a good [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I&#8217;ve been trying to figure out exactly how <a href="http://webkit.org/blog/214/introducing-squirrelfish-extreme/">SquirrelFish Extreme</a> (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, WREC (WebKit Regular Expression Compiler) provides most of that 5x, but there are some weird twists to the story. My main conclusions: <strong>WREC indeed makes for a good regex engine</strong>, but also, <strong>WREC bails out of any regex with parentheses</strong>, and <strong>regexp-dna.js contains a bug that favors SFX over SpiderMonkey</strong>.</p>
<p><strong>Update:</strong> The <a href="https://bugs.webkit.org/show_bug.cgi?id=18989">bug in regexp-dna.js</a> has been fixed in WebKit! Thanks, WebKit team! That was fast. The new code:</p>
<p>for(k in subs)<br />
dnaInput = dnaInput.replace(k, subs[k])<br />
// FIXME: Would like this to be a global substitution in a future version of SunSpider.</p>
<p><strong>Technical details: the design of WREC.</strong> There are two main ways to implement regular expressions: with a backtracking matching engine, or by <a href="http://swtch.com/~rsc/regexp/regexp1.html">transforming the regex to a finite automaton</a> (NFA, aka &#8220;state machine&#8221;), which does not backtrack. Most Perl-type regex engines, including both SpiderMonkey&#8217;s and WREC, follow the backtracking design. I don&#8217;t know the exact history of that choice, but at present it is much easier to implement features like group capture and backreferences in the backtracking design. Also, although <a href="http://swtch.com/~rsc/regexp/regexp1.html">some regexes scale only if implemented as NFAs</a>, my tests suggest that many simple regexes, including those in SunSpider, are faster with backtracking.</p>
<p>As of this writing, WREC&#8217;s implementation strategy is dirt simple (which is a good thing). There are no transformations or fancy optimizations on the regex. WREC simply generates native code that directly implements the backtracking search. Thus, within a single match operation, there are no function calls, no traversals of regular expression ASTs, and few option tests, so almost all of the overhead is eliminated.</p>
<p>WREC&#8217;s code is very easy to read, so if you want to know exactly how it works, just read it in WREC.cpp. It&#8217;s also great example code for anyone implementing a compiler for a simple language like regular expressions. The basic plan is to parse the regular expression with functions named things like parseDisjunction (the | operator). Those functions directly call functions like generateDisjunction that generate the native code using the same assembler that the call-threading interpreter uses. There&#8217;s also the oddly named &#8220;gererateParenthesesResetTrampoline&#8221;. Inexplicably preserved typo, or watermark to detect copying of WREC code?</p>
<p>As an example, for the regex /a|bb/, the generated native code looks something like this:</p>
<p>curPos = start<br />
if curPos &gt;= textLength goto CASE2<br />
if text[curPos] != &#8216;a&#8217; goto CASE2<br />
curPos += 1<br />
goto DONE_MATCHED<br />
CASE2:<br />
curPos = start<br />
if curPos &gt;= textLength goto DONE_FAIL<br />
if text[curPos] != &#8216;b&#8217; goto DONE_FAIL<br />
curPos += 1<br />
if curPos &gt;= textLength goto DONE_FAIL<br />
if text[curPos] != &#8216;b&#8217; goto DONE_FAIL<br />
curPos += 1<br />
goto DONE_MATCHED</p>
<p>As you can see, the generated code does almost nothing beyond what is absolutely needed to implement the matching correctly. Also, curPos, textLength, and the start address of text are all kept in fixed registers. Based on a variety of microbenchmarks, I suspect that the critical path for this code is simply the unavoidable work of reading all the characters in the text, so the code is near maximum efficiency.</p>
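<p>To make the shape of that generated code concrete, here is a minimal JavaScript sketch of the same straight-line logic for /a|bb/. It is a hand-written analogue, not WREC output, and the function name matchAOrBB is made up for illustration. It returns the end position of a match that begins at start, or -1 if neither alternative matches there.</p>

```javascript
// Hand-written JavaScript analogue of the straight-line code for /a|bb/.
// matchAOrBB is a hypothetical name, not something from WREC.
// Returns the end position of the match starting at `start`, or -1.
function matchAOrBB(text, start) {
  // Alternative 1: 'a'. Indexing past the end of a string yields undefined,
  // which never equals a character, so the comparison also covers the
  // bounds check from the pseudocode.
  if (text[start] === 'a') {
    return start + 1; // DONE_MATCHED
  }
  // Alternative 2 (CASE2): start over from `start` and try 'b' then 'b'.
  if (text[start] === 'b') {
    if (text[start + 1] === 'b') {
      return start + 2; // DONE_MATCHED
    }
  }
  return -1; // DONE_FAIL
}

console.log(matchAOrBB('bba', 0)); // 2
console.log(matchAOrBB('ba', 0));  // -1
```

<p>There is no regex data structure left at run time: each character test is its own branch, which is the property that lets the native version keep everything in registers.</p>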
<p><strong>Backtracking in WREC. </strong>One tricky part of a backtracking regex engine is the backtracking. Consider matching the regex /a*a/ against the text &#8216;a&#8217;. First, the engine will greedily match /a*/ with &#8216;a&#8217;, leaving a remaining regex fragment of /a/ and a remaining text of the empty string. When the engine tries to match /a/ against &#8216;&#8217;, it fails. Now it must backtrack, specifically by going back to the /a*/ step and finding the next longest match. The next longest match is &#8216;&#8217;, leaving a remaining regex fragment of /a/ and a remaining text of &#8216;a&#8217;, and the engine will finish with a successful match.</p>
<p>I was curious how WREC implemented backtracking. For example, with /a*a/ and a text of a million &#8216;a&#8217;s, I wanted to know if WREC saves all one million shorter matches in case it needs to backtrack through them (which seems like a waste of space), or if WREC recomputes them if needed (which could be slow if a lot of backtracking needs to be done). I found that WREC generates code like this for a quantified regex a* (I reordered things to make it easier to read, but the logic is the same):</p>
<p>// First, greedy match<br />
repeatCount = 0<br />
while (next char is &#8216;a&#8217;) {<br />
curPos += 1<br />
repeatCount += 1<br />
}<br />
while (true) {<br />
// Now, try to match the rest<br />
save curPos<br />
&#8230; code to match the rest of the regex &#8230;<br />
if (rest of regex matches) goto DONE_MATCHED<br />
<strong> // backtrack<br />
restore curPos<br />
if repeatCount &lt;= 0 goto DONE_FAILED<br />
curPos -= 1<br />
repeatCount -= 1<br />
// loop around and try to match the rest from here</strong><br />
}</p>
<p>So the answer to my question is that WREC doesn&#8217;t save the intermediate matches, it just keeps track of the length of the text matched against /a*/ (repeatCount in my code) and decrements the position and count in order to backtrack. This is great, because it uses only two words of memory for state (instead of N million) but can find the next longest match very quickly (just 4 instructions).</p>
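<p>The same scheme can be sketched in JavaScript for the /a*a/ example. This is an illustration under my own assumptions, not WREC code: matchAStarA is a made-up name, the match is anchored at start, and the &#8220;rest of the regex&#8221; is just the single trailing &#8216;a&#8217;. The only state kept is the current position and the repeat count, and backtracking stops once the count reaches zero.</p>

```javascript
// Sketch of count-based backtracking for /a*a/, anchored at `start`.
// matchAStarA is a hypothetical name, not a WREC function.
// Returns the end position of the match, or -1 if there is none.
function matchAStarA(text, start) {
  let curPos = start;
  let repeatCount = 0;
  // Greedy phase: let a* consume as many 'a' characters as it can.
  while (text[curPos] === 'a') {
    curPos += 1;
    repeatCount += 1;
  }
  while (true) {
    // Try to match the rest of the regex (one 'a') from curPos.
    if (text[curPos] === 'a') {
      return curPos + 1; // DONE_MATCHED
    }
    // Backtrack: a* gives one character back, if it has any left.
    if (repeatCount === 0) {
      return -1; // DONE_FAILED
    }
    curPos -= 1;
    repeatCount -= 1;
  }
}

console.log(matchAStarA('a', 0));    // 1
console.log(matchAStarA('aaab', 0)); // 3
console.log(matchAStarA('b', 0));    // -1
```

<p>Each backtracking step is two decrements and a retry, no matter how many characters the greedy phase consumed.</p>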
<p>But this left me wondering, how does WREC handle something like /(a|bb)*x/, where backtracking needs to go back a variable number of characters? Here&#8217;s the code that replaces &#8216;repeatCount-1&#8217; with the backtracking logic needed here:</p>
<p><strong>void GenerateParenthesesNonGreedyFunctor::backtrack(WRECGenerator*) {<br />
// FIXME: do something about this.<br />
CRASH();<br />
}</strong></p>
<p>Interesting. This function isn&#8217;t called anywhere, so it&#8217;s not a crashing bug in WebKit. The reason it&#8217;s not called:</p>
<p><strong>bool WRECParser::parseParentheses(JmpSrcVector&amp;) {<br />
// FIXME: We don&#8217;t currently backtrack correctly within parentheses in cases such as<br />
// &#8220;c&#8221;.match(/(.*)c/) so we fall back to PCRE for any regexp containing parentheses.<br />
m_err = TempError_unsupportedParentheses;<br />
return false;<br />
}<br />
</strong></p>
<p>Very interesting. And on my machine, regexp-dna.js runs in about 43 ms (in jsc, the SFX command-line shell), but if I add an empty pair of parens to each regex in the test, it takes 206 ms. Breaking out the time for regexp-dna.js regex matching only (i.e., not including the string building and replace calls), with the parens SFX is 15% slower than SpiderMonkey, which is actually still pretty good.</p>
<p>It just so happens that SunSpider&#8217;s regexp-dna.js doesn&#8217;t use parens in any of its regexes.</p>
<p>But the web does, of course. To find out how common parens are, I instrumented a Firefox build to record regexes being run on web pages and opened some popular pages. Firefox processed 830 unique regexes, of which 287, or just about 1/3, contained parens.</p>
<p><strong>String.replace.</strong> The other major portion of regexp-dna.js tests String.replace. Again, SFX is about 5x as fast as SpiderMonkey and I wanted to know why. The calls to replace in regexp-dna all look like this:</p>
<p>dnaInput.replace(&#8216;B&#8217;, &#8216;(c|g|t)&#8217;, &#8216;g&#8217;)</p>
<p>This is supposed to replace all occurrences of &#8216;B&#8217; with &#8216;(c|g|t)&#8217;. The search string is just &#8216;B&#8217;, so no regex processing is needed, just simple string searching. So I started reading WebKit code in StringPrototype.cpp. I saw that there is a special case fast-path for a non-regex search string, which makes sense. In fact, it&#8217;s so special and so fast that I couldn&#8217;t even find any code to implement the &#8216;$&amp;&#8217; replacement string syntax required by the ECMAScript standard. Testing SFX:</p>
<p>&gt; &#8216;aba&#8217;.replace(/a/, &#8216;x$&amp;x&#8217;)<br />
xaxba // good: $&amp; is the matched text<br />
&gt; &#8216;aba&#8217;.replace(&#8216;a&#8217;, &#8216;x$&amp;x&#8217;)<br />
x$&amp;xba // oops</p>
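<p>For comparison, here is a small check of what the standard asks for. As far as I read the ECMAScript specification, the $&amp; sequence in a replacement string expands to the matched text whether the search pattern is a regex or a plain string, so a conforming engine should give the same answer for both calls. The variable names are only for illustration.</p>

```javascript
// $& in the replacement string stands for the matched text. Per the
// ECMAScript spec this applies to string patterns as well as regexes.
const withRegex = 'aba'.replace(/a/, 'x$&x');
const withString = 'aba'.replace('a', 'x$&x');

console.log(withRegex);  // xaxba
console.log(withString); // xaxba on a conforming engine
```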
<p>I also couldn&#8217;t find out how the &#8216;g&#8217; flag was implemented, or even any code to read the third &#8220;flags&#8221; argument at all. Testing SFX again:</p>
<p>&gt; &#8216;aba&#8217;.replace(&#8216;a&#8217;, &#8216;x&#8217;, &#8216;g&#8217;)<br />
xba</p>
<p>Hmmm. So the &#8216;g&#8217; flag is not implemented at all. That&#8217;s OK, it&#8217;s not part of the ECMAScript standard, it&#8217;s just a SpiderMonkey extension. But that means the benchmark contains a flag that just serves to make SpiderMonkey do a lot more work than SquirrelFish. With that flag removed, so that both programs are doing the same replacement operation, the performance difference goes from 5x to 1.5x. Hopefully SpiderMonkey will have a <a href="https://bugzilla.mozilla.org/show_bug.cgi?id=432525">similar fast path</a> soon, which should bring that to parity.</p>
<p>But I think it&#8217;s pretty clear regexp-dna.js has a bug. Either the &#8216;g&#8217; should be removed, or it should be recoded to do the global replacement in vanilla ECMAScript.</p>
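<p>As a sketch of the second option, here are two ways to do the global replacement using only standard ECMAScript. The helper name replaceAllLiteral and the short input string are assumptions for illustration; the real benchmark input is much larger.</p>

```javascript
// Replace every occurrence of a literal string using only standard
// ECMAScript. replaceAllLiteral is a hypothetical helper name.
function replaceAllLiteral(text, search, replacement) {
  // split/join treats both strings literally: no regex syntax and no
  // special handling of $ sequences in the replacement.
  return text.split(search).join(replacement);
}

const dnaInput = 'aBcBd'; // stand-in for the benchmark input

// Option 1: split and join.
console.log(replaceAllLiteral(dnaInput, 'B', '(c|g|t)')); // a(c|g|t)c(c|g|t)d

// Option 2: a regex literal carrying the standard g flag.
console.log(dnaInput.replace(/B/g, '(c|g|t)')); // a(c|g|t)c(c|g|t)d

// A string pattern with no flags replaces only the first occurrence.
console.log(dnaInput.replace('B', '(c|g|t)')); // a(c|g|t)cBd
```

<p>Either form makes every engine do the same amount of work, which is the point of a benchmark.</p>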
<p><strong>Final words.</strong> Based on its performance on the regexes it does handle, WREC is indeed an awesome design. regexp-dna.js, however, is flawed and exaggerates SFX performance.</p>
<p>We could use nanojit to make a regex compiler for SpiderMonkey that would perform as well as WREC. But I don&#8217;t know if it&#8217;s worthwhile yet. Regex performance is much less important for today&#8217;s web than it is for SunSpider. I hope to link to a report on that in a future post.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.mozilla.com/dmandelin/2008/10/06/squirrelfishing-in-regexp-dnajs/feed/</wfw:commentRss>
		<slash:comments>46</slash:comments>
		</item>
	</channel>
</rss>

