Main menu:

Site search

Categories

Archive

Squirrelfishing regexp-dna.js

Recently I’ve been trying to figure out exactly how SquirrelFish Extreme (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, (WebKit Regular Expression Compiler) WREC provides most of that 5x, but there are some weird twists to the story. My main conclusions: WREC indeed makes for a good regex engine, but also, WREC bails out of any regex with parentheses, and regexp-dna.js contains a bug that favors SFX over SpiderMonkey.

Update: The bug in regexp-dna.js has been fixed in WebKit! Thanks, WebKit team! That was fast. The new code:

for(k in subs)
dnaInput = dnaInput.replace(k, subs[k])
// FIXME: Would like this to be a global substitution in a future version of SunSpider.

Technical details: the design of WREC. There are two main ways to implement regular expressions: using a backtracking matching engine, or by transforming the regex to a finite automaton (NFA, aka “state machine”), which does not backtrack. Most Perl-type regex engines, including both SpiderMonkey’s and WREC, follow the backtracking design. I don’t know the exact history of that choice, but at present it is much easier to implement features like group capture and backreferences in the backtracking design. Also, although some regexes scale only if implemented as NFAs, my tests suggest that many simple regexes, including those in SunSpider, are faster with backtracking.

As of this writing, WREC’s implementation strategy is dirt simple (which is a good thing). There are no transformations or fancy optimizations on the regex. WREC simply generates native code that directly implements the backtracking search. Thus, within a single match operation, there are no function calls, no traversals of regular expression ASTs, and few option tests, so almost all of the overhead is eliminated.

WREC’s code is very easy to read, so if you want to know exactly how it works, just read it in WREC.cpp. It’s also great example code for anyone implementing a compiler for a simple language like regular expressions. The basic plan is to parse the regular expression with functions named things like parseDisjunction (the | operator). Those functions directly call functions like generateDisjunction that generate the native code using the same assembler that the call-threading interpreter uses. There’s also the oddly named “gererateParenthesesResetTrampoline”. Inexplicably preserved typo, or watermark to detect copying of WREC code?

An an example, for the regex /a|bb/, the generated native code looks something like this:

curPos = start
if curPos >= textLength goto CASE2
if text[curPos] != ‘a’ goto CASE2
curPos += 1
goto DONE_MATCHED
CASE2:
curPos = start
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
goto DONE_MATCHED

As you can see, the generated code does almost nothing beyond what is absolutely needed to implement the matching correctly. Also, curPos, textLength, and the start address of text are all kept in fixed registers. Based on a variety of microbenchmarks, I suspect that the critical path for this code is simply the unavoidable work of reading all the characters in the text, so the code is near maximum efficiency.

Backtracking in WREC. One tricky part of a backtracking regex engine is the backtracking. Consider matching the regex /a*a/ against the text ‘a’. First, the engine will greedily match /a*/ with ‘a’, leaving a remaining regex fragment of /a/ and a remaining text of the empty string. When the engine tries to match /a/ against ”, it fails. Now it must backtrack, specifically by going back to the /a*/ step and finding the next longest match. The next longest match is ”, leaving a remaining regex fragment of /a/ and a remaining text of ‘a’, and the engine will finish with a successful match.

I was curious how WREC implemented backtracking. For example, with /a*a/ and a text of a million ‘a’s, I wanted to know if WREC saves all one million shorter matches in case it needed to backtrack them (which seems like a waste of space), or if WREC recomputes them if needed (which could be slow if a lot of backtracking needs to be done). I found that WREC generates code like this for a quantified regex a* (I reordered things to make it easier to read, but the logic is the same):

// First, greedy match
repeatCount = 0
while (next char is ‘a’) {
curPos += 1
repeatCount += 1
}
while (true) {
// Now, try to match the rest
save curPos
… code to match the rest of the regex …
if (rest of regex matches) goto DONE_MATCHED
// backtrack
restore curPos
if curPos <= 0 goto DONE_FAILED
curPos -= 1
repeatCount -= 1
// loop around and try to match the rest from here

}

So the answer to my question is that WREC doesn’t save the intermediate matches, it just keeps track of the length of the text matched against /a*/ (repeatCount in my code) and decrements the position and count in order to backtrack. This is great, because it uses only two words of memory for state (instead of N million) but can find the next longest match very quickly (just 4 instructions).

But this left me wondering, how does WREC handle something like /(a|bb)*x/, where backtracking needs to go back a variable number of characters? Here’s the code that replaces ‘repeatCount-1′ with the backtracking logic needed here:

void GenerateParenthesesNonGreedyFunctor::backtrack(WRECGenerator*) {
// FIXME: do something about this.
CRASH();
}

Interesting. This function isn’t called anywhere, so it’s not a crashing bug in WebKit. The reason it’s not called:

bool WRECParser::parseParentheses(JmpSrcVector&) {
// FIXME: We don’t currently backtrack correctly within parentheses in cases such as
// “c”.match(/(.*)c/) so we fall back to PCRE for any regexp containing parentheses.
m_err = TempError_unsupportedParentheses;
return false;
}

Very interesting. And on my machine, dna-regexp.js runs in about 43 ms (in jsc, the SFX command-line shell), but if I add an empty pair of parens to each regex in the test, it takes 206 ms. Breaking out the time for regexp-dna.js regex matching only (i.e., not including the string building and replace calls), with the parens SFX is 15% slower than SpiderMonkey’s, which is actually still pretty good.

It just so happens that SunSpider’s regexp-dna.js doesn’t use parens in any of its regexes.

But the web does, of course. To find out how common parens are, I instrumented a Firefox build to record regexes being run on web pages and opened some popular pages. Firefox processed 830 unique regexes, of which 287, or just about 1/3, contained parens.

String.replace. The other major portion of regexp-dna.js tests String.replace. Again, SFX is about 5x as fast as SpiderMonkey and I wanted to know why. The calls to replace in regexp-dna all look like this:

dnaInput.replace(’B', ‘(c|g|t)’, ‘g’)

This is supposed to replace all occurrences of ‘B’ with ‘(c|g|t)’. The search string is just ‘B’, so no regex processing is needed, just simple string searching. So I started reading WebKit code in StringPrototype.cpp. I saw that there is a special case fast-path for a non-regex search string, which makes sense. In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard. Testing SFX:

> ‘aba’.replace(/a/, ‘x$&x’)
xaxba // good: $& is the matched text
> ‘aba’.replace(’a', ‘x$&x’)
x$&xba // oops

I also couldn’t find out how the ‘g’ flag was implemented, or even any code to read the third “flags” argument at all. Testing SFX again:

> ‘aba’.replace(’a', ‘x’, ‘g’)
xba

Hmmm. So the ‘g’ flag is not implemented at all. That’s OK, it’s not part of the ECMAScript standard, it’s just a SpiderMonkey extension. But that means the benchmark contains a flag that just serves to make SpiderMonkey do a lot more work than SquirrelFish. With that flag removed, so that both programs are doing the same replacement operation, the performance difference goes from 5x to 1.5x. Hopefully SpiderMonkey will have a similar fast path soon, which should bring that to parity.

But I think it’s pretty clear regexp-dna.js has a bug. Either the ‘g’ should be removed, or it should be recoded to do the global replacement in vanilla ECMAScript.

Final words. Based on its performance on the regexes it does handle, WREC is indeed an awesome design. regexp-dna.js, however, is flawed and exaggerates SFX performance.

We could use nanojit to make a regex compiler for SpiderMonkey that would perform as well as WREC. But I don’t know if it’s worthwhile yet. Regex performance is much less important for today’s web than it is for SunSpider–I hope to link to a report on that in a future post.

Comments

Comment from Felix
Time: October 6, 2008, 7:27 pm

regexp-dna.js bug report
https://bugs.webkit.org/show_bug.cgi?id=18989

Comment from Bill McCloskey
Time: October 6, 2008, 7:27 pm

Since you posted a link to the Russ Cox article, I thought I’d point out something that I learned from Ras a while ago: the backtracking approach doesn’t use longest-match semantics as the NFA approach does. So the two techniques differ in more than their performance properties. Here’s an example, written in Python:
import re
re.match(’(|a)*’, ‘aaaaa’).span() # returns (0, 0)
I would expect the whole string to match, but instead only the epsilon at the beginning of the string matches. This seems really weird to me, but it’s how all backtracking-based matchers work: they pick the “left-most match” rather than the longest one.

It’s too bad that getting NFA-based matchers to do grouping is such a pain in the ass.

Comment from Mark Rowe
Time: October 6, 2008, 8:17 pm

> In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard.

Did you file a bug report about this?

Comment from Robert O’Callahan
Time: October 6, 2008, 8:32 pm

Make sure to file a Webkit bug!

Comment from Jeremy
Time: October 6, 2008, 8:48 pm

Minor correction: backtracking regular expression engines *are* NFA engines. What you refer to as NFA implementations above are actually using DFAs, not NFAs.

Same goes for Mr. McCloskey’s comment.

Comment from Cory Sharp
Time: October 6, 2008, 10:17 pm

Bill, you are just observing an abort when matching a repetition of epsilon. Since the epsilon _does_ match then ‘a’ isn’t tried, but epsilon does not advance so the match is complete. Observe re.match(r”(a|b)*”,”aabba”).span() returns (0, 5).

Comment from Cory Sharp
Time: October 6, 2008, 10:25 pm

Bill, my mistake, nevermind.

Pingback from Regex Engines in JavaScript – A River of Words
Time: October 6, 2008, 11:31 pm

[...] led me to this post by what appears to be a member of Google’s Chrome team, regarding the Squirrelfish [...]

Comment from Gijs
Time: October 7, 2008, 12:27 am

Bill: really? Because in SpiderMonkey (which according to the article also uses backtracking) I obtain this:

>> (”aaaaa”).match(/(|a)*/);
aaaaa,a

Which seems like what you said you expected…

Comment from Chris Jones
Time: October 7, 2008, 12:29 am

Joel Galenson, Ras Bodik and I came across the /a*a/ example you showed. If you follow Russ’s technique and extend it to ‘n’ /a*/’s followed by ‘n’ /a/’s, the backtracking algorithm takes exponential-squared (!) time matching ‘n’ a’s, worse than Russ’s example. It’s pretty amusing to stall Firefox with a 15-character regexp match (not that only Firefox is so affected).

RE Bill’s comment: making NFA algorithms group is not so bad, but making the returned groups useful and understandable is harder. We started looking into this, but the “future post” Dave mentioned could describe why we might stop ;-) .

Comment from Chris Kuklewicz
Time: October 7, 2008, 1:02 am

Bill is right. There are 2 major categories of regular expressions: Perl uses left-biased searches that look at alternatives (a|b) and try ‘a’ first, and only try ‘b’ if ‘a’ fails, here (a|b) is very different from (b|a). Posix searches (e.g. grep) try both ‘a’ and ‘b’ and return the longest possible overall match. Thus (a|b) is the same as (b|a).

A left-biased seach uses back-tracking. A Posix seach uses an NFA. But one can also convert NFA’s into DFA’s for space/time tradeoffs. And one can still return sub-matches in parentheses using a DFA, see the work at http://www.laurikari.net/tre/ and (my) Haskell version at http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa

Comment from Ted Mielczarek
Time: October 7, 2008, 2:55 am

Thanks for a very readable and interesting writeup on a complex technical topic!

Pingback from Ajaxian » Regex performance in modern JSVMs
Time: October 7, 2008, 2:57 am

[...] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) [...]

Pingback from Ajax Girl » Blog Archive » Regex performance in modern JSVMs
Time: October 7, 2008, 4:07 am

[...] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) [...]

Comment from Boris
Time: October 7, 2008, 5:05 am

Another option for the test is to have more than one ‘B’ char in the string and then to fail the test if they don’t all get replaced.

But yes, please file bugs on both the test and their buggy string-replace…

Comment from Ben Hearsum
Time: October 7, 2008, 6:39 am

I’m a lay-man to this kind of thing but this was a really interesting post nonetheless. Thanks!

Comment from Fritz
Time: October 7, 2008, 7:33 am

I don’t know much about this whole situation, but “trampoline” is a technical term in programming, and therefore may be more than just a typo or watermark. See for example the Wikipedia article at . I first heard the term in connection with work on continuations, in the context of Scheme, I think, but the Wikipedia entry suggests that related technical usages predate that work (which was probably mid to late 80s).

Comment from RichB
Time: October 7, 2008, 8:45 am

The test already has a bug report:

https://bugs.webkit.org/show_bug.cgi?id=18989

Comment from nemo
Time: October 7, 2008, 10:00 am

https://bugs.webkit.org/show_bug.cgi?id=18989
Here’s one of ‘em at least.

Comment from Boris
Time: October 7, 2008, 4:21 pm

Fritz, I assume the typo/watermark part David is talking about is the “gererate”, not the “trampoline”.

Pingback from Javascript News » Blog Archive » Regex performance in modern JSVMs
Time: October 7, 2008, 7:43 pm

[...] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) [...]

Comment from Mark Rowe
Time: October 7, 2008, 10:48 pm

It was a simple typo that I fixed in http://trac.webkit.org/changeset/37383.

Pingback from Css howto » Regex performance in modern JSVMs
Time: April 23, 2009, 7:32 am

[...] was the conclusion that David Mandelin of the Tamarin project as he looked into how “SquirrelFish Extreme (SFX) [...]

Write a comment