Main menu:

Site search

Categories

Archive

SunSpider Statistics, Part II

In my last post, I laid out some questions I wanted to answer. I decided to tackle these questions empirically, by running SunSpider 100 times with –runs=10, for a total of 1000 test suite runs, and then see if SunSpider’s confidence intervals and tests come out as predicted. Empirical methods are nice because you can do interesting statistics without doing hard math.

The first one:

Question 1: Are SunSpider times normally (gaussian) distributed?

There are a bunch of normality tests out there but I didn’t want to do anything complicated. Instead, I just made a histogram of times and compared it to the best fit normal distribution. For my 1000 runs, the mean was 1232.2 and the standard deviation was 11.48. Here is a plot of the observed times in black and the normal distribution with mean 1232.2 and standard deviation 11.48 in red (ignore the blue line for now):

Normal Distribution Fit

The shape is vaguely bell-curve like, but the scaling really isn’t normal. In particular, the peak of the empirical data is about twice as high as the normal distribution model, and the data is much more concentrated a center band. Also, the tail of the real data goes much further to the right than the normal distribution. In particular, for the longest time I got, 1329ms, the normal distribution has a probability of 1.3×10^-14. Finally, the left tail of the data falls off a bit faster than the normal distribution.

It’s safe to say these times are not normally distributed. If they are, then my sample is a once-in-the-lifetime-of-the-earth rarity. Another reason not to expect a normal distribution in the first place is that a normal distribution can generate any value, positive or negative, but SunSpider times can only be positive. So I tried fitting a related positive-only distribution, the log-normal, shown in blue. To me, it looks neither a better fit nor a worse one.

I think these times might come from a model with two different kinds of noise:

T = m + En + Ep

where m (”mean”) is the baseline time as before. En is noise centered at zero with a higher peak than a gaussian distribution. I’m not sure why the peak is so pointy; it might be because of the limited (1ms) timer resolution. Ep is another source of noise that is always positive (making the program run longer), perhaps with an exponential or log-normal distribution. I think Ep comes from rare events that cause relatively long delays, such as a background process waking up to check for updates on some program I never use.

Although the times aren’t normally distributed , and all of SunSpider’s statistics are based on assuming a normal distribution, the statistics might work out all right in practice. We have to check them separately.

Question 2: Are SunSpider’s confidence intervals valid?

SunSpider gave me 100 95% confidence intervals for my 100 runs. If they are valid confidence intervals, then about 95 of them should contain the true mean.

Of course, I have no way of getting the true mean from any finite number of samples, but I’m going to arbitrarily assume that taking the mean or median of my 1000 samples is about right. The mean of 1000 samples is 1232.2. The median of the 100 averages is 1232.4, so it doesn’t really matter which one I use. I used the median because I figured it would reduce the effect of the outliers at the top.

In my trial, 86 of 100 confidence intervals contained the median. I think that’s pretty good, given the complex distribution of the underlying process.

For comparison, I did the same confidence interval test 20 times on sets of 100 fake SunSpider runs where the results are generated from a true normal process with the same mean and standard deviation. In those tests, 87 to 97 of the results contained the median, usually around 93 or so, about as predicted. (Because the median isn’t the true mean, the confidence intervals probably miss the median more often than they miss the true mean.)

So, for the total time, the SunSpider confidence interval is probably pretty good if regarded more as a 90% confidence interval instead of 95%.

I did the same experiment for each individual test as well. On those, the score ranged from 64 for 3bit-bits-in-byte to 100 for cordic, with most of them around 85-95. Broadly, I’d say the results are about as reliable as those for the total score. It does seem that the longer tests tended to do worse, probably because the rare-delay noise gets more of a chance to kick on longer tests.

On medium length tests (runtime 8-80ms or so), the confidence intervals seem very reliable. I figured that the normality assumption might hold better for these tests, but it turns out they have pointy peaks and long right tails too. The real reason the median is always in the confidence interval on cordic is that (a) the results are very tightly clustered (99% are either 22ms or 23ms) so most runs are close to the mean, and (b) there are some outliers, bumping up the standard deviations a bit.


Question 3: Are SunSpider’s significance tests valid?

I tested the significance indicators using a similar procedure as for confidence intervals. Using my 100 runs, I did 50 pairwise comparisons. Because they are all for the same JavaScript engine, all differences are random noise, so there should be a “significant” indicator on about 2 or 3 pairs (5%).

SunSpider reported “significant” in 4 cases (8%). That’s really pretty good, and I’d be willing to believe that the extra 1.5 “significant”s are just a feature of my sample, except for the fact that all my other tests so far show SunSpider’s statistics to be a little less precise than with a true normal distribution.

On the individual tests, “significant” indicators ranged from 0% to 24% of trials. I couldn’t find any real pattern in the numbers. But I would say that the “significant” indicators aren’t really reliable at the 5% level for individual tests. Also, from experience with the data and actually debugging performance effects, I’ve found the significance indicator is fairly useful on the total, but not on the individual tests.

Whether this is good enough or not depends on how you intend to use the tests. For a one-off run, I’d say the significance indicator on the total is solid, and the others can be used with a grain of salt and in the light of other data.

But if you run SunSpider all the time, like we do, you would get a lot of false positive differences even on the total. I’d say a 99% significance level would be more useful in that case, which corresponds to maybe doubling the width of SunSpider’s confidence intervals. For a –runs=10 run, that means a difference of 15ms or more is significant.

But given the right-tail outliers, if you do see a significant time increase, it’s probably always a good idea to rerun to make sure you weren’t looking at an outlier. Seeing two outliers in a row is pretty unlikely.

In part III, I’ll get to the questions of how to process the results in practice. Also, over the weekend I’m going to try to run a series of tests over our revision history, so I should also have some results to report from that next week.

SunSpider Statistics, Part I: Questions

The TraceMonkey project uses SunSpider regularly to measure JavaScript performance. The problem is that the results are often hard to interpret. Is an 11 ms improvement a real speedup or just random? We all have our intuitions about how to use the numbers, but I wanted to try a little statistics on the problem. The big questions are: what do those SunSpider confidence intervals and significance tests really mean, what tests can developers use day-to-day to check their work, and what tests can we use over time to look for regressions? Below I lay out the statistical background behind SunSpider’s results and make these questions more concrete.

SunSpider’s confidence intervals. As I’m sure you know, a basic SunSpider result looks like this:

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                 1226.2ms +/- 0.3%
--------------------------------------------

  3d:                   176.6ms +/- 0.5%
    cube:                43.8ms +/- 1.0%
    morph:               34.5ms +/- 1.1%
    raytrace:            98.3ms +/- 0.7%

  access:               155.2ms +/- 0.5%
    binary-trees:        45.3ms +/- 0.8%
    fannkuch:            67.0ms +/- 0.7%
    nbody:               28.7ms +/- 1.2%
    nsieve:              14.2ms +/- 4.0%

  ... and 19 more individual tests in 7 groups

Doing the multiplication, the 95% confidence interval works out to 1222.5-1229.9ms. What this confidence interval really means is rather technical.

First, SunSpider’s confidence intervals are based on a gaussian model of run time. The model assumes that the time taken on any given run is a random variable

T = m + E

where m (”mean”) is a constant “baseline” run time and E (”error”) is a random variable representing noise. The model also assumes that E is gaussian with mean 0. If the model is perfectly true in reality, then there is a strong mathematical interpretation of the confidence interval. But to the extent that the model deviates from reality, the confidence interval is less meaningful.

This brings up Question 1: Do SunSpider times vary with a gaussian distribution?

If the model is true, then over a large set of SunSpider runs, the baseline time will be in the confidence interval in 95% of runs. With the run I showed above, this seems to imply that, the baseline time is in the interval 1222.5-1229.9ms with probability 0.95, but according to the standard statistical interpretation, that interpretation is wrong, because 1222.5, 1229.9, and the baseline time are not random variables, so probability does not apply. But I don’t think the intuitive interpretation is that bad.

Question 2: In a large set of SunSpider runs, is the baseline time in the calculated confidence interval 95% of the time?

SunSpider’s significance tests. SunSpider can compare two sets of results, giving output like this:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.009x as slow*  1226.2ms +/- 0.3%   1237.3ms +/- 0.8%     significant

=============================================================================

  3d:                  *1.031x as slow*   176.6ms +/- 0.5%    182.1ms +/- 2.0%     significant
    cube:              *1.032x as slow*    43.8ms +/- 1.0%     45.2ms +/- 2.7%     significant
    morph:             ??                  34.5ms +/- 1.1%     34.8ms +/- 2.3%     not conclusive: might be *1.009x as slow*
    raytrace:          *1.039x as slow*    98.3ms +/- 0.7%    102.1ms +/- 3.4%     significant

  access:              ??                 155.2ms +/- 0.5%    156.3ms +/- 1.2%     not conclusive: might be *1.007x as slow*
    binary-trees:      *1.015x as slow*    45.3ms +/- 0.8%     46.0ms +/- 1.5%     significant
    fannkuch:          ??                  67.0ms +/- 0.7%     67.3ms +/- 1.8%     not conclusive: might be *1.004x as slow*
    nbody:             ??                  28.7ms +/- 1.2%     29.0ms +/- 1.2%     not conclusive: might be *1.010x as slow*
    nsieve:            -                   14.2ms +/- 4.0%     14.0ms +/- 2.4%

Differences are reported as “significant”, “not conclusive”, or blank, presumably meaning “not significant”. SunSpider’s “significant” means that the difference is significant at the 0.05 level according to the t test, which brings up more technical bits.

In general, significance tests are meant to separate observed differences that are real (the underlying baseline means are different) from random differences (the means are the same, and only the noise was different). In a basic significance test, we implicitly assume the means are the same, and then look to see if our observations tend to disprove that. If our observations would be very unlikely under that assumption, then we say the difference is significant. (c.f. if I think a certain roulette wheel is fair, and it comes up “13″ 10 times in a row, I have good reason now to think it’s not fair.) So, to say that two runs are different at the 0.05 level is to say: if the means are really the same, then there was only a 5% chance of getting runs that different. In other words, if I go around doing experiments looking for differences when there aren’t any, I’m going to get p=0.05-significant results 5% of the time.

Since SunSpider has 26 tests, this means if you do 2 runs of the same system, if the significance tests are applicable, you’d probably see 1 or 2 false “significant” differences in the results.

The t test is a significance test that applies when comparing two experiments with the gaussian model introduced above, where each experiment runs N trials and averages them. There’s a bunch of mathematical detail, but the basic idea is fairly simple. The idea is to construct a confidence interval not for a mean in one experiment, but for the difference in means in two experiments, say a 95% confidence interval. If that confidence interval doesn’t contain 0, then we’ve seen a result that’s pretty unlikely (<=5%) if the means are the same. So the difference is significant.

The mathematical formulas used in the t test are just to correct for the fact that we don't know the real standard deviation, just an estimate from our data. They're a more complex analogue to the mysterious "n-1" used in place of "n" in computing sample standard deviations and deriving 1-experiment confidence intervals.

The question for SunSpider is Question 3: If we run a large number of SunSpider comparisons of identical JavaScript engines, does SunSpider say “significant” about 5% of the time?

SunSpider in practice. Our everyday use of SunSpider is through a little script called bench.sh that runs one trial and prints out the total time. The idea is that a developer can run it a few times, make a change to TraceMonkey, and then run it a few more times and see if it seems to be better or worse. It’s supposed to be a quick test, taking 5 or 10 seconds to run and interpret, that can be run after any little change. So I want to know the answer to Question 4: What can we do with 3-5 SunSpider runs? How small of performance changes can it reasonably detect? What’s the best way to look at the numbers: average them, take the minimum, or something else?

Another use of SunSpider is to run over time, over our Mercurial history, to look for performance regressions that sneaked in with other changes. For that purpose, the runs will be automatic, so it’s OK if they take a while: we can do a lot of trials. So here, I want to answer Question 5: how many runs are needed to detect the smallest performance change that counts? To make that question real, we also need to know what kind of performance change counts. Theoretically, a 1ms slowdown is bad, because after 100 checkins like that, we’ve slowed down by 100ms, which definitely matters. But we seem to get changes of plus or minus 1-3ms “randomly” with many checkins just because gcc’s optimizer does something a little different. So I figure 5ms or so is a real difference that we’d like to detect.

Summary. I’ve laid out two broad areas I’d like to learn more about. First, I want to know if SunSpider’s statistical claims hold up in reality, and if not, where the deviation is. Second, I want to figure out the right number of runs to use, how to combine them, and how small of differences they can detect for a few important practical applications.

In Part II, I’ll start trying to answer these questions.

Seven things you may not know about me

I’ve been tagged by bsmedberg:

Rules

  1. Link back to your original tagger and list the rules in your post.

  2. Share seven facts about yourself.
  3. Tag some (seven?) people by leaving names and links to their blogs.
  4. Let them know they’ve been tagged.

Seven Things

  1. I hail from the frozen tundra of Wisconsin, where the current temperature at my mom’s house (hi, Mom!) is -1.2 °F. That’s one aspect I don’t miss, although I still know how to survive it.
  2. My first computer was a TI-99/4a.
  3. I have a younger sister, who currently lives in the Atlanta area.
  4. My girlfriend’s name is Natalie. She is a super attack dog lawyer, so be sure not to libel me or scratch my car or anything.
  5. My cat’s name is Leslie, although I usually call her Beeboo, or some sickeningly cutesy variation thereof. She has appeared on this blog before.
  6. I have been known to cook (Cooking for Dummies turned out to be an amazing training manual), although I’ve been too busy lately to make anything elaborate. My last non-trivial effort was a shrimp risotto I made with Natalie.
  7. When I was about 2 years old, I apparently ran into a wall or something and hit my head really hard. This may explain a few things.

I’m gonna take my time about tagging anyone else. Except Andreas. My excuse is that bsmedberg took half the people I’m on a tagging basis with.

Static Analysis Newslets

And, now that I’m posting again, I should offer a little news about other recent events.

Jason Orendorff has successfully used Treehydra and a few GCC attribute annotations to add a read barrier to the JS frame pointer in TraceMonkey.

I’m not exactly sure what that means myself, but I think the idea is this: Part of how TraceMonkey speeds things up is by constructing activation records for JS function calls only if and when it is necessary. But then TraceMonkey can’t safely call arbitrary native functions, because the native function might try to access those activation records, and thus crash. So TraceMonkey can’t speed up traces that call arbitrary native functions. Jason’s analysis verifies that every function that accesses the activation records is either already in a state where they are available, or first calls a function that makes them available. His analysis is pretty powerful: if a function needs the activation records only on certain paths, the analysis only requires it to have them on those paths, thus allowing it to run faster otherwise. I believe all of this is a first step toward tracing through all sorts of DOM calls, which is essential to speeding up the web as a whole.

Graydon Hoare is just starting on an analysis application to verify that errors are handled correctly in TraceMonkey/SpiderMonkey. He’s doing the equivalent of Java checked exceptions in C. In Java, if you call a function like ‘void foo() throws FooException’, then whenever you call foo(), Java requires that you either declare that you too can throw a FooException, or you catch it in a catch block.

SpiderMonkey uses C-style error reporting, i.e., the return value of a function indicates whether there was an error. So the equivalent of checked exceptions is something like this:

  // The attribute tells us that a JS_FALSE return value indicates an
  // error, and that the possible error conditions are OOMError and
  // AsmError.
  JSBool foo() __attribute__(("failure:JS_FALSE:OOMError,AsmError"))

  JSBool fooCaller() {
    JSBool rv = foo();
    if (!rv) {
      return JS_FALSE;
      // Error! If foo() fails, we have to call an "error handler" like
      // handleAsmError().
    }
    ...
  }

At a high level, this is similar to Jason’s analysis, but all sorts of practical details differ, and practical details are annoying significant in static analysis. Graydon’s currently bravely trying out the largely undocumented ESP library APIs.

In other news, a little while back I implemented a simple regexp compiler that translates JS regular expressions to native code by way of nanojit, TraceMonkey’s cross-platform backend. I guess the WebKit team doesn’t have a blog where they would bust me for having a really incomplete implementation, so I’ll have to do it myself. My implementation so far is less complete than theirs, but it is good enough at least to give massive speedups on SunSpider’s regexp-dna.js and simple regular expressions of that kind. I’ll really bust myself by saying I haven’t implemented “dot” yet. I think V8 has a regexp compiler too, now.

I think the really important difference between the TraceMonkey regexp compiler and WebKit’s (WREC) is in the backend. TraceMonkey uses nanojit, a cross-platform compiler that turns linear (or mostly-linear) LIR code sequences into optimized native code. WREC, the last time I looked at it, uses an x86 assembler library. Thus, WREC can very directly implement a human-designed x86 code generation and register allocation pattern, making good x86 code. Also, the assembly process is a lot quicker than nanojit’s compilation process. But unlike nanojit, it doesn’t have the opportunity to automatically optimize the code, and is x86-only. In practice, this means WREC wins on getting to the breakeven point faster (compiling takes time, which you win back as you run the regexp against more and more text characters), and running a hair faster on regexp-dna, while TraceMonkey wins on being cross-platform.

I guess as the author I’m not really qualified to say so, but I do think my little regexp compiler is pretty simple and clean, so if anyone out there is interested in getting into some compiler hackery, it might be a fun place to start. And there’s lots left to be done.

A History of Insanity in the Age of x86

It’s been a long time since i’ve blogged–I’ve been pretty deep in coding mode. But bug 471822 has been fixed, and it’s time to celebrate with a post.

Bug 471822 is a TraceMonkey performance regression on SunSpider of about 70 ms or so that Andreas Gal noticed recently. And it was worse than a simple slowdown, it was an intermittent slowdown, making benchmark scores hard to use to drive optimization efforts.

The first really interesting thing I noticed about the bug was that the performance regression depended on the length of the command line used by SunSpider to run the JS engine. If I renamed one of the files to a longer name and ran with the longer command line, the benchmark could be either slower or faster. That’s really weird. I spent some time reading the source code to bash, thinking that maybe there was a bug in bash that altered the way it started the program. But there was nothing to be found there, and eventually I figured out how to replicate the problem using my own little C program with fork and execve.

Somehow, I figured out that the real effect of command-line length is to alter the value of the stack pointer register (esp) on program startup, because command-line arguments and environment variables are passed into main as arguments. That’s just as weird as the idea of command-line length influencing perf. Fast-forward past many hours of performance profiling, hardware performance counter profiling, manual clock timer instrumentation, blah, blah, blah, and eventually I had apparently localized the perf hit to one part of one particular trace compiled for the access-nbody benchmark that did a bunch of floating-point math, but also reached the point where the other data I had collected on the problem basically made no sense.

Azathoth The patron saint of microarchitectural perfomance analysis.

So I sat down on a bean bag and started swapping crazy ideas with Andreas. We suspected that there was a problem with floating-point math, and we knew that none of the sensible theories about the bug were consistent with the data. Andreas was talking about data alignment, and in particular how he had proved that on-stack floating-point numbers were always stored in addresses offset by nice multiples of 8 from the frame pointer register, ebp, when I remembered that I had discovered earlier that day that on compiled traces, ebp always had an address ending in c (e.g. 0×0176f00c).

(Background: JavaScript numbers correspond to the C type double or IEEE-754 double-precision floating-point numbers, which are 8 bytes long. For various mysterious reasons, if you are working with 2^k-byte values, it is usually better to keep them in addresses that are multiples of 2^k. Thus, doubles should be stored in addresses that are multiples of 8, or in other words, addresses that end with 0 or 8 in hex. Compilers like gcc do this automatically, but if you are writing your own compiler, it’s all up to you.)

We verified the misalignment. I also remembered that in functions compiled by gcc, ebp always ends in 8. The easiest way to test the theory was to modify nanojit so that ebp is 8-byte-aligned, so Andreas did that, I tested it, and it worked.

With the 8-byte alignment in place, the inconsistent performance went away. On my laptop in the JS shell, that corresponded to an 7x speedup in access-nbody. I also got a 5-15% improvement in the other math-heavy benchmarks. Andreas got only a 1% total speedup on his newer Penryn laptop, but he was happy to have consistent benchmarks again. We think Penryn must be smarter about unaligned accesses, or benefits from its 24-way set-associative L2 cache (vs. 16 on mine), or probably just something 10x weirder.

Squirrelfishing regexp-dna.js

Recently I’ve been trying to figure out exactly how SquirrelFish Extreme (SFX) is kicking our butts so badly on regexp-dna.js, by about 5x on my machine. Numerically, (WebKit Regular Expression Compiler) WREC provides most of that 5x, but there are some weird twists to the story. My main conclusions: WREC indeed makes for a good regex engine, but also, WREC bails out of any regex with parentheses, and regexp-dna.js contains a bug that favors SFX over SpiderMonkey.

Update: The bug in regexp-dna.js has been fixed in WebKit! Thanks, WebKit team! That was fast. The new code:

for(k in subs)
dnaInput = dnaInput.replace(k, subs[k])
// FIXME: Would like this to be a global substitution in a future version of SunSpider.

Technical details: the design of WREC. There are two main ways to implement regular expressions: using a backtracking matching engine, or by transforming the regex to a finite automaton (NFA, aka “state machine”), which does not backtrack. Most Perl-type regex engines, including both SpiderMonkey’s and WREC, follow the backtracking design. I don’t know the exact history of that choice, but at present it is much easier to implement features like group capture and backreferences in the backtracking design. Also, although some regexes scale only if implemented as NFAs, my tests suggest that many simple regexes, including those in SunSpider, are faster with backtracking.

As of this writing, WREC’s implementation strategy is dirt simple (which is a good thing). There are no transformations or fancy optimizations on the regex. WREC simply generates native code that directly implements the backtracking search. Thus, within a single match operation, there are no function calls, no traversals of regular expression ASTs, and few option tests, so almost all of the overhead is eliminated.

WREC’s code is very easy to read, so if you want to know exactly how it works, just read it in WREC.cpp. It’s also great example code for anyone implementing a compiler for a simple language like regular expressions. The basic plan is to parse the regular expression with functions named things like parseDisjunction (the | operator). Those functions directly call functions like generateDisjunction that generate the native code using the same assembler that the call-threading interpreter uses. There’s also the oddly named “gererateParenthesesResetTrampoline”. Inexplicably preserved typo, or watermark to detect copying of WREC code?

An an example, for the regex /a|bb/, the generated native code looks something like this:

curPos = start
if curPos >= textLength goto CASE2
if text[curPos] != ‘a’ goto CASE2
curPos += 1
goto DONE_MATCHED
CASE2:
curPos = start
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
if curPos >= textLength goto DONE_FAIL
if text[curPos] != ‘b’ goto DONE_FAIL
curPos += 1
goto DONE_MATCHED

As you can see, the generated code does almost nothing beyond what is absolutely needed to implement the matching correctly. Also, curPos, textLength, and the start address of text are all kept in fixed registers. Based on a variety of microbenchmarks, I suspect that the critical path for this code is simply the unavoidable work of reading all the characters in the text, so the code is near maximum efficiency.

Backtracking in WREC. One tricky part of a backtracking regex engine is the backtracking. Consider matching the regex /a*a/ against the text ‘a’. First, the engine will greedily match /a*/ with ‘a’, leaving a remaining regex fragment of /a/ and a remaining text of the empty string. When the engine tries to match /a/ against ”, it fails. Now it must backtrack, specifically by going back to the /a*/ step and finding the next longest match. The next longest match is ”, leaving a remaining regex fragment of /a/ and a remaining text of ‘a’, and the engine will finish with a successful match.

I was curious how WREC implemented backtracking. For example, with /a*a/ and a text of a million ‘a’s, I wanted to know if WREC saves all one million shorter matches in case it needed to backtrack them (which seems like a waste of space), or if WREC recomputes them if needed (which could be slow if a lot of backtracking needs to be done). I found that WREC generates code like this for a quantified regex a* (I reordered things to make it easier to read, but the logic is the same):

// First, greedy match
repeatCount = 0
while (next char is ‘a’) {
curPos += 1
repeatCount += 1
}
while (true) {
// Now, try to match the rest
save curPos
… code to match the rest of the regex …
if (rest of regex matches) goto DONE_MATCHED
// backtrack
restore curPos
if curPos <= 0 goto DONE_FAILED
curPos -= 1
repeatCount -= 1
// loop around and try to match the rest from here

}

So the answer to my question is that WREC doesn’t save the intermediate matches, it just keeps track of the length of the text matched against /a*/ (repeatCount in my code) and decrements the position and count in order to backtrack. This is great, because it uses only two words of memory for state (instead of N million) but can find the next longest match very quickly (just 4 instructions).

But this left me wondering, how does WREC handle something like /(a|bb)*x/, where backtracking needs to go back a variable number of characters? Here’s the code that replaces ‘repeatCount-1′ with the backtracking logic needed here:

void GenerateParenthesesNonGreedyFunctor::backtrack(WRECGenerator*) {
// FIXME: do something about this.
CRASH();
}

Interesting. This function isn’t called anywhere, so it’s not a crashing bug in WebKit. The reason it’s not called:

bool WRECParser::parseParentheses(JmpSrcVector&) {
// FIXME: We don’t currently backtrack correctly within parentheses in cases such as
// “c”.match(/(.*)c/) so we fall back to PCRE for any regexp containing parentheses.
m_err = TempError_unsupportedParentheses;
return false;
}

Very interesting. And on my machine, dna-regexp.js runs in about 43 ms (in jsc, the SFX command-line shell), but if I add an empty pair of parens to each regex in the test, it takes 206 ms. Breaking out the time for regexp-dna.js regex matching only (i.e., not including the string building and replace calls), with the parens SFX is 15% slower than SpiderMonkey’s, which is actually still pretty good.

It just so happens that SunSpider’s regexp-dna.js doesn’t use parens in any of its regexes.

But the web does, of course. To find out how common parens are, I instrumented a Firefox build to record regexes being run on web pages and opened some popular pages. Firefox processed 830 unique regexes, of which 287, or just about 1/3, contained parens.

String.replace. The other major portion of regexp-dna.js tests String.replace. Again, SFX is about 5x as fast as SpiderMonkey and I wanted to know why. The calls to replace in regexp-dna all look like this:

dnaInput.replace(’B', ‘(c|g|t)’, ‘g’)

This is supposed to replace all occurrences of ‘B’ with ‘(c|g|t)’. The search string is just ‘B’, so no regex processing is needed, just simple string searching. So I started reading WebKit code in StringPrototype.cpp. I saw that there is a special case fast-path for a non-regex search string, which makes sense. In fact, it’s so special and so fast that I couldn’t even find any code to implement the ‘$&’ replacement string syntax required by the ECMAScript standard. Testing SFX:

> ‘aba’.replace(/a/, ‘x$&x’)
xaxba // good: $& is the matched text
> ‘aba’.replace(’a', ‘x$&x’)
x$&xba // oops

I also couldn’t find out how the ‘g’ flag was implemented, or even any code to read the third “flags” argument at all. Testing SFX again:

> ‘aba’.replace(’a', ‘x’, ‘g’)
xba

Hmmm. So the ‘g’ flag is not implemented at all. That’s OK, it’s not part of the ECMAScript standard, it’s just a SpiderMonkey extension. But that means the benchmark contains a flag that just serves to make SpiderMonkey do a lot more work than SquirrelFish. With that flag removed, so that both programs are doing the same replacement operation, the performance difference goes from 5x to 1.5x. Hopefully SpiderMonkey will have a similar fast path soon, which should bring that to parity.

But I think it’s pretty clear regexp-dna.js has a bug. Either the ‘g’ should be removed, or it should be recoded to do the global replacement in vanilla ECMAScript.

Final words. Based on its performance on the regexes it does handle, WREC is indeed an awesome design. regexp-dna.js, however, is flawed and exaggerates SFX performance.

We could use nanojit to make a regex compiler for SpiderMonkey that would perform as well as WREC. But I don’t know if it’s worthwhile yet. Regex performance is much less important for today’s web than it is for SunSpider–I hope to link to a report on that in a future post.

Static analysis newslets

I’ve been in interpreter-land lately, but I do help out a bit with static analysis projects when I get the chance. So I’d better post an update here on some interesting developments that haven’t been publicized yet.

First, Keith Schwartz (one of our interns) is making great progress on automatic const-correctification for Mozilla. The basic idea is to put const on as many declarations as possible without breaking the code or introducing casts. Keith has devised an algorithm based on type inference. Currently, he’s working on the Treehydra code to extract the type constraints from code. Because C++ is so complicated, there are a ton of details, and he’s had to master the insanity of GCC intermediate representations of pointers to member functions and calls through them. (If by chance, any readers know of a non-insane way to access them in GCC, let us know.)

Second, the Cairo folks were kind enough to give me an hour to talk about static analysis with the Hydras at their recent meetup. They already had 2 static analysis applications in mind. One was ensuring that internal-only return codes aren’t returned from the public API. The other was checking that integer and fixed-point values, both represented by C ints, aren’t mixed together. I think both of these can be formulated as type checking or inference problems.

I’m hoping we can extract a generic Treehydra type inference library from Keith’s code for the Cairo problems and others. One issue here is that Cairo is C, while Keith’s been using the C++ ASTs for his work. I don’t even know if Treehydra can read C ASTs at this point, but I think Treehydra’s design makes extending to C not too hard.

Chrome: First Look

I’ve been trying out Chrome on my Vista machine at home, and I read a few hundred comments on Chrome last night on Digg. Here are my first thoughts, just as a browser user and an engineer.

Responsiveness. The most noticable things right away are that the UI is minimalistic and very responsive. The screen responds promptly to clicks, and UI elements appear, disappear, and move around very quickly. Responsiveness makes any tool (or toy) more enjoyable to use. So I like that a lot.

AFAIK, Chrome’s UI is straight C++, so that plus the minimalism should make it very fast. Firefox, of course, has a bigger UI coded in XUL/JS, which gives FF more features and excellent extensibility. I would never want to see FF give up extensibility, but I would sure like a snappier FF. I have no idea what that would take; speeding up JS should help, perhaps some sort of XUL tuning or overhaul is needed as well.

funny picturesProcess Isolation for Tabs. Less technically known as process-per-tab. This looks great. Diggers love the idea as well. Personally, I want it so that one bad web page or Flash app doesn’t hang my whole browser. That seems to be what the Digg users like best about Chrome too. There’s also reliability-one really bad page won’t crash the whole browser, although FF crashes are rare for me anyway and the state restore option makes crashing not too painful. And security, so that a maliciously bad page can’t overwrite memory on anything else. I haven’t used Chrome enough yet to see these situations come into play, so I don’t know it will work in practice, but for now I’ll assume they got it right. There’s still the question of whether the Windows schedulers give enough isolation to maintain performance, too.

Anyway, I want it.

There is a tradeoff: Chome seems to use more memory, at least with several pages open. (I tend to run with 20-80 pages open at a time.) Chrome actually tracks the memory nicely. Type about:memory into the location bar, and you get a summary of memory for all your open browsers, including FF if it’s running. Very cool.

<technical-gobbledygook>Memory consumption on modern OSs is complex enough so that it’s difficult even to define clearly. Chrome’s “task manager” (press Shift+Esc) is reporting the private working set size. This is the same number Vista Task Manager displays by default. The private working set size is the amount of physical RAM allocated for the exclusive use of a process. (The full working set includes physical RAM accessible to to the process but shared with others. The term “working set” comes from the OS’s allocation strategy, which tries to allocate just enough physical RAM to cover the virtual memory the process touches frequently, which is called the “working set”.) about:memory also shows the virtual memory allocation (I think Windows calls it the “commit size”), which is the “true” (ignoring a few complexities, I’m sure) amount of memory used by the process. I think that for an active app, the total allocation is more relevant, because either it will be in memory, or frequently paged in and out, bogging everything down. For paused background stuff, the working set can be more important, but if it’s paged out, the total size will determine how long it takes to come back up when you ask for it. On my system, Chrome was using more memory by both metrics, but I didn’t use Chrome heavily enough to see any serious paging or thrashing, so I don’t know how it plays out in practice yet.</technical-gobbledygook>

The extra memory consumption might make Chrome less suitable for mobile devices. So process per tab should be an option for FF, but not required. Also because not everyone will want it on their non-mobile devices.

Ideally, I think process, thread, and DOM tree (i.e., tab) should be orthogonal concepts in the browser platform. The DOM tree is a unit of visual display and window control. The process is a memory protection domain. The thread is a unit of sequential execution. The user and the browser would control the policy of how many processes and threads are created and how they are assigned to tabs.

As an example of what you could do with this, you might have a default policy where each tab is a process, with each separate script or plugin being a separate thread inside that process. Thus, the tab is protected from other tabs, and its dynamic elements can run concurrently. But now say that one of those scripts loops. The other scripts might still be running OK, but with less CPU because of the looping script. If you want, you could shut down just this tab, and leave the other tabs unchanged. It’s dangerous to kill just one thread, because it can leave the memory inside the process inconsistent. But at worst, only this tab would crash. If it did, you could reload the page with a policy giving each dynamic element its own process. At this point, you could kill just one element on the page while leaving the rest running.

A more exotic example would be to assign two processes to one tab. This seems weird, but it would be possible if the 3 concepts were orthogonal. One use would be to allow two other elements (tabs or scripts) to interact closely with a element. Interprocess communication is expensive, but as long as the other two elements use the shared element one at time, we can just map the shared item into both processes, so they can access it relatively quickly.

Of course, that’s a lot of work. I hear there’s also some problem where the MacOS Core Graphics APIs make it harder to implement multiple processes for one browser window than it is on Windows.

Awesome Bar and Places. Chrome has some version of FF’s Awesome Bar and Places. Which is a good thing-at this point I would find it painful to use a browser that didn’t. I think the location bar has some Googlyness in it too, like Google Suggest or something. I started typing the names of some of my favorite blogs, and it got the popular ones after a few letters even though I’d never visited them before in Chome. That was pretty nice, but I imagine after you’ve used the browser for a while, you have visited your favorites anyway, so it won’t matter too much. But I use and love FF’s Awesome Bar so much, I think I would like having search/suggest even more integrated with it. It’s a small thing, but it could make the browser just that little bit more responsive and direct.

So far I’ve found the bookmarks interface kind of weird. It bothers me that I either have no bookmarks menu at all, or else an ugly toolbar strip across the top of the browser. But I think it’s good they’re trying something new-I think Places is great but will continue to improve. Personally, I use the Awesome Bar for all navigation to a single site. I only use the bookmarks menu to open a group in tabs (e.g., morning blog reading), or to bring up a group in the Ctrl+B view and click on various items in turn (e.g., Youtube playlist). So there’s probably some mystical third design (not to mention 17 existing FF extensions) that would be better for me.

I’m really not excited by tabs on top of the window, or the thumbnail view of open pages. That stuff doesn’t work when you have 26 tabs open.

V8, blah blah blah. V8 is the other hot topic on Digg. I’ve heard Zimbra runs really well on Chrome, but I personally haven’t done anything with it as a user where I really noticed a difference. TraceMonkey vs. V8 has been covered elsewhere, so I’m not going to bother with that. Plus all these VMs are pretty early, so I’m not terribly excited about day-by-day benchmarks.

What I am immediately interested in is what techniques the V8 team actually used, how it relates to my inline threading project, and what ideas I might be able to steal from it. But I’ll have to read the code and report on that another day.

Inline threading, TraceMonkey, etc.

It’s been a long time since I’ve posted here-I wanted to post some interesting results about speeding up SpiderMonkey using inline threading, but it turned out to be really hard and took a long time to get close enough to “interesting results”. At last, my patch is good enough to run SunSpider, and runs it 8% faster than baseline (non-tracing trunk SpiderMonkey from a few weeks ago), 10-20% faster on favorable benchmarks, and 48% faster on 3bit-bits-in-byte. So that’s pretty cool.

Of course, 8% looks pretty puny next to the huge gains of TraceMonkey (congrats to those guys, by the way). But I’m assured that interpreter speedups still count, so I’m chugging along. (Side note: inline threading speeds up SunSpider’s access-fannkuch benchmark by 22%, which has proved difficult to optimize with tracing.) (Side note 2: I’ve been told that TraceMonkey will hugely speed up our static analysis scripts, maybe 10x or so, which is great news.).

Insane, gory detail on inline threading, related optimizations, and detailed performance analyses can be found in bug 442379. I thought I’d go over the key ideas in this post:

Inline threading. Basically, this is yet another interpreter opcode dispatch technique. I previously wrote about opcode dispatch, concluding that direct-threading, in which the opcode is a target address, and the code to start the next op is a single indirect jump instruction is the ultimate efficient dispatch mechanism. It turns out I was wrong.

Inline threading is the “best”, because it gets dispatch down to 0 instructions. The idea is to create a buffer, and copy into it the native code for each opcode to be executed. For example, for a function body like “return a+3″, the opcodes are: JSOP_GETARG (0), JSOP_INT8 (3), JSOP_ADD, JSOP_RETURN. To inline thread this, we create a buffer and fill it with native code like this, using memcpy:

code for JSOP_GETARG
code for JSOP_INT8
code for JSOP_ADD
code for JSOP_RETURN

It’s like a really crude form of JIT compilation.

To run the function, we just jump to the start of the buffer, and then it all runs, with no further dispatch code. I’ve found that an average SpiderMonkey op executes about 35 instructions, so inline threading removes the 4 instructions for indirect threading, reducing this to 31, and should speed up SM by about 11%. Nice!

For more info on inline threading, see SableVM and this paper.

Hard Stuff. The only problem is that what I just described doesn’t actually work. For one, the compiler doesn’t necessarily compile each SpiderMonkey op handler into a single block of code. In fact, it usually reorders things a bunch, to help with code locality and reduce the number of jump/branch instructions executed on hot paths. So those are too hard to inline. (It could be done by disassembling parts of SpiderMonkey, analyzing the results, and doing code layout again, but that’s a bit much.)

Also, because most jump, branch, and call instructions express their targets using a relative address (an offset from the IP of the following instruction), any jump outside our little inline-threaded buffer becomes a jump to hell. So those would all have to be identified and patched, again with a dissassembler.

In general, small ops that don’t call functions, generate errors, or have much internal control flow can be inlined, but everything else can’t. Fortunately, “small ops” include some really common ones like JSOP_GETVAR and JSOP_POP, but the technique can’t be used without some special treatment for the big ops.

Call Threading/Context Threading. For big ops, I used something called call threading or context threading. Call threading has been used to produce big speedups on some interpreters, but turns out not to help at all for SpiderMonkey. But it can handle big ops in an inline-threading system, and after a lot of work, I at least got it to not slow down SpiderMonkey.

The idea of call threading is to create a native buffer, but fill it with calls to the opcode handlers instead of copying the whole handlers in. With the previous example, it gives you:

call &JSOP_GETARG
call &JSOP_INT8
call &JSOP_ADD
call &JSOP_RETURN

Those are x86 call instructions. For this to work, the opcode handlers have to end with ‘ret’ instructions. This means dispatch is 2 instructions, which is better than indirect threading’s 4, and I think better even than direct threading because that’s really 3 instructions if you count getting the op, which I should have before. Also, these call and ret instructions are highly predictable (99.9%+ prediction rate), unlike the indirect jumps used by direct and indirect threading, which are very unpredictable. Since a mispredict is very costly (~16 cycles on Core 2, I think), this gives a big speedup, 30% or so on some interpreters.

Except on SpiderMonkey, where unfortunately it doesn’t work and also slows things down. This was very frustrating.

The reason it doesn’t work is that when you execute a ‘call’ instruction, you push the return address onto the stack, decrementing the stack pointer ($esp) by 4. The opcode handler will then crash if it calls a function. There are actually several reasons why but the most important is that on most systems, the stack pointer has to be 16-byte-aligned when you call a function, and that -4 puts it off.

To make it work, I add extra code to unpush the return address right after the call, and then unpop it back right before the return. This works, but now we’re up to 4 instructions, so we haven’t saved any instructions over indirect threading. (I would love a better solution, but I couldn’t think of one.)

Next, it turns out that in SpiderMonkey, due to clever optimization by Igor & Co., the branch prediction rate is already excellent in practice: 80-100% on benchmarky-type programs. (The trick is make sure there is a separate indirect jump going out of the end of each opcode, so the processor can predict each one independently. Also, the Core 2 indirect branch predictor seems very smart.) So branch mispredicts are only costing an average 0-3 cycles per op. SpiderMonkey takes about 28 cycles per op, so this gives an estimated speedup of 0-12%.

Last, the really tragic thing is that the changes I made to make SpiderMonkey do call threading make GCC shoot itself in the face. For reasons I don’t entirely understand, GCC compiles the op handlers “differently”, so they run at least 5% slower. It took some doing to figure out how to keep the slowdown down to 5%, which makes call threading about even with indirect threading. That’s pretty disappointing, but at least it means I can call thread the big ops without penalty (on average), and then inline the small ops, getting some speedup. With the example code, I generate:

code for JSOP_GETARG
code for JSOP_INT8
call &JSOP_ADD
call &JSOP_RETURN

(The optimizer issue is incredibly arcane, but I think I traced the problem to an optimization pass called “gcse2-post-reload”, which is really partially-redundant-load elimination after register allocation. Any kind of partial redundancy elimination (PRE) can slow code down. The slowdown effect should be mitigated when using profile-guided optimization (PGO), but I couldn’t get GCC PGO to work on SpiderMonkey to test that theory.)

For more on call threading, see this paper, keeping in mind that the results would probably differ on Core 2 because of its presumed better indirect branch predictor.

Inlining-enabled optimizations. Inline threading some stuff and call threading the rest doesn’t yield exciting gains for SpiderMonkey, maybe 2% overall and 5-20% on the “good” benchmarks. (I guess that’s not too bad, the 2% overall just seems really weak.) But inline threading enables a bunch of other cool optimizations that could speed things up more. I’ve only done the easy ones so far, and that got the 48% speedup on 3bit.

Easy optimization #1 is to stop updating the SpiderMonkey PC. The call and inline threaded code tells what op handler to run next, so the PC is unnecessary. Actually, some ops do use the PC, and I’d like to make them not do that someday, but in the meantime I can just generate code to set the PC in my native code buffer:

mov 0, [pc]
code for JSOP_GETARG
mov 2, [pc]
code for JSOP_INT8
call &JSOP_ADD
mov 5, [pc]
call &JSOP_RETURN

It looks like I only saved the PC update for JSOP_ADD, but it’s better than that because the standard code has to load the PC, increment it, and then store it back, which I’ve replaced with just one instruction. The PC optimization is relatively easy and saves 0-3 instructions per op, which means a 0-10% speedup by itself. And it actually works.

“Easy” optimization #2 is to specialize certain opcodes. Take JSOP_INT8 as an example. This op pushes an integer onto the interpreter stack. That’s 3 or 4 instructions, to manipulate the stack pointer and store a value to it. But it also has to get the value out of the opcode stream and convert it to a jsval (the tagged value type of the interpreter), so it’s actually 9 instructions. (And JSOP_INT32 is much worse because it has to fetch 4 bytes from the opcode stream and shift and or them together.) But if we’re inlining JSOP_INT8, we can just inline the 3 or 4 instructions using the actual value, a version of JSOP_INT8 “specialized” for the given value. I do the specialization by writing the op in assembly code with a dummy value (0xdeadbeef), finding the location of the dummy as part of interpreter startup, and then patching that location as I inline. This seems crazy, but all the other design options seem crazy in their own ways, so I went with it for now. With specialization in play, the example looks like this:

code for JSOP_GETARG specialized for slot 0
code for JSOP_INT8 specalized for jsval for 3
call &JSOP_ADD
mov 5, [pc]
call &JSOP_RETURN

Because JSOP_GETARG and JSOP_INT8 use the PC only to get their arguments, which specialization bakes in, we get to remove some more PC updates too.

Compiler nitpickery. In itself, all this stuff works, and it’s been used in various research projects before, and probably some other interpreters by now. But this particular implementation of it depends a fair amount on what the compiler does: at least (a) that the compiler doesn’t reorder small ops too insanely, (b) that you can take the address of labels (to get the start and end of op handlers), (c) patching in some ‘ret’ instructions at runtime (needed to solve an obscure problem I didn’t feel like discussing here), and (d) that the compiler doesn’t deoptimize too badly when you do all this. That seems possible, but it’s not clear yet that this will play well across different versions of GCC and ICC. (Crazy side note: in my tests, GCC 4.2 on Mac regresses SpiderMonkey SunSpider by 5% vs. GCC 4.0.)

SquirrelFish

If you’re reading this, chances are that you already know about SquirrelFish, Appl/WebKit’s new Javascript implementation. Early tests show SquirrelFish to be 60% faster than WebKit 3.1 JS, 46% faster than Spidermonkey and 52% faster than TT (Tamarin Tracing) on SunSpider.

Clearly we have some work to do. The plan is to improve TT so that hot loops run highly optimized native code; TT’s optimizer is in the early stages, and we think there’s a lot of room for further optimization. For example, as explained in my previous posts, when TT jumps from one trace to another, it has to save the interpreter state to a standard format and then reload the state on the new trace, and no cross-trace optimization is possible. Ideas like trace folding have potential for a big improvement here.

But I still wanted to take a quick peek into why SquirrelFish’s interpreter is so fast. The announcement touts 3 key design features:

  • Using a bytecode interpreter.
  • Using direct-threaded interpreter dispatch.
  • Using a register-based bytecode interpreter.

Bytecode interpreter. The old WebKit JS was based on an AST walk, which is explained in some detail in the announcement. It’s well-known that bytecode interpreters are faster, so it’s no surprise Apple made this change. Spidermonkey and TT’s interpreter have always been bytecode interpreters.

Direct-threaded dispatch. Interpreters tend to spend a lot of time dispatching bytecode operations. Direct-threaded dispatch is a technique for efficient dispatch.

The obvious way to write a bytecode interpreter is with a switch statement inside a loop:

void run(Bytecode *ip) {
for (;;) {
switch (*ip++) {
case OP_ADD: …
case OP_JUMP: …

Each iteration runs one bytecode instruction. Each case of the switch handles one instruction type. It really doesn’t look like there’s any room for improvement here unless you look at the assembly code generated for the switch dispatch (from a tiny test interpreter I wrote today, comments added by me):

# ip is in %edx
# Check that switch expression (%edx) is in table range
cmpl    $9, (%edx)
leal    4(%edx), %ecx
ja    L26
# Look up case address offset for (%edx) in table (L37)
movl    (%edx), %eax
movl    L37-”L00000000006$pb”(%ebx,%eax,4), %eax
# Add base address to offset
addl    %ebx, %eax
# Indirect jump to computed address
jmp    *%eax

The basic idea is to keep a table of relative offsets to the cases, and then jump using that offset. Because the switch expression could evaluate to anything, the compiler must first generate a range check, so that if the switch expression doesn’t map to any case, the program leaves the switch instead of crashing unpredictably.

But in reality the range check is useless, because the interpreter can control what bytecodes actually appear, so this is 3 wasted instructions. Also, the lookup and base+offset address computation seems kind of clunky. I’d prefer dispatch code something like this:

jmp    *%edx

This is direct threading. In principle, the idea is simple: the instruction code (e.g., OP_ADD) is the address of the case target code. (In the basic design, instruction codes are integers.) Then, this jump is all you need for dispatch.

Coding up direct threading is weird; normal C compilers don’t know how to do it. But it can be done with GCC’s computed goto extension (see the paper on direct threading from the SquirrelFish announcement). See also my tiny interpreter.

I believe TT and Spidermonkey use an intermediate design called indirect threading, which gets most of the speedup of direct threading, but allows integer opcodes. The opcode is an index into a table of case target addresses. So the dispatch code has to look up the case target, then jump, something like:

leal   TABLE(%edx,4), %eax
jmp   *%eax

Not too bad. I have no idea how significant the difference between direct and indirect threading is in practice, but even a few percent speedup would be great.

Register-based interpreter. “Traditional” interpreter design keeps all operands and intermediate values of expressions on a data stack. (This is completely different from the call stack.) I can think of a few design reasons why this might be, but I really don’t know why it’s been done this way. Anyway, with a stack, bytecode for a line of code like “x = a + b + c” looks like this:

LOAD a  # push a onto stack
LOAD b
ADD      # pop 2 elements, add them, push the result
LOAD c
ADD
STORE x

One nice thing about this is that the algorithm for generating this bytecode is very simple, so it’s not a lot of work to code and runs fast. But there’s a lot of code just to manipulate the stack here, and it might be nice to avoid it. (Real stack-based programs have even more stack manipulation, with DUP, SWAP, ROT, etc. operations.)

A register-based design avoids the stack by using a fixed array of “slots” or “registers” for operands. (These registers are completely different from assembly code registers.) The register-based version of the above line of code would be something like:

ADD temp, a, b   # temp = a + b
ADD x, temp, c

Keep in mind that in the register-based bytecode, where I have “x” it would actually have something like “0″, if x had been assigned slot 0 in the register table. Note also that the bytecode generator has to analyze the code a bit to decide how many registers are needed and assign them to the variables and temporaries.

Each design has advantages and disadvantages:

  • Bytecode code size. Stack-based bytecode programs tend to be smaller, because the operands are implicit. But the register-based program gets to omit the stack manipulation instructions, so the stack-based bytecode should be only a little bit smaller. The memory savings shouldn’t matter in most environments, but reading less bytecode will save the interpreter a bit of time.
  • Bytecode instruction count. As shown above, a register-based program will have fewer bytecodes. Fewer bytecodes to execute means faster run times, if it takes the same length of time to execute a bytecode in each design. Which it doesn’t:
  • Operand access. To access operands, the stack-based interpreter just reads and writes from the top of the stack or very near it. A register-based interpreter needs a two-step process: first it has to get the address of that register, then read or write it. But the register-based version doesn’t need to adjust the stack. Still, it seems like more work per instruction for the register version: for ADD, the register version needs to compute 3 operand addresses, while the stack version just one (the new stack pointer).

Overall, when you look across instructions, I think the total amount of operand addressing computation is about the same. The stack version distributes it across more instructions. The stack version might be able to save a few by having special instructions like LOAD0 (to load slot 0, needing no computation). But the register version still has few instructions, so fewer dispatches. With direct threading, the dispatch is fairly efficient, so this is less important, but an indirect jump is still usually pretty expensive compared to a normal instruction because of the high branch mispredict probability.

Microbenchmarking. I wrote a tiny interpreter that runs a bytecode program for an empty for loop using doubles as numbers (as in JS) to test out these things for myself. I tried 4 design choices: direct threading vs. switch and stack vs. register. The times for 10M iterations in milliseconds:

Stack Register
Switch 230 90
Direct 105 55

Here you see huge differences. The reason the differences are so big (far bigger than WebKit got) is that my tiny interpreter’s opcodes are very simple. In a real JS interpreter, the code to run for the average operation is a lot longer, so dispatch overhead is a much smaller fraction of total time, and dispatch overhead is the main thing saved by direct-threading and registers.

Note that the speedup of register vs. stack is smaller in the direct-threaded case, as I predicted above. But not that much smaller, because the dispatch really is expensive compared to everything else in this simple interpreter.

You can get my microbenchmark code here.