Main menu:

Site search

Categories

Archive

PLDI Sampler, part I

I’m back from PLDI in Dublin. The weather was nice and the Guinness was excellent. The research program was also very good. It seemed like this year there was a lot of variety and a different topic selection from previous years. I’m going to blog some notes on an arbitrary (slanted toward things with more likely Mozilla application) sample of talks.


PetaBricks: A Language and Compiler for Algorithmic Choice

Jason Ansel, Cy Chan, Yee Lok Wong, Marek Olszewski, Qin Zhao, Alan Edelman, Saman Amarasinghe

PetaBricks aims to help programmers produce well-tuned, possibly parallel, high-performance algorithms. The initial focus has been on dense array algorithms, and also sorting as a basic example. In PetaBricks, the programmer can express directly a choice of algorithms and tuning parameters. For example, a programmer can write a sort procedure that offers a choice between insertion sort, n-way merge sort (with n as a tuning parameter), and quicksort. The PetaBricks compiler will then test the algorithm over a range of input sizes and determine the best algorithmic choices. For example, for sorting on an 8-way Xeon, PetaBricks found that it’s best to do a 2-way merge sort until the sublists are smaller than 1420 elements, then quicksort until the sublists are smaller than 600, and finish with insertion sort. Programmers can do this sort of thing by hand, but it’s a ton of work, and has to be redone for every platform.

We at Mozilla have already done a bit of autotuning of TraceMonkey’s “magic number” parameters (such as the number of times to attempt to record a certain trace before “blacklisting” it), and Jesse and I have had discussions about autotuning other parameters in Mozilla. The PetaBricks paper probably has other ideas and insights we could apply in this domain. We could also consider using its ideas more directly to tune Mozilla data structures such as hash tables and string representations.

Staged Information Flow for JavaScript
Ravi Chugh, Jeff Meister, Ranjit Jhala, and Sorin Lerner

This work is designed to protect against attacks made by JavaScript added to web pages by third parties. For example, if site A includes a box with advertising content from site B, the content from B could include JavaScript that makes some kind of attack, or maliciously modifies the main content of A. Site A can in theory prevent this kind of attack by providing an information flow policy and then running a static analysis on the entire web page to verify that the policy is obeyed. For example, site A could specify that JS from other sites is not allowed to navigate away by writing to document.location. Then, site A would provide a static analysis (possibly as a Firefox addon) that would analyze the entire page before executing any JS.

The main problem is that such a static analysis would run very slowly on the client machine–there would be a delay of 10 seconds or more, which is clearly unacceptable. Enter the staged analysis: this paper describes how the author of site A can run a static analysis on just the code from site A, which determines a small set of things that have to be checked about site B, the “residual” check. Then, when the user goes to site A and JS is loaded from third parties, the residual check is run on the third-party JS in about 0.1 seconds. This seems potentially useful on the web and is also a great example of how to partition a static analysis into a long-running ahead-of-time analysis and a fast online analysis.

Efficiently and Precisely Locating Memory Leaks and Bloat
Gene Novark, Emery D. Berger, Benjamin G. Zorn

This paper is about, uh, detecting memory leaks and bloat. More specifically, the paper describes how to make a particular leak detection technique, staleness metering, run fast and produce no false positive leak reports.

As background, the “staleness” of a heap-allocated object is simply the length of time since that object was last touched (by reading from it or writing to it). Thus, objects with low staleness are unlikely to be leaks, while objects with high staleness are probably either leaks (if they will never be accessed again) or bloat (if they are just accessed rarely and there is a way to compact them or release them and recompute them later). The main problem with measuring staleness is that doing it exactly would slow down the program by 100x or so. The main solution that was explored previously was to sample a fraction (say, 1/1000) of memory accesses and estimate staleness from that. The main further problem is that if only 1/1000 of accesses are measured, then a lot of objects may appear stale when they are really not, just because sampling missed the accesses.

This paper proposes a new way of doing the sampling, “data-based” sampling, that produces no false positives. The key mechanism is to use the virtual memory system to protect heap pages so there will be a page fault when they are touched. When a page is touched, the current time is recorded as the estimated time of last touching for every object on the page. Thus, this time provides an underestimate of the staleness of every object on the page, and so objects with high estimated staleness must be stale in reality.

Without any additional cleverness, what I just described would be really slow, because every heap access would incur a page fault. The additional cleverness is that the paper uses a heap allocator that groups objects from the same allocation site and of the same approximate allocation time (and thus age) on the same pages. The paper says that objects that have the same allocation site and age tend to have the same access pattern. Given that fact, it makes sense to classify pages adaptively as hot or cold, based on how recently they were touched. Hot pages are left unprotected and not tracked, so most heap accesses don’t page fault. And this shouldn’t cause the detector to miss many leaks, because the allocation strategy means a hot page probably contains mostly hot objects, which aren’t leaks. Conversely, cold pages are protected and so they are tracked closely, but this doesn’t cause many page faults because cold pages aren’t accessed often.

The authors put all this together into a tool called Hound, which they tested on some standard benchmarks. Unfortunately, it doesn’t appear to be distributed online currently. Running a program under Hound is supposed to increase heap usage by only 10% or so, and execution time usually only a few percent (although more in some cases), so it sounds like it would be great to be able to enable Hound for nightly test builds.

PLDI 2009

Next week we’re taking the TraceMonkey show on the road to PLDI 2009 in Dublin, Ireland. Andreas Gal will of course be presenting our paper Thursday afternoon, and David Anderson (who is back at Moz for the summer) is also coming. Fortunately, the Mozilla 1.9.1 (FF 3.5) blocker list for JS is empty, and hopefully, despite my best efforts, will stay that way. (I set the release tree on fire last night by checking in a typo with a patch to enable one of my blocker bug fixes. Fun.)

Otherwise, I’ve been pretty busy stomping TraceMonkey bugs. I’ve also been doing a lot of work on getting TM to trace accesses to variables defined in lexically enclosing scopes (i.e., upvars/funargs/closures) and the JS arguments construct. A lot of that work is still ongoing or was finished too late to be adequately tested for FF 3.5, so it’s only on trunk, but the first improvement I made, tracing upvars defined off the current trace, is in the release branch.

Finally, Nick Nethercote has been fixing up my TraceVis instrumentation so it can be checked into the main Mozilla tree. Soon it should be possible to enable TraceVis in a standard build. I’m also starting to hear about some other work to make TraceVis more useful for browser builds and longer runs.

TraceMonkey@PLDI

Last fall, we submitted a paper on TraceMonkey to PLDI (officially: ACM SIGPLAN Conference on Programming Language Design and Implementation), one of the top conferences for programming language research. Our paper was accepted, and Andreas Gal will be presenting the paper on June 18 in Dublin, Ireland.

We’re hosting a PDF copy on the blog. And here’s a direct link: TraceMonkey PLDI 2009 PDF.

OSQ: the next 5 languages for the web

In the evening at the OSQ retreat, we had some informal discussions about new scripting language designs and languages for mobile devices. The starting points for these discussions were (a) that scripting language programmers will soon want to use parallelism, and (b) that mobile devices will have uniprocessor performance at least 12x slower than laptops for the foreseeable future.

New JS VM techniques (like TraceMonkey’s) are getting us a 2-4x speedup compared to what we’ve been used to for a language like JS. Is that good enough for mobile? I don’t know. Bill’s results (in my last post) suggest that one way or another, we can get a further 2-3x, which might bring us to a 4-12x speedup compared to the old JS stuff. That might be good enough, or maybe not.

The next question in my mind is, if we want to someday use some new language to get a bigger speedup, what should it look like? The first thing is that it should have a reasonably good translator to JS, so that it can run on any browser, even if the fancy new language is not supported. After that, I think two interesting starting points are Ocaml and Lua.

Ocaml is a high-level typed language with lists, garbage collection, and a lot of other features to make life easy for programmers. The biggest problem is the types: many programmers seem to prefer untyped languages like Python and JS, and types make prototyping more difficult. The types are also a great advantage: they really help with reliability and performance: in shootout-type benchmarks, Ocaml achieves performance very close to C.

Lua is a scripting language in the vein of Python and JS with simple, well-documented semantics. This simplicity makes it possible to deploy a tiny Lua interpreter. It also helps with performance: Lua has a small, general set of hooks for metaprogramming, while Python and SpiderMonkey have several such mechanisms, which has a cost in complexity and performance. Also, Lua doesn’t include complex features like inheritance in the base language, which presumably makes it a lot easier to optimize property accesses.

A related question is how to expose parallelism. I tend to favor mechanisms that don’t share mutable state, such as DOM worker threads or map-reduce. One reason is that in scripting languages, property accesses are not at all atomic (they potentially require multiple hashtable lookups) so every property access has to be locked. Even with a lot of clever implementation, the locks are still pretty costly, and even for programs that are not actually concurrent: we think on the order of 10-30% in SpiderMonkey. (And Python TBMK still has the global interpreter lock, and the official recommendation is to use multiple processes as a better way of getting real concurrency.)

Parallelism is easier in functional languages like OCaml, because they don’t mutate state very much–mostly they just create new values and read them. Also, in a typed language, a property access is much more like a simple memory read, which is easier to make atomic, or atomic enough.

So, I would like to see at least 2 new web languages, which could be JavaScript dialects, evolutions of other existing languages, or something entirely new. The first, which might be called MiniJS, would be an untyped scripting language, which would be used for most applications. MiniJS would look like JS with simplified semantics (and no ‘with’ statement for sure) and support for concurrent programming. The other language, which could be called TypedJS, would be a typed functional language inspired by OCaml, Scala, and perhaps even the dreaded ES4. TypedJS would be used for applications with stringent performance and reliability requirements. The two languages should be able to communicate but I don’t think they need to be mixed freely: in fact, making them part of the same language would probably make it too hard for each language to realize its full potential.

OSQ: Dynamic language optimization

Random bits from the second half of yesterday’s OSQ events:

Bill McCloskey gave a quick talk on his experience doing some optimizations to Python. (From a VM optimizer’s point of view, Python and JS are very similar.) He had previously used whole-program type inference to drive a translation to C, and in this way got performance about a factor of 2 or 3 worse than C, which is excellent for a dynamic language. But he said that the inference process was kind of slow, so he wanted to try some lightweight techniques to see how far they would go.

First, he tried something he called “dictionary hints”, in which he used a static analysis to guess the address of the result of a hash table property lookup (as in ‘o.f’). He modified the interpreter to look there first, and only if that fails do the general hash table lookup. That gave him a nice speedup (20%+). IIUC, dictionary hints are the static analogue of property caching, as done in SpiderMonkey (and Nitro and V8, I believe).

Bill also tried replacing Python’s basic value representation of a pointer to the heap with a tagged value that keeps integers inline, thus avoiding the extra memory load and a lot of extra memory allocations. He got a big speedup (3-4x) from that. SpiderMonkey also already uses this technique, as do Nitro and V8.

In summary: he found these rough levels of run time relative to C in practice:

  • 2-3x for Python compiled by Dynamite (his type inference-based compiler)
  • 4-10x for Python run with psyco (a dynamic specializer somewhat related to TraceMonkey)
  • 20x for normal Python

I haven’t looked at JS vs. C lately, but I believe TM is currently on the order of 4-10x vs. C, like pysco. In theory, dynamic compilers should be generate code that runs at least as fast as static compilers, because they have strictly more information. But in practice, static compilers seem almost always to be faster. In this case, Bill’s static compiler can generate very fast code, but I think it takes a few minutes to do the type inference, so it would be beneficial in a dynamic compiler only for a very long-running program. (If you have 2 cores, the VM could start the compiler at the same time the program starts running, and as long as the program runs for several minutes, it would eventually speed up greatly.) But at least on the Web, programs that need even 1 minute of CPU time are still rare. But this may change.

Berkeley OSQ Retreat

I’m at the Berkeley OSQ (Open Source Quality project) Retreat in Santa Cruz right now representing Mozilla. It’s an annual event where professors and grad students present their latest research results and ideas.

There’s a lot of good stuff here, so I’m just going to blog about a few things that seem particularly relevant to the web. Juan Caballero just finished his talk, Secure Content Sniffing for Web Browsers, or How to Stop Papers from Reviewing Themselves. He explained content-sniffing XSS attacks and told us about his work with Adam Barth and Dawn Song on analyzing website and browser vulnerabilities to those attacks, and their recommendations, which have been adopted by the HTML 5 working group.

He also mentioned some binary program analysis (dynamic, static, and “concolic” hybrid) frameworks their group is using, which they call BitBlaze.

Google Summer of Code: Rodrigo Sol

A little while ago, Rodrigo Sol at the Federal University of Minas Gerais submitted a Google Summer of Code proposal to create a new register allocator for TraceMonkey. I’m pleased to announce his proposal was accepted!

Rodrigo’s project idea is to implement Register Allocation By Puzzle Solving, a paper in PLDI 2008 by his advisor, Fernando Pereira, and Jens Palsberg. The paper describes a register allocation algorithm based on a model of fitting puzzle pieces (program values) into a board (the register file). Their results show allocation speed equal to the fastest good allocators (such as LLVMs) and also high performance for the generated code (in the register allocation aspect), which sounds like exactly what we’d love to have in TraceMonkey.

The end result will be a new register allocator for nanojit, TraceMonkey’s back-end code generator. Initially, I’m hoping Rodrigo’s allocator will reduce the number of spills and register-to-register moves, which should improve performance by an as yet unknown amount on all TraceMonkey applications. It should also help other applications that use nanojit, such as the simple regular expression compiler I coded up a while back. (Side note: if anyone out there is looking for a compiler-related project, extending and/or redesigning that compiler might be fun.) Later on, I hope the new allocator will help enable other TraceMonkey advances, such as carrying values in registers across trace loop edges and from one trace to another.

Rodrigo has a wiki for his project, including a blog and links to related articles and papers.

TraceVis: performance visualization for TraceMonkey

I’ve been working on a visualization of TraceMonkey performance, with the goal of revealing what the JS VM is doing, and why it runs certain programs fast or slow, so we can figure out how to make the slow ones fast too. In this post, I want to show off the results and explain how to read them, hopefully explaining a bit about how TraceMonkey works in general while I’m at it.

Background on TraceMonkey. First, I need to explain how TraceMonkey works at a high level. Readers who already know the basic ideas behind TraceMonkey can skip to the next section.

Before TraceMonkey (TM), we had an interpreter. The fundamental idea of TM is to select hot (frequently run) regions of JS code, compile them to fast native code (x86, ARM), and then use the native code for those regions.

In all compilers, generating fast native code requires a lot of information about the program’s run-time behavior. Specifically, the compiler needs to know what values are constant in what regions, what branches are or are not always taken, and, crucially for dynamic languages like JS, what types variables have in what regions. For example, in compiling JS a+b, if we don’t know the types of a and b, the compiler needs to generate native code to handle every type combination, along with type tests and branches on the run-time types. And then the compiler has to carry along extra bits to record the type, and previous operations that set a and b need to generate those extra bits. And so on. That’s a lot of native code, and it’s only slightly more efficient than the interpreter. Conversely, if the compiler knows the types are both, say, double, then the compiler can simply generate a native add instruction, which is as fast as possible.

It’s hard for the compiler to figure out types and other such information for dynamic languages like JS. One reason it’s hard is the lack of type declarations, but there are others, such as the ability of eval to create new variables with values of any type at any time.

TraceMonkey solves this problem by collecting information for the compiler dynamically. That is, when TM wants to compile a certain region of code, it actually runs that region in the interpreter. As the interpreter runs, TM records the path taken through the code and all the types and constant values seen. The result is a linear trace through the code with type, value, and branch annotations. The compiler (nanojit) can then relatively easily compile the trace to fast native code.

Now, for many programs this works wonderfully and the program runs 2-20x faster in TraceMonkey vs. the pure interpreter, but some programs don’t speed up or run more slowly. We need to understand why in order to improve TM.

TraceMonkey VM Activities. To better understand TM performance, I broke down what TM does into 6 major activities.

  • When TM starts running a program, it always starts by interpreting the program, exactly as in the non-tracing JS engine.
  • When execution reaches a point where TM might want to start a compiled trace, TM spends a bit of time monitoring the execution: checking to see if it already has a compiled region, counting the number of times passed, and deciding whether to start a trace. Monitoring is a kind of overhead: while monitoring, TM isn’t running the user’s program, but monitoring is a necessary cost of finding and optimizing traces.
  • If TM does decide to create a new compiled trace, it runs in the interpreter while recording the trace, including operations and types of values. During this time, it is running user code a little slower than the basic interpreter.
  • When the trace is finished, TM compiles the trace to native code. This is another form of overhead.
  • As I mentioned above, as part of monitoring, TM checks to see if it already has a compiled native trace starting at the current point. If so, TM selects the right trace and prepares to run it, which I call executing the trace. This is a third form of overhead.
  • Finally, TM can be running native code compiled previously. Compiled native traces run 2-20x faster than the interpreter, with a typical speedup factor of about 2.5.

Visualizing TraceMonkey Activities. Now, let’s see how that looks in a picture. I’m going to use 2loops.js, a sample program that computes the mean and variance of the numbers 0-999,999 using two separate loops:

var n = 100000;
var sum = 0;
for (var i = 0; i < n; ++i) {
  sum += i;
}
var sum_squares = 0;
for (var i = 0; i < n; ++i) {
  sum_squares += i * i;
}
var mean = sum / n;
var variance = sum_squares / n - mean * mean;
print('mean:     ' + mean);
print('variance: ' + variance);

Here is the TraceVis output. Click for a version large enough to actually read. The numbered boxes going clockwise around the chart show how to read each element of the chart and what the chart tells us about what TM is doing.


TraceVis output

TraceVis on SunSpider Click here for TraceVis output for all the SunSpider benchmarks. I give the speedup vs. pure interpretation on the front page so you can get a feel for what pictures go with slow and fast execution. Here are a few examples interpreted in detail:

crypto-sha1 traces very well, with a 6x speedup vs. the interpreter. The picture looks a lot like the picture for 2loops.js, but with more traces:


TraceVis crypto-sha1

The middle purple and blue stripes are interesting: TM had to create 6 native traces before it was able to really switch to native code. Figuring out why this happens requires additional tools. In this case, because there aren’t too many traces, debug spew (environment variable TRACEMONKEY=verbose) is readable. To find where all the traces start recording, I took a debug shell build and ran TRACEMONKEY=verbose dist/bin/js -j ../t/crypto-sha1.js | grep starting. I got:

recording starting from ../t/crypto-sha1.js:221@119
recording starting from ../t/crypto-sha1.js:152@29
recording starting from ../t/crypto-sha1.js:152@39
recording starting from ../t/crypto-sha1.js:63@154
recording starting from ../t/crypto-sha1.js:63@159
recording starting from ../t/crypto-sha1.js:90@5
recording starting from ../t/crypto-sha1.js:91@31
recording starting from ../t/crypto-sha1.js:92@52
recording starting from ../t/crypto-sha1.js:55@111
recording starting from ../t/crypto-sha1.js:177@34

There are 10 places recording started, corresponding to the 10 purple bands in TraceVis. Numbering from 1, 4-9 are the traces for the middle band. The first two traces are for the loop starting at line 61, which is an inner loop. (Inner loops get hot before outer loops, so they tend to be traced first.) The traces cover two different paths through the if. Then, the tracer ends up discovering 3 different hot paths through the function sha1_ft around line 90, so they are traced as well. Finally, an outer loop at line 53 gets hot, so it gets traced as well. At this point, there are enough traces to cover all the cases, so we get to stay on native code until the containing function returns.

Thus, the purple/blue banding pattern followed by a long stretch of green indicates the buildup of several traces through a loop to account for different paths or types or inner and outer loops. Once enough traces have been compiled to cover all the hot code of a loop (or set of nested loops), the chart goes green until the loop is done.

If you zoom in, you can actually see short bands of yellow and green in between the wide bands of purple and blue. This is because after compiling one of these traces, TM starts tries to execute the native code. Often TM gets several iterations in right away before it exits and needs to record more traces. Computers being fast and all, “several iterations” is usually a few microseconds or less.

3d-cube traces moderately well, achieving a 2.5x speedup over the interpreter. Let’s see why it doesn’t trace as well as sha1:


TraceVis 3d-cube

Here we have the familiar purple/blue buildup of compiled traces, but once we finish compiling, we don’t go solid green. Instead, we get a pattern that looks like blades of grass with red tips on a white background. Zooming in and looking at the vertical dimension, we can see that we get a bit of red (monitoring), then a bit of yellow (preparing to execute), then a strip of green (native code), then a bit of yellow (cleaning up after executing), then a bit of red (finish monitoring), then a strip of white (interpreter), and the pattern repeats.

This means we are repeatedly starting to execute native code, but then we are forced to leave the native code for the interpreter within a few microseconds. Evidently, the native code runs pretty fast, because we are getting a good total speedup even though we spend about 1/3 of our time in the base interpreter.

The next question is why we are leaving native code and returning to the interpreter, which this visualization doesn’t answer. There are about a dozen different reasons why we can exit from native code. I have another prototype instrumentation patch that counts the different exit reasons and and how much time we spend in the interpreter for each reason.

For 3d-cube, the vast majority (3808 of 3907) of these exits are for a reason I apparently called loop2 when I wrote the patch. As you may expect, I didn’t know what loop2 actually meant, but after conferring with Andreas, I learned that loop2 means we exiting an outer loop. When we exit an inner loop, we know we are still inside a loop, so we are probably in some hot code and should continue trying to record traces. But if we exit an outer loop, we are presumably back to one-shot code, which does not benefit from tracing.

But we exit this way 3808 times, which is impossible unless we are inside a loop. So something must be wrong. I used debug spew again, this time looking for “leaving trace” messages. Most of them are at lines 57 and 100 of 3d-cube.js. Line 57 is a loop inside a function DrawLine that appears to iterate over the pixels in a line. DrawLine is called only in a function DrawQube. DrawQube is called by Loop, which implements a loop by recursion, which TraceMonkey currently doesn’t support.

So, we can probably speed up 3d-cube even more by either (a) tracing recursion, (b) tracing tail recursion as loops (Loop is tail recursive), or (c) extending traces from outer loops if the outer loop exit becomes hot. And we didn’t actually know this until now.

string-tagcloud traces badly: it runs 5% slower with the JIT turned on.


TraceVis string-tagcloud

The pattern has 3 phases.

In phase 1, which accounts for about half of total time, we are running in the pure interpreter. The first real code run by this test is: var tagInfo = tagInfoJSON.parseJSON(...) over 175k of JSON. parseJSON is recursive. TM doesn’t even see any loop headers in that part, so TM never even goes to monitoring mode.

In phase 2, we have a white background with purple streaks and lots of red dots. The purple streaks are recording traces, but note that there is no blue to represent compilation. So the traces must be aborting for some reason; i.e., they encounter some feature TM doesn’t know how to trace. The red dots indicate that TM is seeing loop headers and monitoring them, but not doing anything with them, probably because no traces can be successfully compiled for them.

To understand phase 2 in detail, we need to know why recording is failing. Grepping the spew for ‘Abort’ will get the answer. In this case, all but 1 of the aborts are: Abort recording (line 184, pc 64): callname. This means tracing is stopping on a JSOP_CALLNAME bytecode in the interpreter. Line 184 contains a recursive call to a function walk defined inside another function. Poking around a bit, I noticed that a detailed message would be printed with lower-case ‘abort’, so I tried that and got abort: 4815: fp->scopeChain is not global or active call object. I found that error message in the code, and after a little debugging I found the reason the abort is activated.

The problem is that during lookup of the variable walk in the function call expression, walk is found in an enclosing scope that is a function call scope that is not part of the trace. Whenever that happens, the tracer aborts. I think this is just an implementation limitation: in order to call the function, the tracer needs to have the function in its “tracker”, so that it can refer to the LIR opcode that loads the function. (That’s just how it works right now.) But if the function is defined “above” the trace, then the tracer’s tracker will not have seen it.

I think this wouldn’t matter if we supported recursion, because then the outer scope would be part of the trace anyway.

Phase 3 consists of a little bit of recording and compiling, followed by a big patch of native code execution. That’s the loop inside the function makeTagCloud, which doesn’t do anything scary and traces well.

Conclusion. I’ve explained how to read TraceVis charts, and then showed how to read the charts along with other debug info to diagnose performance problems (or performance wins) in detail.

I want to thank my girlfriend Natalie for inspiring TraceVis by sending me an article on visualizations in computer forensics.

The code is available in my personal hg repository. You can check out a local copy by running

hg clone http://hg.mozilla.org/users/dmandelin_mozilla.com/tracevis/

The repo includes the patch that instruments TM with activity counters and a bunch of Python scripts for processing the outputs. The image generation scripts require PIL (Python Imaging Library).

Looking at SunSpider charts is fun, but I really want to apply this tool to performance problems in real applications like bugs 463487, 463478, 465773, and 468840. I’d also like to look at the V8 benchmarks, where TM currently doesn’t do as well as it does on SunSpider.

SunSpider Statistics, Part II

In my last post, I laid out some questions I wanted to answer. I decided to tackle these questions empirically, by running SunSpider 100 times with –runs=10, for a total of 1000 test suite runs, and then see if SunSpider’s confidence intervals and tests come out as predicted. Empirical methods are nice because you can do interesting statistics without doing hard math.

The first one:

Question 1: Are SunSpider times normally (gaussian) distributed?

There are a bunch of normality tests out there but I didn’t want to do anything complicated. Instead, I just made a histogram of times and compared it to the best fit normal distribution. For my 1000 runs, the mean was 1232.2 and the standard deviation was 11.48. Here is a plot of the observed times in black and the normal distribution with mean 1232.2 and standard deviation 11.48 in red (ignore the blue line for now):

Normal Distribution Fit

The shape is vaguely bell-curve like, but the scaling really isn’t normal. In particular, the peak of the empirical data is about twice as high as the normal distribution model, and the data is much more concentrated a center band. Also, the tail of the real data goes much further to the right than the normal distribution. In particular, for the longest time I got, 1329ms, the normal distribution has a probability of 1.3×10^-14. Finally, the left tail of the data falls off a bit faster than the normal distribution.

It’s safe to say these times are not normally distributed. If they are, then my sample is a once-in-the-lifetime-of-the-earth rarity. Another reason not to expect a normal distribution in the first place is that a normal distribution can generate any value, positive or negative, but SunSpider times can only be positive. So I tried fitting a related positive-only distribution, the log-normal, shown in blue. To me, it looks neither a better fit nor a worse one.

I think these times might come from a model with two different kinds of noise:

T = m + En + Ep

where m (”mean”) is the baseline time as before. En is noise centered at zero with a higher peak than a gaussian distribution. I’m not sure why the peak is so pointy; it might be because of the limited (1ms) timer resolution. Ep is another source of noise that is always positive (making the program run longer), perhaps with an exponential or log-normal distribution. I think Ep comes from rare events that cause relatively long delays, such as a background process waking up to check for updates on some program I never use.

Although the times aren’t normally distributed , and all of SunSpider’s statistics are based on assuming a normal distribution, the statistics might work out all right in practice. We have to check them separately.

Question 2: Are SunSpider’s confidence intervals valid?

SunSpider gave me 100 95% confidence intervals for my 100 runs. If they are valid confidence intervals, then about 95 of them should contain the true mean.

Of course, I have no way of getting the true mean from any finite number of samples, but I’m going to arbitrarily assume that taking the mean or median of my 1000 samples is about right. The mean of 1000 samples is 1232.2. The median of the 100 averages is 1232.4, so it doesn’t really matter which one I use. I used the median because I figured it would reduce the effect of the outliers at the top.

In my trial, 86 of 100 confidence intervals contained the median. I think that’s pretty good, given the complex distribution of the underlying process.

For comparison, I did the same confidence interval test 20 times on sets of 100 fake SunSpider runs where the results are generated from a true normal process with the same mean and standard deviation. In those tests, 87 to 97 of the results contained the median, usually around 93 or so, about as predicted. (Because the median isn’t the true mean, the confidence intervals probably miss the median more often than they miss the true mean.)

So, for the total time, the SunSpider confidence interval is probably pretty good if regarded more as a 90% confidence interval instead of 95%.

I did the same experiment for each individual test as well. On those, the score ranged from 64 for 3bit-bits-in-byte to 100 for cordic, with most of them around 85-95. Broadly, I’d say the results are about as reliable as those for the total score. It does seem that the longer tests tended to do worse, probably because the rare-delay noise gets more of a chance to kick on longer tests.

On medium length tests (runtime 8-80ms or so), the confidence intervals seem very reliable. I figured that the normality assumption might hold better for these tests, but it turns out they have pointy peaks and long right tails too. The real reason the median is always in the confidence interval on cordic is that (a) the results are very tightly clustered (99% are either 22ms or 23ms) so most runs are close to the mean, and (b) there are some outliers, bumping up the standard deviations a bit.


Question 3: Are SunSpider’s significance tests valid?

I tested the significance indicators using a similar procedure as for confidence intervals. Using my 100 runs, I did 50 pairwise comparisons. Because they are all for the same JavaScript engine, all differences are random noise, so there should be a “significant” indicator on about 2 or 3 pairs (5%).

SunSpider reported “significant” in 4 cases (8%). That’s really pretty good, and I’d be willing to believe that the extra 1.5 “significant”s are just a feature of my sample, except for the fact that all my other tests so far show SunSpider’s statistics to be a little less precise than with a true normal distribution.

On the individual tests, “significant” indicators ranged from 0% to 24% of trials. I couldn’t find any real pattern in the numbers. But I would say that the “significant” indicators aren’t really reliable at the 5% level for individual tests. Also, from experience with the data and actually debugging performance effects, I’ve found the significance indicator is fairly useful on the total, but not on the individual tests.

Whether this is good enough or not depends on how you intend to use the tests. For a one-off run, I’d say the significance indicator on the total is solid, and the others can be used with a grain of salt and in the light of other data.

But if you run SunSpider all the time, like we do, you would get a lot of false positive differences even on the total. I’d say a 99% significance level would be more useful in that case, which corresponds to maybe doubling the width of SunSpider’s confidence intervals. For a –runs=10 run, that means a difference of 15ms or more is significant.

But given the right-tail outliers, if you do see a significant time increase, it’s probably always a good idea to rerun to make sure you weren’t looking at an outlier. Seeing two outliers in a row is pretty unlikely.

In part III, I’ll get to the questions of how to process the results in practice. Also, over the weekend I’m going to try to run a series of tests over our revision history, so I should also have some results to report from that next week.

SunSpider Statistics, Part I: Questions

The TraceMonkey project uses SunSpider regularly to measure JavaScript performance. The problem is that the results are often hard to interpret. Is an 11 ms improvement a real speedup or just random? We all have our intuitions about how to use the numbers, but I wanted to try a little statistics on the problem. The big questions are: what do those SunSpider confidence intervals and significance tests really mean, what tests can developers use day-to-day to check their work, and what tests can we use over time to look for regressions? Below I lay out the statistical background behind SunSpider’s results and make these questions more concrete.

SunSpider’s confidence intervals. As I’m sure you know, a basic SunSpider result looks like this:

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                 1226.2ms +/- 0.3%
--------------------------------------------

  3d:                   176.6ms +/- 0.5%
    cube:                43.8ms +/- 1.0%
    morph:               34.5ms +/- 1.1%
    raytrace:            98.3ms +/- 0.7%

  access:               155.2ms +/- 0.5%
    binary-trees:        45.3ms +/- 0.8%
    fannkuch:            67.0ms +/- 0.7%
    nbody:               28.7ms +/- 1.2%
    nsieve:              14.2ms +/- 4.0%

  ... and 19 more individual tests in 7 groups

Doing the multiplication, the 95% confidence interval works out to 1222.5-1229.9ms. What this confidence interval really means is rather technical.

First, SunSpider’s confidence intervals are based on a gaussian model of run time. The model assumes that the time taken on any given run is a random variable

T = m + E

where m (”mean”) is a constant “baseline” run time and E (”error”) is a random variable representing noise. The model also assumes that E is gaussian with mean 0. If the model is perfectly true in reality, then there is a strong mathematical interpretation of the confidence interval. But to the extent that the model deviates from reality, the confidence interval is less meaningful.

This brings up Question 1: Do SunSpider times vary with a gaussian distribution?

If the model is true, then over a large set of SunSpider runs, the baseline time will be in the confidence interval in 95% of runs. With the run I showed above, this seems to imply that, the baseline time is in the interval 1222.5-1229.9ms with probability 0.95, but according to the standard statistical interpretation, that interpretation is wrong, because 1222.5, 1229.9, and the baseline time are not random variables, so probability does not apply. But I don’t think the intuitive interpretation is that bad.

Question 2: In a large set of SunSpider runs, is the baseline time in the calculated confidence interval 95% of the time?

SunSpider’s significance tests. SunSpider can compare two sets of results, giving output like this:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.009x as slow*  1226.2ms +/- 0.3%   1237.3ms +/- 0.8%     significant

=============================================================================

  3d:                  *1.031x as slow*   176.6ms +/- 0.5%    182.1ms +/- 2.0%     significant
    cube:              *1.032x as slow*    43.8ms +/- 1.0%     45.2ms +/- 2.7%     significant
    morph:             ??                  34.5ms +/- 1.1%     34.8ms +/- 2.3%     not conclusive: might be *1.009x as slow*
    raytrace:          *1.039x as slow*    98.3ms +/- 0.7%    102.1ms +/- 3.4%     significant

  access:              ??                 155.2ms +/- 0.5%    156.3ms +/- 1.2%     not conclusive: might be *1.007x as slow*
    binary-trees:      *1.015x as slow*    45.3ms +/- 0.8%     46.0ms +/- 1.5%     significant
    fannkuch:          ??                  67.0ms +/- 0.7%     67.3ms +/- 1.8%     not conclusive: might be *1.004x as slow*
    nbody:             ??                  28.7ms +/- 1.2%     29.0ms +/- 1.2%     not conclusive: might be *1.010x as slow*
    nsieve:            -                   14.2ms +/- 4.0%     14.0ms +/- 2.4%

Differences are reported as “significant”, “not conclusive”, or blank, presumably meaning “not significant”. SunSpider’s “significant” means that the difference is significant at the 0.05 level according to the t test, which brings up more technical bits.

In general, significance tests are meant to separate observed differences that are real (the underlying baseline means are different) from random differences (the means are the same, and only the noise was different). In a basic significance test, we implicitly assume the means are the same, and then look to see if our observations tend to disprove that. If our observations would be very unlikely under that assumption, then we say the difference is significant. (c.f. if I think a certain roulette wheel is fair, and it comes up “13″ 10 times in a row, I have good reason now to think it’s not fair.) So, to say that two runs are different at the 0.05 level is to say: if the means are really the same, then there was only a 5% chance of getting runs that different. In other words, if I go around doing experiments looking for differences when there aren’t any, I’m going to get p=0.05-significant results 5% of the time.

Since SunSpider has 26 tests, this means if you do 2 runs of the same system, if the significance tests are applicable, you’d probably see 1 or 2 false “significant” differences in the results.

The t test is a significance test that applies when comparing two experiments with the gaussian model introduced above, where each experiment runs N trials and averages them. There’s a bunch of mathematical detail, but the basic idea is fairly simple. The idea is to construct a confidence interval not for a mean in one experiment, but for the difference in means in two experiments, say a 95% confidence interval. If that confidence interval doesn’t contain 0, then we’ve seen a result that’s pretty unlikely (<=5%) if the means are the same. So the difference is significant.

The mathematical formulas used in the t test are just to correct for the fact that we don't know the real standard deviation, just an estimate from our data. They're a more complex analogue to the mysterious "n-1" used in place of "n" in computing sample standard deviations and deriving 1-experiment confidence intervals.

The question for SunSpider is Question 3: If we run a large number of SunSpider comparisons of identical JavaScript engines, does SunSpider say “significant” about 5% of the time?

SunSpider in practice. Our everyday use of SunSpider is through a little script called bench.sh that runs one trial and prints out the total time. The idea is that a developer can run it a few times, make a change to TraceMonkey, and then run it a few more times and see if it seems to be better or worse. It’s supposed to be a quick test, taking 5 or 10 seconds to run and interpret, that can be run after any little change. So I want to know the answer to Question 4: What can we do with 3-5 SunSpider runs? How small of performance changes can it reasonably detect? What’s the best way to look at the numbers: average them, take the minimum, or something else?

Another use of SunSpider is to run over time, over our Mercurial history, to look for performance regressions that sneaked in with other changes. For that purpose, the runs will be automatic, so it’s OK if they take a while: we can do a lot of trials. So here, I want to answer Question 5: how many runs are needed to detect the smallest performance change that counts? To make that question real, we also need to know what kind of performance change counts. Theoretically, a 1ms slowdown is bad, because after 100 checkins like that, we’ve slowed down by 100ms, which definitely matters. But we seem to get changes of plus or minus 1-3ms “randomly” with many checkins just because gcc’s optimizer does something a little different. So I figure 5ms or so is a real difference that we’d like to detect.

Summary. I’ve laid out two broad areas I’d like to learn more about. First, I want to know if SunSpider’s statistical claims hold up in reality, and if not, where the deviation is. Second, I want to figure out the right number of runs to use, how to combine them, and how small of differences they can detect for a few important practical applications.

In Part II, I’ll start trying to answer these questions.