SunSpider Statistics, Part I: Questions

The TraceMonkey project uses SunSpider regularly to measure JavaScript performance. The problem is that the results are often hard to interpret. Is an 11 ms improvement a real speedup or just random? We all have our intuitions about how to use the numbers, but I wanted to try a little statistics on the problem. The big questions are: what do those SunSpider confidence intervals and significance tests really mean, what tests can developers use day-to-day to check their work, and what tests can we use over time to look for regressions? Below I lay out the statistical background behind SunSpider’s results and make these questions more concrete.

SunSpider’s confidence intervals. As I’m sure you know, a basic SunSpider result looks like this:

============================================
RESULTS (means and 95% confidence intervals)
--------------------------------------------
Total:                 1226.2ms +/- 0.3%
--------------------------------------------

  3d:                   176.6ms +/- 0.5%
    cube:                43.8ms +/- 1.0%
    morph:               34.5ms +/- 1.1%
    raytrace:            98.3ms +/- 0.7%

  access:               155.2ms +/- 0.5%
    binary-trees:        45.3ms +/- 0.8%
    fannkuch:            67.0ms +/- 0.7%
    nbody:               28.7ms +/- 1.2%
    nsieve:              14.2ms +/- 4.0%

  ... and 19 more individual tests in 7 groups

Doing the multiplication, the 95% confidence interval works out to 1222.5-1229.9ms. What this confidence interval really means is rather technical.
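As a quick check of that arithmetic, here is a minimal Python sketch that turns a mean and a "+/- percent" figure into interval endpoints. The helper name interval_from_percent is made up for illustration and is not part of SunSpider.

```python
def interval_from_percent(mean_ms, percent):
    """Convert 'mean +/- percent%' into (low, high) endpoints in ms."""
    half_width = mean_ms * percent / 100.0
    return mean_ms - half_width, mean_ms + half_width

# Total line from the result above: 1226.2ms +/- 0.3%
low, high = interval_from_percent(1226.2, 0.3)
print(f"{low:.1f}-{high:.1f}ms")  # prints 1222.5-1229.9ms
```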

First, SunSpider’s confidence intervals are based on a gaussian model of run time. The model assumes that the time taken on any given run is a random variable

T = m + E

where m (“mean”) is a constant “baseline” run time and E (“error”) is a random variable representing noise. The model also assumes that E is gaussian with mean 0. If the model is perfectly true in reality, then there is a strong mathematical interpretation of the confidence interval. But to the extent that the model deviates from reality, the confidence interval is less meaningful.

This brings up Question 1: Do SunSpider times vary with a gaussian distribution?

If the model is true, then over a large set of SunSpider runs, the baseline time will be in the confidence interval in 95% of runs. With the run I showed above, this seems to imply that the baseline time is in the interval 1222.5-1229.9ms with probability 0.95. According to the standard statistical interpretation, though, that reading is wrong: 1222.5, 1229.9, and the baseline time are not random variables, so probability does not apply to them. But I don’t think the intuitive interpretation is that bad.

Question 2: In a large set of SunSpider runs, is the baseline time in the calculated confidence interval 95% of the time?
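One way to see what a “yes” answer would look like is to simulate the gaussian model itself. The sketch below is only a check of the math, not of real SunSpider behavior: the baseline (1226 ms), the noise standard deviation (4 ms), and the 5 trials per run are all assumed values, and the interval is the textbook Student’s t interval, which may not be exactly what SunSpider computes.

```python
import random
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom
# (that is, for 5 trials per run).
T_CRIT_95_DF4 = 2.776

def confidence_interval(times):
    """Textbook 95% t interval for the mean of exactly 5 timings."""
    if len(times) != 5:
        raise ValueError("this sketch hard-codes the critical value for 5 trials")
    mean = statistics.mean(times)
    # statistics.stdev divides by n-1 (sample standard deviation).
    sem = statistics.stdev(times) / len(times) ** 0.5
    half_width = T_CRIT_95_DF4 * sem
    return mean - half_width, mean + half_width

def coverage(baseline_ms=1226.0, noise_sd_ms=4.0, runs=20000, seed=1):
    """Fraction of simulated runs whose interval contains the true baseline."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # T = m + E, with E gaussian with mean 0 (assumed noise level).
        times = [baseline_ms + rng.gauss(0.0, noise_sd_ms) for _ in range(5)]
        low, high = confidence_interval(times)
        hits += low <= baseline_ms <= high
    return hits / runs

print(f"simulated coverage: {coverage():.3f}")
```

If the model holds, the printed fraction should land close to 0.95. Replacing the simulated timings with real measurements is what would actually answer the question.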

SunSpider’s significance tests. SunSpider can compare two sets of results, giving output like this:

TEST                   COMPARISON            FROM                 TO             DETAILS

=============================================================================

** TOTAL **:           *1.009x as slow*  1226.2ms +/- 0.3%   1237.3ms +/- 0.8%     significant

=============================================================================

  3d:                  *1.031x as slow*   176.6ms +/- 0.5%    182.1ms +/- 2.0%     significant
    cube:              *1.032x as slow*    43.8ms +/- 1.0%     45.2ms +/- 2.7%     significant
    morph:             ??                  34.5ms +/- 1.1%     34.8ms +/- 2.3%     not conclusive: might be *1.009x as slow*
    raytrace:          *1.039x as slow*    98.3ms +/- 0.7%    102.1ms +/- 3.4%     significant

  access:              ??                 155.2ms +/- 0.5%    156.3ms +/- 1.2%     not conclusive: might be *1.007x as slow*
    binary-trees:      *1.015x as slow*    45.3ms +/- 0.8%     46.0ms +/- 1.5%     significant
    fannkuch:          ??                  67.0ms +/- 0.7%     67.3ms +/- 1.8%     not conclusive: might be *1.004x as slow*
    nbody:             ??                  28.7ms +/- 1.2%     29.0ms +/- 1.2%     not conclusive: might be *1.010x as slow*
    nsieve:            -                   14.2ms +/- 4.0%     14.0ms +/- 2.4%

Differences are reported as “significant”, “not conclusive”, or blank, presumably meaning “not significant”. SunSpider’s “significant” means that the difference is significant at the 0.05 level according to the t test, which brings up more technical bits.

In general, significance tests are meant to separate observed differences that are real (the underlying baseline means are different) from random differences (the means are the same, and only the noise was different). In a basic significance test, we implicitly assume the means are the same, and then look to see if our observations tend to disprove that. If our observations would be very unlikely under that assumption, then we say the difference is significant. (For example, if I think a certain roulette wheel is fair, and it comes up “13” 10 times in a row, I now have good reason to think it’s not fair.) So, to say that two runs are different at the 0.05 level is to say: if the means are really the same, then there was only a 5% chance of getting runs that different. In other words, if I go around doing experiments looking for differences when there aren’t any, I’m going to get p=0.05-significant results 5% of the time.

Since SunSpider has 26 tests, this means if you do 2 runs of the same system, if the significance tests are applicable, you’d probably see 1 or 2 false “significant” differences in the results.
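The “1 or 2” figure can be checked with a few lines of Python. The expected count needs no extra assumptions, but the chance of seeing at least one false “significant” result is computed here under the added assumption that the 26 tests are independent, which real benchmark runs may not satisfy.

```python
def false_positive_summary(num_tests=26, alpha=0.05):
    """Expected false positives, and P(at least one), when no real difference exists."""
    expected = num_tests * alpha
    # Assumes the tests are independent of each other.
    p_at_least_one = 1.0 - (1.0 - alpha) ** num_tests
    return expected, p_at_least_one

expected, p_any = false_positive_summary()
print(f"expected false positives: {expected:.1f}")
print(f"chance of at least one:   {p_any:.2f}")
```

Under those assumptions the expected count is 1.3, and there is roughly a 74% chance of at least one spurious “significant” line in a comparison of identical builds.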

The t test is a significance test that applies when comparing two experiments with the gaussian model introduced above, where each experiment runs N trials and averages them. There’s a bunch of mathematical detail, but the basic idea is fairly simple. The idea is to construct a confidence interval not for a mean in one experiment, but for the difference in means in two experiments, say a 95% confidence interval. If that confidence interval doesn’t contain 0, then we’ve seen a result that’s pretty unlikely (<=5%) if the means are the same. So the difference is significant.

The mathematical formulas used in the t test are just to correct for the fact that we don't know the real standard deviation, just an estimate from our data. They're a more complex analogue to the mysterious "n-1" used in place of "n" in computing sample standard deviations and deriving 1-experiment confidence intervals.
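To make the idea concrete, here is a minimal sketch of the classic pooled two-sample t test, written as a 95% confidence interval for the difference in means. It is not taken from SunSpider’s source, which may use a different variant. The timings are invented for illustration, and the small table of critical values only covers a few sample sizes.

```python
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {4: 2.776, 6: 2.447, 8: 2.306, 10: 2.228, 18: 2.101}

def difference_interval(before, after):
    """95% confidence interval for mean(after) - mean(before), pooled variance."""
    n1, n2 = len(before), len(after)
    df = n1 + n2 - 2
    t_crit = T_CRIT_95[df]  # KeyError if df is not in the small table above
    # statistics.variance uses the n-1 denominator.
    pooled_var = ((n1 - 1) * statistics.variance(before)
                  + (n2 - 1) * statistics.variance(after)) / df
    std_err = (pooled_var * (1.0 / n1 + 1.0 / n2)) ** 0.5
    diff = statistics.mean(after) - statistics.mean(before)
    return diff - t_crit * std_err, diff + t_crit * std_err

def is_significant(before, after):
    """Significant at the 0.05 level if the interval excludes 0."""
    low, high = difference_interval(before, after)
    return not (low <= 0.0 <= high)

# Hypothetical total times in ms, 5 trials each (not real measurements).
before = [1226, 1224, 1229, 1225, 1227]
after = [1238, 1235, 1241, 1236, 1237]
print(difference_interval(before, after))
print(is_significant(before, after))
```

With these made-up numbers the interval sits well above 0, so the slowdown is reported as significant. Comparing a sample against itself gives an interval centered on 0, which is not significant.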

The question for SunSpider is Question 3: If we run a large number of SunSpider comparisons of identical JavaScript engines, does SunSpider say “significant” about 5% of the time?

SunSpider in practice. Our everyday use of SunSpider is through a little script called bench.sh that runs one trial and prints out the total time. The idea is that a developer can run it a few times, make a change to TraceMonkey, and then run it a few more times and see if it seems to be better or worse. It’s supposed to be a quick test, taking 5 or 10 seconds to run and interpret, that can be run after any little change. So I want to know the answer to Question 4: What can we do with 3-5 SunSpider runs? How small a performance change can it reasonably detect? What’s the best way to look at the numbers: average them, take the minimum, or something else?

Another use of SunSpider is to run over time, over our Mercurial history, to look for performance regressions that sneaked in with other changes. For that purpose, the runs will be automatic, so it’s OK if they take a while: we can do a lot of trials. So here, I want to answer Question 5: how many runs are needed to detect the smallest performance change that counts? To make that question real, we also need to know what kind of performance change counts. Theoretically, a 1ms slowdown is bad, because after 100 checkins like that, we’ve slowed down by 100ms, which definitely matters. But we seem to get changes of plus or minus 1-3ms “randomly” with many checkins just because gcc’s optimizer does something a little different. So I figure 5ms or so is a real difference that we’d like to detect.
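For a rough feel for Question 5, the standard normal-approximation formula for comparing two means gives the number of runs needed per build. The sketch below assumes the gaussian model holds, and the 4 ms standard deviation of the total time is a placeholder, not a measured value. The 5% significance level and 80% power are conventional choices, not something SunSpider prescribes.

```python
import math
from statistics import NormalDist

def runs_needed(min_diff_ms, noise_sd_ms, alpha=0.05, power=0.80):
    """Approximate runs per build to detect min_diff_ms with a two-sided test.

    Uses n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / diff)^2, the usual
    normal approximation for comparing two means with equal variances.
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n = 2.0 * ((z_alpha + z_power) * noise_sd_ms / min_diff_ms) ** 2
    return math.ceil(n)

# Assumed noise level of 4 ms; target difference of 5 ms from the text.
print(runs_needed(min_diff_ms=5.0, noise_sd_ms=4.0))
# A 1 ms target with the same assumed noise needs far more runs.
print(runs_needed(min_diff_ms=1.0, noise_sd_ms=4.0))
```

The required number of runs grows with the square of the ratio of noise to target difference, so halving the detectable difference roughly quadruples the cost.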

Summary. I’ve laid out two broad areas I’d like to learn more about. First, I want to know if SunSpider’s statistical claims hold up in reality, and if not, where they deviate. Second, I want to figure out the right number of runs to use, how to combine them, and how small a difference they can detect for a few important practical applications.

In Part II, I’ll start trying to answer these questions.
