Summary: JavaScript Benchmarks aren’t adapting well to the rapid increase in JavaScript engine performance. I provide some simple tests for verifying this and propose a modified setup which could be used by all JavaScript Benchmarks to achieve high-quality results.

There now exist what I would consider to be three major JavaScript performance benchmarks, each released by a major browser vendor: WebKit released SunSpider, Mozilla released Dromaeo, and Google released the V8 Benchmark.

Each suite has a variety of tests and a test runner. I’m currently interested in one thing: The quality of the test runner that each of these suites provides.

There are three points that any test runner tries to achieve:

Retrieving accurate numbers. Another way to phrase it: “Retrieving stable numbers.” If you run the test suite multiple times, will you get identical, or near-identical, numbers?

Reducing the possible error. Does the suite attempt to quantify how much possible error there could be in its results? How large is the error?

Reducing run time. How long does it take to run each test?

The ideal suite would be one that’s capable of running an individual test as quickly and accurately as possible, with virtually no error. However, in order to get those numbers you need to carefully choose what style of tests you wish to run.

I divide the current field of tests into three categories:

Slow-running tests: These tests generally take a couple hundred milliseconds on an average consumer machine. This is the style of tests that you generally see in SunSpider and occasionally in Dromaeo. These tests have a distinct advantage: They are generally quite accurate. You don’t need to run them very many times (say about 5) in order to get a consistent picture of how the test will run.

Moderate-running tests: These tests take less than one hundred milliseconds. You’ll see these tests in Dromaeo. These tests need to be run quite a few times (20-30) in order to get a better picture of their speed.

Fast-running tests: These tests take less than ten milliseconds. You’ll see these tests in the V8 Benchmark and in Dromaeo. These tests must be run many, many times (usually close to a thousand) in order to weed out any possible error. If you were to run one 10ms test 5 times and get a single result of 11ms, that alone would introduce a significant level of error into your results. Consider tests that take 0-1ms: a deviation within that range can instantly cause error levels around 50%, if enough test iterations aren’t completed.
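To put rough numbers on this, here’s a small sketch with made-up timings (`relativeError` is a hypothetical helper, not part of any suite):

```javascript
// Spread of a set of timings relative to the best run.
function relativeError(times) {
  const min = Math.min(...times);
  const max = Math.max(...times);
  return (max - min) / min;
}

// A 10ms test with one 11ms outlier: a single noisy run
// produces a 10% spread across only five samples.
console.log(relativeError([10, 10, 10, 10, 11])); // 0.1

// A 1ms test with one 2ms outlier: a single blip doubles the range.
console.log(relativeError([1, 1, 1, 1, 2])); // 1
```

With thousands of iterations instead of five, a single outlier like this would barely move the aggregate.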

Looking at the above categories the solution seems obvious: Use slow-running tests! You get to run them fewer times and you get accurate results – everything’s peachy! But here’s the problem: The speed of JavaScript engines is increasing at a rate faster than test suites can adapt. For example, SunSpider was initially developed with all tests running at equal speeds on a modern computer. Now that speed improvements have come along (WebKit improvements, then Mozilla’s, then SquirrelFish, then TraceMonkey, then V8), those results don’t even remotely resemble the tests of old. Most of the results have moved down into the moderate-running range, some even into the fast-running range, yet the tests are still only running their originally-designed number of loops. An example of the difference:

This means that a browser is running a test for 5-10 loops (in both SunSpider and Dromaeo) but the speed of the test no longer matches that assigned number of iterations. At first glance you could propose one of two fixes: 1) increase the difficulty of the tests or 2) have them run for more iterations. There are problems with both, though.

While you certainly could increase the complexity of existing tests, the result would be a never-ending battle. Improvements would have to land every couple of months just to keep up with the pace of engine improvement. This would work OK in Dromaeo (since it has versioned tests) but not all suites can handle this behavior. Additionally, this makes the tests less useful for measuring slower browsers (the tests would now take an absurd amount of time to complete, as opposed to the very plausible numbers from before).

Alternatively, you could increase the number of test iterations, but not without assigning specific iteration counts to each individual test. And this is the crux of the problem: How do you know what numbers to choose? Raising the number to 20 iterations may help one browser – but what about another browser, which will need 100 iterations to get a proper count?

This leaves us in a bind: browsers keep getting faster at the tests while test suites run a fixed, and increasingly wrong, number of iterations, causing the error level to continually increase:

We should take a step back and look at what the test suites are doing to counteract the above trend – if anything at all.

SunSpider was originally dominated by long-running tests, each run 5 times. The tests used to be long-running but are now only in the moderate to fast-running range (depending on the browser). This has caused accuracy to decrease and the error level to increase. Increasing the number of iterations would help (but would hinder performance on older browsers).

Dromaeo has a range of tests (fast, moderate, and long-running), each running 5-10 times. Dromaeo attempts to correct the number of iterations run, right now, but falls short in doing so. It looks at the results of past iterations (especially the error level they generate) and decides to run more tests until a stable error level is achieved. The problem with this is that the samples are no longer independently determined. Whereas test runs 1-5 were independent, test runs 6-10 were not (they were only run because the previous runs provided poor results). So while the results from Dromaeo are hyper-stable (it’s the most stable performance test that we run at Mozilla), they’re not determined in a statistically proper manner. Thus Dromaeo needs to change in order for people to be able to gather accurate results without sacrificing statistical integrity.

The V8 Benchmark takes a completely different strategy for its fast-running tests: Instead of running for a pre-determined number of iterations each test is run continuously until a second of time has passed. This means that individual tests frequently run anywhere from 50-200 times (depending on the machine and browser they run on). Currently the V8 Benchmark does suffer from one shortcoming: There is no error calculation done. Both SunSpider and Dromaeo fit the results to a t-distribution and compute the possible error of the results whereas the V8 Benchmark just leaves them as is.
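As a rough sketch of what that error computation involves (the helper name and hardcoded critical value here are my own; the suites’ actual implementations differ), fitting five timings to a t-distribution looks something like this:

```javascript
// Two-sided 95% critical value of the t-distribution for
// 4 degrees of freedom (n = 5 samples).
const T_95_DF4 = 2.776;

function errorMargin(times) {
  const n = times.length;
  const mean = times.reduce((sum, t) => sum + t, 0) / n;
  // Sample variance (n - 1 in the denominator).
  const variance =
    times.reduce((sum, t) => sum + (t - mean) * (t - mean), 0) / (n - 1);
  // Half-width of the 95% confidence interval around the mean.
  const margin = T_95_DF4 * Math.sqrt(variance) / Math.sqrt(n);
  return { mean, margin, percent: 100 * margin / mean };
}

// Five hypothetical timings (ms) for one test:
const result = errorMargin([102, 98, 100, 101, 99]);
// result.mean is 100; result.margin is about 1.96, i.e. roughly ±2% error
```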

However, the V8 Benchmark does bring up a very interesting strategy. By picking tests that are simpler (and, arguably, most current complex tests will become “simple” as engines rapidly improve) and running them more times (relative to the complexity of the test) the results become much more stable.

Consider the result: Fast-running tests end up having a smaller error range because they are able to run more times within a given allotment of time. This means that the test runner is now self-correcting (seamlessly adjusting the number of iterations). Since all JavaScript engines are getting faster and faster, and complex tests are running in shorter and shorter amounts of time, the logical conclusion is to treat all tests as if they were fast tests and to run them a variable number of times.
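The contrast between the two runner styles can be sketched like so (an illustration only; neither function is code from any of the suites):

```javascript
// "Fixed" style (SunSpider/Dromaeo): run a preset number of
// iterations and report how long they took. As engines get faster
// the elapsed time shrinks and timer noise dominates the result.
function fixedStyle(test, iterations) {
  const start = Date.now();
  for (let i = 0; i < iterations; i++) test();
  return Date.now() - start; // elapsed milliseconds
}

// "Max" style (V8 Benchmark): run until a time budget is exhausted
// and report how many iterations completed. As engines get faster
// the count grows, so precision improves instead of degrading.
function maxStyle(test, budgetMs) {
  const start = Date.now();
  let n = 0;
  while (Date.now() - start < budgetMs) {
    test();
    n++;
  }
  return n; // iterations completed
}
```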

We can test this hypothesis. I’ve pulled together a simple demo that tests the computational accuracy of the three styles of test (simple, i.e. fast-running; moderate; and complex, i.e. slow-running) against the two different styles of suite: “Max” style (the style implemented by the V8 Benchmark) and “Fixed” style (the style implemented by SunSpider and Dromaeo). The results are quite interesting:

Fast and moderate-speed tests are incredibly accurate with the max style of test running. Often their error is no more than 2% (which is really quite acceptable). The quality degrades for the complex tests, but that’s OK: complex tests could be tweaked to consume fewer resources (thus allowing them to iterate more times and become more accurate).

Right now the accuracy of slow-to-moderate tests in fixed-run test suites is suffering from increased rates of error as the engines get faster (as can be seen in the above chart).

The major change that would need to be made to the flexible style of execution (at least the one implemented in the V8 Benchmark) is some form of error checking and t-distribution fitting. In V8 a series of tests is run over the course of one second. I propose that the resulting number (the total number of test iterations executed) be saved and that the tests be run again repeatedly (perhaps 5 times or so). The resulting set of 5 numbers (representing, potentially, thousands of test runs) would then be analyzed for quality and error level. I have this measure completely mocked up in my sample test measurement.

The relevant portion of the code is here:

```javascript
function runTest(name, test, next){
	var runs = [], r = 0;

	setTimeout(function(){
		var start = (new Date).getTime(), diff = 0;

		// Run the test continuously until one second has passed,
		// counting how many iterations complete in that time.
		for ( var n = 0; diff < 1000; n++ ) {
			test();
			diff = (new Date).getTime() - start;
		}

		runs.push( n );

		// Collect five one-second samples, then report the results.
		if ( r++ < 4 )
			setTimeout( arguments.callee, 0 );
		else {
			done(name, runs);
			if ( next )
				setTimeout( next, 0 );
		}
	}, 0);
}
```

Switching to a flexible style of test execution has many benefits:

It’ll help to keep the rate of error small for most tests (especially if the tests are tailored to be smaller in nature).

It’ll allow the suite to gracefully adapt over time, as engines speed up, without sacrificing the ability to work on old engines.

The maximum number of runs (and, thus, the maximum amount of accuracy) will occur for the engines that are able to complete the tests faster. Since the greatest amount of competition is occurring in these high numbers (as opposed to in older user agents, where there is no progress being made) granting them the most accurate numbers possible makes this ideal.
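One piece the runTest() code above leaves undefined is done(), which receives the five per-second iteration counts. Here’s a hypothetical version that applies the t-distribution analysis proposed earlier (the critical value and output format are just placeholders):

```javascript
// A hypothetical done() for runTest(): `runs` holds five iteration
// counts, each gathered over one second of execution.
function done(name, runs) {
  const n = runs.length;
  const mean = runs.reduce((sum, r) => sum + r, 0) / n;
  // Sample variance (n - 1 in the denominator).
  const variance =
    runs.reduce((sum, r) => sum + (r - mean) * (r - mean), 0) / (n - 1);
  // 2.776 is the two-sided 95% t value for 4 degrees of freedom (n = 5).
  const error = 2.776 * Math.sqrt(variance) / Math.sqrt(n);
  console.log(name + ": " + mean + " runs/second ±" +
    (100 * error / mean).toFixed(1) + "%");
  return { mean, error };
}

// e.g. done("my-test", [100, 102, 98, 101, 99])
// reports roughly "my-test: 100 runs/second ±2.0%"
```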

I plan on prototyping this setup in Dromaeo and releasing it shortly. It’ll take much longer to run the full test suite but the result will be well worth it.