There were two events recently that made me quite concerned.

First, I was looking through some of the results from the Dromaeo test suite and I noticed a bunch of zero millisecond times being returned from tests. This was quite odd since the tests should’ve taken at least a couple of milliseconds to run, and getting consistent times of “0” is implausible, especially for non-trivial code.

Second, I was running some performance tests in the SlickSpeed selector test suite on Internet Explorer and noticed the result times fluctuating drastically. When you’re trying to figure out whether a change you’ve made is beneficial, it’s incredibly difficult to have the times shift by 15–60ms on every page reload.

Both of these cases set me out to do some investigating. All JavaScript performance-measuring tools utilize something like this to measure their results:

```javascript
var start = (new Date).getTime();
/* Run a test. */
var diff = (new Date).getTime() - start;
```

The exact syntax differs but the crux of the matter is that they’re querying the Date object for the current time, in milliseconds, and taking the difference to get the total run time of the test.

There are a lot of extenuating circumstances in play every time a piece of code is run. There could be other things running in another thread, or another process may be consuming more resources – whatever the cause, the total run time of a test can fluctuate. How much it fluctuates is largely consistent, falling roughly along a normal distribution:

(Performance test suites like SunSpider and Dromaeo use a T-distribution to get a better picture of the distribution of the test times.)

To better understand the results I was getting I built a little tool that runs a number of tests: running an empty function, looping 10,000 times, querying and looping over a couple thousand divs, and finally looping over and modifying those divs. I ran all of these tests back-to-back and constructed a histogram of the results.
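The tool described above can be sketched roughly like this (the function name and bin size are my own choices, not the original tool’s): time a test function over many runs and tally the results into millisecond bins.

```javascript
// Time a test function `runs` times and bin the measured durations
// into a histogram keyed by milliseconds. On a browser whose timer only
// updates every ~15ms, the bins cluster at multiples of 15.
function histogram(test, runs) {
  var bins = {};
  for (var i = 0; i < runs; i++) {
    var start = (new Date).getTime();
    test();
    var diff = (new Date).getTime() - start;
    bins[diff] = (bins[diff] || 0) + 1;
  }
  return bins;
}
```

Plotting the bin counts for each browser is what produces the distributions shown below.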

Here’s what the results look like for the major browsers on OS X:

The results here are terrific: There’s some clumping around 0ms (with some results spread to 1-4ms – which is to be expected) and a bunch of normal-looking distributions for each of the browsers at around 7ms, 13ms, and 22ms. This is exactly what we should expect, nothing out of the ordinary taking place.

I then fired up VMware Fusion to peek at the browsers running in Windows XP:

Huh. The results are much stranger here. There aren’t any immediately obvious clumps of results. It looks like Firefox 3 and Chrome both have a nice distribution tucked in there amongst the other results, but it isn’t completely obvious. What would happen if we removed those two browsers to see what the distribution looked like?

Wow. And there it is! Internet Explorer 8 (I also tested 6, for good measure, with the same results), Opera, Safari, and WebKit Nightly all bin their results. There is no ‘normal distribution’ whatsoever. Effectively these browsers are only updating their internal getTime representations every 15 milliseconds. This means that if you attempt to query for an updated time it’ll always be rounded down to the last time the timer was updated (which, on average, will have been about 7.5 milliseconds ago).
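You can observe this granularity directly with a small probe (a sketch, not part of any of the suites mentioned): spin in a tight loop until the reported time changes, and the delta is the smallest increment the clock can report.

```javascript
// Measure the effective resolution of the Date-based timer by
// busy-waiting until the reported time changes. On the affected
// Windows browsers this returns ~15; elsewhere it is typically 1.
function timerResolution() {
  var start = (new Date).getTime();
  var now = start;
  while (now === start) {       // wait for the clock to tick over
    now = (new Date).getTime();
  }
  return now - start;           // size of one clock tick, in ms
}
```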

I was worried that these results were from the virtual machine (I also loaded up Parallels but saw similar results to running VMware) so I just loaded Windows XP proper:

Nope, the results are the same as using the VM.

Let’s think about what this means, for a moment:

- Any test that takes less than 15ms will always round down to 0ms in these browsers. With consistently zeroed-out results it becomes impossible to determine how much time the tests are actually taking.
- The error rate for any test run in these browsers is huge. A simple test that runs in under 15ms carries a whopping 50–750% error rate!
- You would need tests running for at least 750ms before the timer error introduced by the browser drops to 1%.

That’s insane, to say the least.
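The error figures above follow from simple arithmetic: a timer that only updates every 15ms is, on average, 7.5ms stale when you read it, so the expected relative error for a test of true duration t is 7.5 / t. A quick check (helper name is mine):

```javascript
// Expected relative error, in percent, of a single measurement taken
// with a timer that only updates every `tick` milliseconds: on average
// the reading is tick/2 ms stale.
var tick = 15;
function relativeError(trueMs) {
  return (tick / 2) / trueMs * 100;
}
// relativeError(15)  → 50   (a 15ms test: 50% error)
// relativeError(1)   → 750  (a 1ms test: 750% error)
// relativeError(750) → 1    (need ~750ms to get down to 1%)
```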

What test suites are affected by this? Nearly all of the major ones. SunSpider, Dromaeo, and SlickSpeed are all predominantly populated by tests that’ll be dramatically affected by the error rate introduced by these browser timers.

I talked about JavaScript Benchmark Quality before and the conclusion that I came to still holds true: the technique of measuring tests used by SunSpider, Dromaeo, and SlickSpeed does not hold. Currently, only a variation of the style used by Google’s V8 Benchmark is sufficient to reduce the error, since the tests are run in aggregate for at least one second – cutting the error level to less than 1%.
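A minimal sketch of that aggregate style (the function name and threshold parameter are illustrative, not V8’s actual harness): instead of timing one run, repeat the test until at least a second of wall-clock time has elapsed, then divide by the run count. A 15ms timer contributes under 1% error to a 1000ms aggregate.

```javascript
// Run `test` repeatedly until at least `minTimeMs` of wall-clock time
// has elapsed, then report the mean time per run. Coarse timer
// granularity is amortized across all of the runs.
function benchmark(test, minTimeMs) {
  var runs = 0;
  var start = (new Date).getTime();
  var elapsed = 0;
  while (elapsed < minTimeMs) {
    test();
    runs++;
    elapsed = (new Date).getTime() - start;
  }
  return elapsed / runs;   // mean time per run, in ms
}
```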

All of this research still left me in a rough place, though. While I now knew why I was getting bad results in Dromaeo I had no solution for getting stable times in Internet Explorer. I did a little digging, tried a couple more solutions, and stumbled across ies4osx. Ies4osx is a copy of Internet Explorer 6 running in Wine, running in X11, on OS X. It works ‘ok’, although I’ve been able to get it to crash every so often. Disregarding that, though, it’s stable enough to do testing on.

Running the numbers on it yielded some fascinating results:

ies4osx provides some surprisingly stable results – we even have something that looks like a normal distribution! This is completely unlike the normal version of IE 6/8 running on Windows. It’s pretty obvious that the Wine layer is tapping into some higher-quality timer mechanism and is providing it to IE – giving us a result that is even more accurate than what the browser normally provides.

This is fantastic and it’s dramatically changed my personal performance testing of Internet Explorer. While I’m not keen on using anything less than “IE running on XP with no VM” for actual testing, this source of more detailed numbers has become invaluable for testing the quality of specific methods or routines in IE.

In Summary: Testing JavaScript performance on Windows XP (Update: and Vista) is a crapshoot, at best. With the system times constantly being rounded down to the last queried time (each about 15ms apart) the quality of performance results is seriously compromised. Dramatically improved performance test suites are going to be needed in order to filter out these impurities, going forward.

Update: I’ve put the raw data up on Google Spreadsheets if you’re interested in seeing the full breakdown.