If you're interested in trying out new technologies but don't want to blindly follow the latest fads, you probably keep an eye out for substantive reports on the technologies that interest you. In the software world, a lot of claims of merit boil down to performance as demonstrated in benchmark reports. Unfortunately, even when published by reputable firms, these reports can be incomplete, misleading, or flat-out wrong. I was prompted to write this by a recent performance claim from dgraph, but the problem is a longstanding one in the software industry.

The most obvious problem with the dgraph article is that the underlying work was poorly researched: they had performance problems with Facebook's RocksDB, so they decided to write their own database engine. Their resulting project, badger, soundly beat RocksDB's performance, and they decided this was a huge win. The reality, of course, is that RocksDB is an extremely heavyweight, inefficient storage engine, and it's not much of a challenge to beat it, as we've thoroughly demonstrated with LMDB. I left feedback saying as much.

Eventually they got back to me, and I helped them integrate LMDB into their testing framework. The results weren't surprising: LMDB ran rings around both badger and RocksDB. But the interesting part to me was a flaw in their test code, one I had seen a few times before. They populate a DB with, say, 10 million records and then measure the performance of retrieving randomly selected records from that DB. It's a very common benchmarking scenario. Unfortunately, their random number generator is weak and repeats a significant portion of its results; in a typical run, 37% of the generated keys are duplicates. This means they're actually only testing with a DB of 6.3 million records instead of the 10 million they specified. And the subsequent read test covers only 63% of that database, instead of 100% of the actual database.
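
That 37% figure is no accident: drawing N keys uniformly at random from a space of N possible values samples with replacement, so the expected fraction of distinct keys is 1 - (1 - 1/N)^N ≈ 1 - 1/e ≈ 63.2%, which is exactly the 6.3 million they ended up with. Here's a minimal Go sketch - my own illustration, not dgraph's actual benchmark code - that reproduces the effect:

```go
// Illustration of the duplicate-key effect in a populate-then-read benchmark:
// drawing n keys uniformly at random from a space of n values samples with
// replacement, so roughly 1/e (~37%) of the draws repeat earlier keys and
// only ~63% of the intended records ever exist.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n = 10_000_000 // intended record count, as in the dgraph test

	unique := make(map[int]struct{}, n)
	for i := 0; i < n; i++ {
		unique[rand.Intn(n)] = struct{}{} // sampling with replacement
	}

	frac := float64(len(unique)) / float64(n)
	fmt.Printf("unique keys: %d of %d (%.1f%%)\n", len(unique), n, 100*frac)
	// Typically prints about 6.32 million unique keys, i.e. ~63.2%.
}
```

A write phase that generates keys this way silently shrinks the data set, and a read phase that does the same thing never touches the rest of it.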

The Google LevelDB authors made this same mistake in their benchmark code, which I pointed out to them and fixed back in 2012. Lesson #1 - you must pay attention to detail not only when developing your main code, but also when developing your tests. And as a member of the general public reading the test results, you must likewise pay attention to detail to see if the numbers add up, because you can't rely on the authors to have done their due diligence. Here we have two database providers publishing results for tests that didn't actually measure what they claimed to have tested.

Another example of inadequate test design came up in an LDAP benchmark published by Oracle back in 2008, where they made bold claims about testing with 2 billion records, as if it were a world first. (In fact, we had already tested a 5 billion entry directory for another Symas customer earlier that year.) Again, there's an obvious problem: they reported a search rate of 101,800 searches/second across this 2 billion entry DB, but only ran the test for 30 minutes. In 30 minutes, even assuming every random reference was unique, they could have hit at most 183 million records - less than 10% of the specified data set. Lesson #2 - do the math. Don't just take impressive-looking numbers at face value; the picture they paint may be incomplete at best. (I went into greater depth on that Oracle report on the old Symas blog; there were lots of problems with the test protocol they used.)
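
The math here is one multiplication; a quick sketch in Go, using only the figures quoted in their report, shows the most generous possible upper bound:

```go
// Back-of-the-envelope check of the Oracle claim: even if every single search
// hit a distinct entry, a 30-minute run at 101,800 searches/second cannot
// touch more than a small fraction of a 2-billion-entry directory.
package main

import "fmt"

func main() {
	const (
		searchesPerSec = 101_800       // reported search rate
		runSeconds     = 30 * 60       // 30-minute test duration
		totalEntries   = 2_000_000_000 // claimed directory size
	)

	maxHit := searchesPerSec * runSeconds // upper bound: every search unique
	fmt.Printf("at most %d entries touched (%.1f%% of the data set)\n",
		maxHit, 100*float64(maxHit)/float64(totalEntries))
	// Prints: at most 183240000 entries touched (9.2% of the data set)
}
```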

A lot of benchmarks you'll see aren't even reproducible, because the report doesn't provide enough detail about the test environment or how to set it up. But even when the results are reproducible, they're meaningless if the procedures don't actually test what they claim to test. A good report will not only explain the setup in detail, but will also analyze the results and explain why they are what they are. For example, when benchmarking LMDB, we explain exactly why the results are correct and sensible - e.g., that the observed latencies match the hardware's latency specifications.
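
To make that concrete, here's the kind of sanity check a reader can do themselves - a sketch with made-up placeholder numbers, not figures from any actual LMDB report: turn a measured single-threaded throughput back into an implied per-operation latency and compare it against what the hardware is rated to deliver.

```go
// Sanity-checking a benchmark result against the hardware's latency spec.
// The constants below are hypothetical placeholders for illustration only.
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		measuredOpsPerSec = 10_000                 // hypothetical single-threaded random-read rate
		ratedLatency      = 100 * time.Microsecond // hypothetical device spec for one random read
	)

	implied := time.Second / measuredOpsPerSec
	fmt.Printf("implied latency per op: %v (device rating: %v)\n", implied, ratedLatency)
	// If the implied latency is far below what the device can physically
	// deliver, the "disk" benchmark is really measuring cache hits.
}
```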

Conclusion - be careful out there when looking for hard numbers on the latest technology fad. Even when published by a large, well-established firm, the results are often misleading. (Probably not intentionally, either. People switch technologies so quickly that they seldom spend enough time with any single one to understand it thoroughly. All of the problems cited here are obvious cases of people who didn't understand what they were working with.) In this age of get-rich-quick schemes, there are bound to be even more that are intentionally misleading. Caveat emptor.