By Richard Heyns: In a recent benchmark by Mark Litwintschik, Brytlyt was the fastest of all the databases tested. Mark evaluates databases using his 1.1 billion row taxi dataset. When the benchmark was made public, we started getting a lot of questions from Apache Spark users. After all, Apache Spark is one of the best-known and most popular data processing engines on the market, and it is well liked for what it can do.

I wanted to share some insights on how Brytlyt ended up being between 190 and 1,200 times faster than Spark and why this is relevant to analysts today.

At first glance this looks very impressive, but the results can only be analysed properly once the benchmark itself is understood.

The core difference between Spark and Brytlyt

Spark was developed in response to limitations in Hadoop, which reads input data from disk. Spark improves on this by using a form of distributed shared memory which in turn has a massive impact on performance.

Brytlyt takes this in-memory approach a step further and uses Graphics Processing Units (GPUs) to add a further performance boost. A GPU server like the one used in the benchmark can read data at 3.8 terabytes per second, while standard computer memory can only be accessed at around 60 gigabytes per second, which gives an idea of the difference GPU hardware can make to query performance. The difference in hardware bandwidth alone points to a performance improvement in the order of 60x.
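As a quick sanity check on the 60x figure, the ratio of the two bandwidths quoted above can be worked out directly. This is only a sketch of the arithmetic: the constant and function names are made up for illustration, and decimal units (1 terabyte = 1,000 gigabytes) are assumed.

```python
# Bandwidth figures as quoted in the text; decimal units are an assumption.
GPU_BANDWIDTH_GB_S = 3.8 * 1000  # 3.8 terabytes per second, in gigabytes
CPU_BANDWIDTH_GB_S = 60.0        # around 60 gigabytes per second


def bandwidth_ratio(gpu_gb_s: float, cpu_gb_s: float) -> float:
    """Return how many times faster the GPU memory can be read."""
    return gpu_gb_s / cpu_gb_s


ratio = bandwidth_ratio(GPU_BANDWIDTH_GB_S, CPU_BANDWIDTH_GB_S)
print(f"GPU memory bandwidth is roughly {ratio:.0f}x higher")
```

The result lands a little above 60, which is consistent with the "in the order of 60x" estimate. It says nothing about query speed on its own, only about how quickly data can be read.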

Why performance matters

In a recent survey of analysts, 30% said they were constrained by slow performance. Because so much time is taken up preparing data, only three days a month are actually spent mining for insight. Incredibly, almost 40% of insight takes more than a week to complete.

For those of you who are not analysts, consider your Google experience. Two things contribute to the value Google has for the average user: an accurate result and a close to instantaneous response. Imagine what it would be like if, instead of responding instantly, Google took one minute to return an answer to a search. You would probably still use Google, but the way you use it and the value gained would be very different. Now think of the reality for most analysts, which is that queries take tens of minutes, if not hours, to complete.

It is easy to understand why everyone wants a faster, more responsive user experience. All things being equal, it is the cost and effort to deliver that experience that should be evaluated.

Background on queries and data used in the benchmark

Mark’s benchmark runs four aggregation queries on a 1.1 billion row table containing New York taxi pickup and drop-off data. In brief, the amount of data used is significant though not huge. The complete details of the benchmark and the query types can be found in Mark’s own write-up.

In my opinion, the queries are relatively simple, although they do run large aggregations and grouping on high row counts. Also, there is only a single table so joining performance is not evaluated. Joins would be an interesting addition to this comparison as a common issue with Spark is poor join performance.
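To give a feel for the shape of query being discussed, here is a minimal sketch of a grouped aggregation over a single table. It is not one of Mark's actual queries: the table name, column names and sample rows are all hypothetical, and SQLite stands in for the real engines purely so the example runs anywhere.

```python
import sqlite3

# Hypothetical miniature stand-in for the single 1.1 billion row table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (cab_type TEXT, passenger_count INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?)",
    [
        ("yellow", 1, 12.5),
        ("yellow", 2, 20.0),
        ("green", 1, 7.5),
        ("yellow", 1, 17.5),
        ("green", 3, 30.0),
    ],
)


def count_by_cab_type(connection):
    """Count rows per group: a simple aggregation with a GROUP BY."""
    sql = "SELECT cab_type, COUNT(*) FROM trips GROUP BY cab_type ORDER BY cab_type"
    return connection.execute(sql).fetchall()


def avg_amount_by_passengers(connection):
    """Average a numeric column per group."""
    sql = (
        "SELECT passenger_count, AVG(total_amount) FROM trips "
        "GROUP BY passenger_count ORDER BY passenger_count"
    )
    return connection.execute(sql).fetchall()


print(count_by_cab_type(conn))
print(avg_amount_by_passengers(conn))
```

Both queries scan every row and touch only one table, which is why a benchmark built on this kind of query measures scan and aggregation speed but not join performance.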

Brytlyt was between 190 and 1,200 times faster depending on the query. Two GPU machines (p2.16xlarge) were used for the Brytlyt cluster, while eleven machines (m3.xlarge) were used for the Spark cluster. Everything was run on Amazon Web Services, and the total hardware cost was $2.93 per hour for Spark compared to $28.80 per hour for Brytlyt.

The results with query run time in seconds

              Q1       Q2       Q3       Q4       Instance Type   Instance Cost ($/hr)   Machine Count   Total Cost ($/hr)
Brytlyt       0.009    0.011    0.103    0.188    p2.16xlarge     14.4                   2               28.8
Spark         10.79    8.134    19.924   85.942   m3.xlarge       0.266                  11              2.926
Improvement   1,199    739      193      457                                                             9.84
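The improvement figures in the results above can be reproduced from the run times and instance prices. The following is a small sketch of that arithmetic; the variable names are illustrative only, and the numbers are copied from the results.

```python
# Query run times in seconds, copied from the results above.
brytlyt_seconds = {"Q1": 0.009, "Q2": 0.011, "Q3": 0.103, "Q4": 0.188}
spark_seconds = {"Q1": 10.79, "Q2": 8.134, "Q3": 19.924, "Q4": 85.942}

# Speed improvement per query: Spark run time divided by Brytlyt run time.
speedup = {q: spark_seconds[q] / brytlyt_seconds[q] for q in brytlyt_seconds}

# Total hourly cost is the instance price multiplied by the machine count.
brytlyt_cost_per_hour = 14.4 * 2   # two p2.16xlarge machines
spark_cost_per_hour = 0.266 * 11   # eleven m3.xlarge machines
cost_ratio = brytlyt_cost_per_hour / spark_cost_per_hour

for query, factor in speedup.items():
    print(f"{query}: {factor:,.0f}x faster")
print(f"Brytlyt hardware costs {cost_ratio:.2f}x more per hour")
```

Note that the last figure in the improvement row is a cost ratio rather than a speed-up: it shows how much more the Brytlyt hardware costs per hour.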

Coming out faster and cheaper after normalizing results using cost

Today, analytics is often automated, with a broad range of scenarios being evaluated. Both absolute performance and the cost of running through all the scenarios are now more important than ever. In other words, analysts want the holy grail of getting their answers immediately without having to pay through the nose.

As you may agree, achieving great absolute performance is relatively easy when it comes to solutions that can scale out horizontally. Typically, the runtime can be cut in half simply by doubling the amount of hardware.

But the way to do an apples-to-apples comparison is to also look at the cost. Using cost to normalize performance also gets around the problem of comparing a GPU-based solution like Brytlyt to a traditional CPU-based solution like Spark.

In this benchmark, the Brytlyt cluster uses hardware that is 9.84 times more expensive per hour. This ratio can be used to normalize the results. Assuming Spark scales linearly with hardware, even if the Spark cluster were running 108 machines (11 × 9.84, so that both clusters cost the same per hour) against the two Brytlyt GPU servers, Brytlyt would still be between 20x and 122x faster.
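The normalization can be sketched in a few lines. It rests on one simplifying assumption, which is that Spark's run time shrinks in direct proportion to the hardware added; real clusters rarely scale that cleanly, so this is a best case for Spark. The names below are illustrative only.

```python
# Published figures from the benchmark results.
raw_speedup = {"Q1": 1199, "Q2": 739, "Q3": 193, "Q4": 457}
COST_RATIO = 9.84           # Brytlyt hourly cost divided by Spark hourly cost
SPARK_MACHINES = 11


def normalize(speedup: float, cost_ratio: float) -> float:
    """Speed-up left over once Spark is given hardware of equal hourly cost.

    Assumes perfectly linear scaling: cost_ratio times more Spark hardware
    makes Spark cost_ratio times faster.
    """
    return speedup / cost_ratio


# Number of Spark machines that would cost the same per hour as Brytlyt.
equal_cost_machines = round(SPARK_MACHINES * COST_RATIO)
normalized = {q: normalize(s, COST_RATIO) for q, s in raw_speedup.items()}

print(f"Equal-cost Spark cluster: {equal_cost_machines} machines")
for query, factor in normalized.items():
    print(f"{query}: {factor:.0f}x faster at equal cost")
```

The smallest and largest normalized values correspond to the 20x and 122x figures quoted above.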

Quite simply, the more questions that are asked, the greater the value of the insight produced, which is why absolute query performance along with query cost is so important. Where the value generated is directly correlated with the number of queries run, Brytlyt is not only significantly faster than Spark, it also comes out significantly cheaper.

In conclusion, the queries and the dataset used in this benchmark are about evaluating performance. In reality, analytics demands queries that are far more complex and often include joining tables of information together. The MapReduce paradigm on which Spark is based is not suited for fast join performance and I would be interested in evaluating queries that include joins in follow-up benchmarking.