This blog is about Data Warehouses. Namely, it’s a critique of how technologies are being evaluated today in the analytics industry. What better way to do this than to resurrect a cliche and draw a comparison with sports!

How the NBA evaluates talent

Every spring, the top basketball prospects in the world prepare for the NBA draft. Athletes attend workouts designed to quantify ability and talent. They are put in challenging situations that replicate real-world scenarios, and their performance is observed. The key is to measure performance under conditions not present in a typical shootaround: the effects of a loud crowd, an aggressive physical defense, the pressure of a big moment, late-game fatigue, injuries, and so on.

Only secondarily is an athlete’s potential quantified by his or her ability to shoot free throws, layups, and three-pointers in an open gym under no duress. Imagine a world in which this was the only universally accepted way to evaluate athletes — players would practice nothing but free throws, layups, and three-pointers! After a few decades, the system would be perverted toward players who are likely mediocre basketball players, but are very good at shooting free throws, layups, and three-pointers in an open gym. The NBA would become a snooze!

Sorry Stephen Curry, your situational awareness, legendary performance under pressure, stamina, and on-court leadership are nice, but you’re just really not empirically good at basketball! Yes, LeBron James, there’s never been an athlete like you, but can you really hang in the NBA with THAT free throw metric? No, Michael Jordan, we don’t have room for a shooting guard who does tomahawk dunks instead of layups.

How Data Warehouses are evaluated

So why are our Data Warehouse evaluations stuck in the old theoretical cookie-cutter world? Today’s TPC-H and TPC-DS benchmarks are the data warehouse equivalent of shooting free throws, layups, and three-pointers in an open gym. They do a fantastically mediocre job of telling technologists what’s actually going to happen in the real world. They don’t measure data warehouse stamina, fatigue, court vision, or performance under pressure!

Worse yet, Data Warehouses have had decades to grease the skids on the exact patterns TPC-* benchmarks look for, at the expense of overall resiliency and practical benefit. The incentives are perverse indeed — given a choice between improving overall performance by, say, 10% and improving TPC-DS-relevant benchmarks by 20%, it’s in the vendor’s best interest to forgo customer benefit and aim for the benchmark!

Data Warehouses are uniquely diverse in several ways:

A data warehouse typically runs a number of competing workload types (for example, ELT, BI, ad-hoc, and ML), each with its own idea of maximized utility and its own processing SLOs. It’s hard to reason about a data warehouse as a whole, as opposed to each workload type individually.

Data warehouses are exceedingly diverse in their architectures. Scale-out, scale-up, no-scale; denormalization, normalization. Columnar and in-memory; open and closed. Stateless, serverless, or server-bound. Et cetera.

Pent-up demand for analytics is challenging to measure, and is rarely fully realized. Different data warehouses treat pent-up demand differently.

Data Warehouse storage systems especially behave differently under different conditions:

Most will struggle at high IOPS and throughput demands.

Most will struggle while Terabytes of data are batch-ingested into the system.

All will struggle while millions of rows are streamed into the system (if streaming is supported at all).

Some will struggle when their scratch space gets low.

With most, performance will degrade over time, and storage maintenance becomes a necessity — in the form of vacuuming, index rebuilds, or re-clustering.

Data warehouses that rely on local storage will inevitably be limited in local storage scalability, and will face a performance or functionality cliff.

Data warehouses often require compute to scale with storage.

It’s nearly impossible to uncover these rough edges in a basic evaluation, especially with services that don’t make a habit of openly documenting their quotas and limitations.
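One of these rough edges — performance degrading as data accumulates — is easy to probe yourself. Below is a minimal sketch of such a "fatigue" probe: it times the same aggregation query after each ingest batch and reports the trend. SQLite is used purely as a self-contained stand-in; the table name and batch sizes are made up for illustration, and in practice you would point the same loop at your warehouse’s own driver and a representative query.

```python
# Sketch: time the same query as data volume grows, to observe degradation.
# SQLite stands in for the warehouse; swap in your own DB-API connection.
import random
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")

def timed_query():
    """Run one aggregation and return its wall-clock time in seconds."""
    start = time.perf_counter()
    conn.execute(
        "SELECT user_id, SUM(amount) FROM events GROUP BY user_id"
    ).fetchall()
    return time.perf_counter() - start

timings = []
for batch in range(5):
    # Ingest one batch of 50,000 synthetic rows, then re-time the query.
    rows = [(random.randrange(1000), random.random()) for _ in range(50_000)]
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    conn.commit()
    timings.append(((batch + 1) * 50_000, timed_query()))

for total_rows, seconds in timings:
    print(f"{total_rows:>8} rows: {seconds * 1000:.1f} ms")
```

Plotting the timings against row counts — or extending the loop to run a vacuum or re-cluster between batches — turns this into a crude but honest stamina test.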

In short, just as with athlete evaluations, you won’t really know how your data warehouse works by looking at basic TPC-style benchmarks. None of the typical idiosyncrasies are captured, and many Data Warehouses are specifically optimized to look good in such scenarios, at the expense of real-world capability.

A modest suggestion

Perhaps the industry should evaluate data warehouses the way the NBA looks at prospects — by putting them through the wringer and trying to figure out what the technology does under adverse conditions. How about shooting free throws after 3 hours of high-intensity 5-on-5 drills?

So what are we proposing here? Let’s create an evaluation with some fatigue, some pressure, and perhaps even a nagging injury or two! For example, one could evaluate query performance of a data warehouse:

Free throws — under static conditions, with minimal concurrency (status quo)

Sprints — under increased concurrency

3 vs 3 drills — while data is being batch-ingested into the data warehouse storage system, to test how well the storage system withstands real-world read/write contention, as well as the consumption of resources used for ingest.

5 vs 5 drills — while data is streamed into the data warehouse storage system, for nearly the same reasons as the above.

Fatigue — after a significant amount of data has been ingested into the storage system, to test how data locality degrades with time and its impact on performance.

Pressure — during routine maintenance, if you’re lucky enough to have a data warehouse that doesn’t require a maintenance window.

Court vision — while storage optimization processes are being run: vacuum, index rebuild, automatic re-clustering, compaction, and storage resizing.

Now, of course we’re not saying that benchmarks are entirely useless. They do provide a useful baseline to draw from. However, many industry observers have taken these benchmarks to mean far more than that. You can’t blame them — it is much easier to evaluate data warehouses in a controlled fashion than in conditions that approximate real-world scenarios!

Edit: Mosha Pasumansky, one of the lead engineers on BigQuery, pointed out that TPC guidelines do include measuring query performance while simultaneous inserts, updates, and deletes are in progress. This only aids our argument that folks should be doing these types of tests, versus cookie-cutter, nothing-changes, single-concurrency power runs!

Edit2: Mosha Pasumansky is the Principal Engineer for BigQuery and BI Engine. I am terribly sorry for the honest mistake :)