Monday, November 28, 2016 at 8:56AM

This is a guest repost Barzan Mozafari, an assistant professor at University of Michigan and an advisor to a new startup, snappydata.io, that recently launched an open source OLTP + OLAP Database built on Spark.

Almost everyone these days is complaining about performance in one way or another. It’s not uncommon for database administrators and programmers to constantly find themselves in a situation where their servers are maxed out, or their queries are taking forever. This frustration is way too common for all of us. The solutions are varied. The most typical one is squinting at the query and blaming the programmer for not being smarter with their query. Maybe they could have used the right index or materialized view or just re-write their query in a better way. Other times, you might have to spin up a few more nodes if your company is using a cloud service. In other cases, when your servers are overloaded with too many slow queries, you might set different priorities for different queries so that at least the more urgent one (e.g., CEO queries) finish faster. When the DB does not support priority queues, your admin might even cancel your queries to free up some resources for the more urgent queries.

No matter which one of these experiences you’ve had, you’re probably familiar with the pain of having to wait for slow queries or having to pay for more cloud instances or buying faster and bigger servers. Most people are familiar with traditional database tuning and query optimization techniques, which come with their own pros and cons. So we’re not going to talk about those here. Instead, in this post, we’re going to talk about more recent techniques that are far less known to people and in many cases actually lead to much better performance and saving opportunities.

To start, consider these scenarios:

Scenario 1 (Exploratory Analytics). You’re an analyst slicing and dicing your data looking for insight, patterns or testing various hypotheses you have about your business, customers, or services. In these situations, you don’t usually know what you’re exactly going to see. You run a query, look at the results, and then decide if you need to run a different query. In other words, you engage in a series of exploratory and adhoc queries until you find what you want. Only some of the queries that you run are going to be used to populate a company report, fill out a form, or generate a plot for your customer. But each time you submit a query, depending on your data size and the number of other concurrent queries in the cluster, you may have to wait several minutes to get an answer. While waiting, you’re usually idle and may not be able to decide on your next query because it depends on the output of your previous query, which you are still waiting on!

Solution: While waiting, you can immediately see an “almost perfect” answer. What do I mean by an “almost perfect” answer? Compare the two plots below.

These are the outputs of the same BI tool after running a query that loads data from a backend database. The one on the right takes 71 minutes to finish as it’s using all 1B data points, whereas the one on the left uses only 1M data points but finishes in only 3 seconds! Of course the plot on the left is slightly blurry compared to the final version on the right. But think about it: isn’t it worth it? Instead of waiting 71 mins, you get an immediate answer that is almost identical and then can decide if you want to wait another 71 mins for the full answer or just move on with your life! I can’t imagine anyone not being able to decide whether the full version is worth the wait by looking at the 3-sec version!

Is this is a new idea: certainly not! In fact, all web-browsers already do this. Next time you try to load a high-resolution image on your browser pay attention to how your web browser first tries to load and display a rough image which gradually gets finer and finer. But applying the same principle in a database and SQL query processing is far less known.

So you might have a few questions at this point: how can we achieve such a speedup in practice? Does this work even if your data distribution is skewed? Do you still see the outliers? Do you need to use a specific database to enjoy this type of tradeoff between speed and accuracy? I will hopefully answer all these questions towards the end of this post, but I want to first tell you a few more scenarios in which you might find the same idea appealing: seeing an immediate result that is 99.9% accurate but 200x faster!

Scenario 2 (Overloaded Cluster). Like most database users today, you may not have a dedicated cluster to yourself. In other words, you’re sharing a cluster with your team, or even other reporting and BI tools that fire up SQL queries to the same shared pool of database resources. When these shared clusters are overloaded, only one of three things can happen in practice:

A. Global frustration. You do nothing, and let everyone suffer equally. In other words, once the database queues are backed-up and the CPU cores are maxed out, no one can get their queries finished fast enough.

B. Partial frustration. You can do something smarter by terminating or suspending lower-priority queries to let the more urgent queries (e.g., those going to your manager!) finish first. In other words, you keep a few more important guys happy but make everyone else angry!

C. If these situations happen often enough, you may buy more and bigger servers or migrate to the cloud and spin up more nodes on demand. Of course this option costs $$$ and is inconvenient. Plus, it’s usually not a short-term solution.

One thing that most people don’t realize is that there is a forth option that is better than the first two options and unlike the third option doesn’t cost you money either. So what’s that option? That forth option is the following:

Return 99.9%-accurate answers for low-priority queries, and 100% for the rest! Again, this is possible because according to laws of statistics you can usually get a 99.9% accurate answer by using only 0.1% of the data and processing. This is why sacrificing 0.1% accuracy can mean 100-200x speedup in practice. I know no one wants to be the one settling for using 99.9% accuracy, but you need to consider the other options: having your query terminated, put on a long waitlist, or staying idle waiting for your query to finish. As I mentioned under Scenario 1, in many cases you can still get the full answer and just use the 99.9%-accurate answer while you’re waiting for the full answer. I will get back to the “how” question at the end of this post. But for now, remember that 99.9% accuracy doesn’t mean you miss 0.1% of your output. You still see everything usually but they actual numbers might be off by 0.1%, which means in most cases you can’t even tell the difference unless you really squint. Compare these two plots

These are the output of a query that asks the total trip time to downtown using the famous NYC cab dataset

Can you tell which one is the 100% accurate answer and which one is the 99.9% accurate one? To most humans, these two are identical. But the top query took only 1.7 seconds while the bottom one took 42.7 seconds. This means by sacrificing 0.1% accuracy, you saved 25x on your CPU time! Let’s talk about one last scenario and then I promise I will tell you more about the “how” part.

Scenario 3 (Machine Learning and Data Science). If you’re a machine learning guru or a data scientist, you often find yourself training statistical models, parameter tuning, or just doing feature selection and engineering. One of the most common frustrations here is the massive number of parameters or features that you need to try out and the long time it takes to train machine learning models. The clusters are constantly busy training and testing different models, and this limits the set of different models and parameters that data scientists can try, or at least slows down the process.

In many applications, you can make perfectly reasonable decisions without perfectly accurate answers. A/B testing, root-cause analysis, feature selection, visualization, noisy data, or datasets with missing values are all situations where this statement is certainly true. The only time you don’t want to do this is when you’re working in a billing department!

I will probably do a separate post on how to speed up parameter tuning or feature selection.

But “how” can we speed up our queries 200x but only sacrifice a tiny bit of accuracy?

The answer lies in a technique called Approximate Query Processing or AQP for short. There are different ways of doing AQP but the simplest solution is using a random sample of your tables. It turns out, if your data is skewed, you don’t want to use a random sample as it will miss most of the outliers and rare items that might exist in your original table. So what’s more common in practice is what’s called a “stratified sample”. What is a stratified sample? Consider the following table:

Let’s say you want to run the following query on this table:

SELECT avg(salary) FROM table WHERE city = ‘Ann Arbor’

Of course you can run this query, but if this table was 100’s of millions of rows, or it was partitioned across multiple machines, it could take several minutes to finish. Instead, you might decide to run it only on a random (a.k.a. uniform) sample of it, say:

As you can see, since Ann Arbor tuples were quite rare compared to NYC tuples in the original table, chances are you might see only a few of them or none at all in your sampled table. Instead, a stratified sample would first stratify (i.e., partition) the table based on the City and then sample the rows of each city at random:

ID City Age Salary Sampling Rate 3 NYC 67 62,492 1/4 5 Ann Arbor 25 120,242 1/2

Without getting into lot of statistics, it turns out that a stratified sample like this will ensure that you get very high accuracy by processing a very small fraction of the original tuples.

Now the next question is how do you do create these samples? And how do you measure the accuracy? I have written a book chapter about this which you can read and learn how to do this yourself. But there are automated ways of doing this too. There are several products out there which you can use to hide this complexity from you, so you can just press a button and they do it for you A-Z by returning extremely fast answers and some of them even give you knob which you can turn to decide how much accuracy to trade for how much speedup:

BlinkDB / G-OLA

Although there were many AQP proposals out there, BlinkDB was perhaps the first distributed (massively parallel) AQP engine that became an open-source project. In the spirit of full disclosure, I was involved in this project, and so I may be biased here since I really liked BlinkDB’s approach and I think it really shaped many of the academic and industrial proposals that came after. The work on BlinkDB was continued by Databricks (the company commercializing Apache Spark). A while ago, Databricks announced an extension to BlinkDB which would incrementally refine its answers until the user is satisfied. This extension was called G-OLA. G-OLA has never been publicly released and BlinkDB has not been updated for a long time now.

SnappyData

SnappyData is an open source in-memory hybrid analytics platform that offers OLTP, OLAP and Streaming all in a single engine. The database engine itself is built by extending Apache Spark (and is hence fully compatible with Spark). The in-memory core also offers AQP through curated stratified sampling and other probabilistic structures. Its query syntax is similar to BlinkDB, which allows users to specify their desired accuracy. This means you can treat accuracy as a dial. For example, you can ask for 100% if you want an exact answer (which is the default behavior), but if you want a quick answer, you can also get a 99% accurate answer within a second or so. In my opinion, a key advantage of SnappyData is that it uses curated stratified samples. What this means is that you can run your analytical queries in a couple of seconds even if you’re querying terabytes of data, running queries on a laptop, or are running in a shared cluster with tens of other concurrently queries. It also has built-in support for streaming, which allows you can built samples and have them updated in real-time in response to incoming streams.

Another nice feature in SnappyData is that it comes with a number of high-level user interfaces, which means you do not need to be proficient in statistics to use its AQP features. For example, they now have a free cloud service, iSight, that uses Apache Zeppelin as a front end to visualize query responses instantly while running the full query in the background.

Disclosure: I am an advisor to SnappyData.

Presto

Facebook’s Presto had an experimental feature for approximating basic aggregate queries. I don’t really know if this feature is now in their latest release or not, but the downside was that you would have to use a different syntax (i.e., modify your SQL query) to be able to invoke those approximate aggregate features. This is troublesome if you have existing BI tools or applications that are not using this special syntax, as they won’t be able to benefit from the potential speedup unless they’re re-written to use this new syntax.

InfoBright

InfoBright offers approximate query features (called IAQ). Unlike other systems, IAQ does not use samples at all. Unfortunately there is not much published about the details of how their approximate features work and what accuracy guarantees they offer, but from reading their blogs, I believe they are building models of the underlying data and using those to answer queries instead of using samples. Again, I don’t know much more about IAQ as they’re not open-source and I wasn’t able to find many details on their website, but they sound like an interesting approach.

ABS

Analytical Bootstrap System (ABS) was another approximate query engine that would use samples and an efficient statistical technique for estimating the error. The latest code is a bit outdated and only works for earlier versions of Apache Hive. The project is not currently active.

Verdict

Verdict is a middleware that sits between your application or BI tool and your backend SQL database. You can simply issue the same queries on the existing database as before and get approximate answers right away. In principle, it is possible to use Verdict with any SQL database, which means you won’t be restricted to any specific DBMS. But currently, it only comes with drivers for Spark SQL, Hive and Impala. The advantage is that it is generic and can be made to work with any SQL database and that is it open-source, and the downside is that since it’s a middle-ware it probably isn’t as efficient as some of the commercial offerings like InfoBright or SnappyData.

Disclosure: I was the designer for Verdict.

Oracle 12C

Oracle 12C has introduced approximate count distinct and approximate percentile. These approximate aggregations improve performance and use less memory in their computation. Oracle 12C also provides materialized view support so that users can even pre-compute their approximate aggregates. Although approximate percentiles and count distinct queries are quite useful and common in practice, there is no general support for other types of queries. But given Oracle’s last user base, even these limited features will benefit a lot of users. Although, to the best of my knowledge, many other database vendors have long supported approximate count distinct queries (e.g., using HyperLogLog algorithm). Here’s a paper for those of you who are interested in learning more about these new features in Oracle 12C.