Will Athena slay BigQuery?

*Updated on 16th December 2016 – see below

With the announcement of Amazon Athena at this year’s AWS re-invent conference, I couldn’t help but notice its striking similarity with another rival cloud offering. I’m talking about Google’s BigQuery. Athena is a managed service allowing customers to query objects stored in an S3 bucket. Unlike other AWS offerings like Redshift, you only need to pay for the queries you run. There is no need to manage or pay for infrastructure that you may not be using all the time. All you need to do is define your table schema and reference your files in S3. This works in a similar way to BigQuery’s federated sources which reference files in Google Cloud Storage.

Given this, I thought it would be interesting to compare the two platforms to see how they stack up against each other. I wanted to find out which one is the fastest, which one is more feature rich and which is the most reliable.



Let’s get Started

I thought I would start by comparing the two services by doing a simple aggregation. For all my tests I will be using the publically available Github datasets. I started with the license table, it contains all of the open source GitHub repo names and their license name. It’s 3.1 million records in a 90MB file. For the first comparison I decided to use the following query:

SELECT license, COUNT(*) AS licenses FROM `bigquery-public-data.github_repos.licenses` GROUP BY license ORDER BY licenses DESC

I ran it in Bigquery first and as I expected it was really fast. It was able to complete the query in 2.8 seconds. Next up was Athena. I wasn’t really sure what to expect here, unlike BigQuery, Athena does not have the concept of native tables. It’s backed by an S3 bucket which I was expecting to slow the query down. So, I ran the query and to my surprise, the results were returned in 5.32 seconds. That’s 5.32 seconds to load in 3.1 million records from a file in S3, group the records and then sort them. I have got to say I was really impressed. However, it’s still 2.52 seconds slower then BigQuery.

External Tables

BigQuery has native tables, which are highly optimised when it comes to reading the data whereas Athena is backed by an S3 bucket which provides a convenient way of querying existing data without the need to convert it into a table. How might BigQuery stack up under the same circumstances? Well, let’s find out. BigQuery has a feature called federated sources which allow users to query files stored in Google Cloud Storage (GCS). I exported the licenses table to a CSV file in GCS (same one I used for the Athena test). I then created a federated sources table in BigQuery and referenced the licenses’ CSV file. I ran the same query and to my surprise, It took 13.7 seconds to run. I knew that using a federated sources table would be slower than a native table, however, I never expected almost 10 seconds slower.

Can we get even faster results in Athena?

Athena allows you to partition your data to get even better performance. Similar to partitioned tables in BigQuery, you will only be charged for the data in the partitions that are used in your query. Although unlike BigQuery, there is the ability to partition your data on any column of your choosing. So in the case of the Github licenses table, it makes sense to partition the table on the license column. This would allow any grouping by license name to be extremely fast. Now if your data was already partitioned into separate files it would just be a matter of creating a partitioned table which references your partitioned files in S3. However, if your data is not partitioned, as it was in my case, you will need to partition it yourself. So I found myself writing a simple script to read the license file and write it out into separate files partitioned by license name. So I basically transformed the data from a single CSV file with the schema: repo_name, license to multiple files containing only the repo name, with the file name being the license name. This was hardly ideal, however, let’s see if it was worth it. I ran the above query again and……. Boom! 2.66 seconds. That’s faster than BigQuery on a table that’s querying CSV files.

Let’s try some more data!

In the first query, we were only using the license table, which was not really that interesting. This time I would like to also join on commits table so that we can see which licenses are the most popular by the number of commits. Since the commits table was over 608GB in size, I opted for a trimmed down version. I created a version of the table which contained all repos, mapped to the number of commits recorded against that repo. This would mean joining two tables together that were 3.1 million records each.

Not a bad result, 3.05 seconds to do a join and group by. Now let’s see how the same query will perform using BigQuery?

So it’s slower as we were seeing with the previous query. The performance of Athena is obviously improved significantly by the license table being partitioned by license name. I was seeing some really good performance when dealing with relatively small datasets. What about something much, much larger? What if we try querying a table with more than 2 billion records? The files table lists all files publically stored in Github. I thought it would interesting to see what the most popular file type is based on the file extension.

I was only interested in the path column as it contains the file name so I ran the following query:

SELECT REGEXP_EXTRACT(path, r'.+\.(\w+)$') AS file_ext, COUNT(*) AS files_with_extension FROM `bigquery-public-data.github_repos.files` GROUP BY file_ext ORDER BY files_with_extension DESC

I first ran the query in BigQuery and the result was returned in 17.4 seconds – that’s 17.4 seconds to group together more than 2 billion records. That is really impressive! Athena couldn’t possibly compete with this, could it? Well, let’s find out! I put the query in, hit run and waited. Then a few minutes later…

Athena couldn’t handle 2 billion records? That’s really annoying! Maybe I could try something a little simpler. How about, which repos have the most files?

SELECT repo_name, count(*) AS repos FROM files GROUP BY repo_name ORDER BY repos DESC limit 10

Athena was able to run this query in 28.51 seconds, on the first go without any errors. The same query in BigQuery took only 15.9 seconds. Could I get better performance by partitioning my table and converting it into columnar format? Well I probably could, however, to do this for a table with more than 2 billion records would take quite a lot of time and unfortunately I didn’t get a chance to do this.

Reliability

I have been working with BigQuery quite a lot over the past few years. For the most part, I have found it to be reliable. While some days we do record several hundred failed operations, it generally succeeds when retrying the operation. I can’t say the same for Athena.

The above query, while simple, it was unable to run. I re-tried several times without success. This seemed to happen on several queries. I was just unable to get them working. However running the above query several minutes later seemed to get it to work. Given that Athena is a new product I would expect that there would be some issues and I have since been informed that this is a known issue that they are working on so it would be good to see this fixed.

User Defined Functions (UDF)

What is a UDF? In BigQuery, it’s a Javascript function which can be called as part of your query. The function will take a table row as input and returns 0 or more rows as output. This is quite powerful functionality that allows you to mix SQL and code, you can read more about it here. With Athena, however, UDFs are still not supported. They are on the roadmap which is great to hear but there is no information on how they will work. I would assume that they would work in a similar way to Apache Hive, where the UDF functions are written in Java but we will just have to wait and see.

What Athena does better than BigQuery

Can be faster than BigQuery, especially when querying federated/external tables

Ability to use regex to define a schema, allowing you to query log files without needing to change the format of the file

Can be faster and cheaper than BigQuery by running queries against tables that make use of a Partitioned/Columnar format

Tables can be partitioned on any column

Issues with Athena

It’s not easy to convert data between formats

No ability to modify any data, insert/update/delete/etc

I was randomly getting the error HIVE_UNKNOWN_ERROR which is an issue with Partitioned tables

No support for native tables

No streaming support

Struggles with really large datasets

Quoted CSV files are not fully supported

What BigQuery does better than Athena

It has native table support giving it better performance and more features than Athena

It’s easy to manipulate data, insert/update records, truncate tables, write query results back to tables etc…

Querying native tables is really fast

Easy to convert non-columnar formats into a native table for columnar queries

Supports UDFs, although they will be available in the future for Athena

Supports nested tables

Works well for petabyte scale queries

Final Thoughts

Did Athena slay BigQuery? Athena is a great option for querying data that you already have stored in S3. It offers great performance, and if you are already running your infrastructure on the AWS stack you probably have logs and other files stored in S3. This makes getting up and running with Athena as simple as creating an Athena table and pointing it to your data. You can get better performance by partitioning your data and converting it into a columnar format however this is something you need to take care of yourself. This can be a really tedious process and Athena provides no mechanism for converting your data or manipulating it in any way. It would be nice to have a feature in the console/SDK to convert tables into partitioned and columnar formats. It would also be nice to also have the ability to write the results of a query to a new table or append results to an existing table. Athena also has some work to do to fix some of its broken features and improve reliability.

BigQuery is a mature product that has been around for many years now (since 2010). It is feature rich and extremely fast. I think that allowing the user to partition tables on columns of their choice would offer a huge improvement in the performance and cost of using BigQuery. All said and done, I think BigQuery is the better product at this stage, however, it’s had a big head start over Athena which has a lot of catching up to do.

Update (16 Dec 2016)

I published this blog a week ago today and since then it has received a lot of interest. In particular, I was really intrigued by an email I received from a GM at Amazon on the Athena project. It was an offer for Amazon to convert the GitHub tables into a columnar format for me. I accepted the offer as I really wanted to get an idea of how much performance could be gained and money could be saved by running queries against a table in columnar format. I was also interested in knowing if Athena would be able to find out which file extension was the most popular on GitHub without “exhausting the resources on the cluster”.

I started with the query that did succeed on Athena. Completing in 28.51 seconds when querying a CSV file, how long would it take to the same query against a Parquet file?

Take a look at that! 11.04 seconds to complete a group by on 2 billion records! BigQuery took 15.9 seconds for the same query. It’s looking like BigQuery may be getting a run for its money when it comes to columnar queries. There is also something else I am noticing. If you look at the amount of data scanned in the Athena query it’s only 11.3 GB. The same query running in BigQuery processed 50.7GB of data. How could this be possible when they are both only scanning the repo_name column? Well, Parquet is a compressed file format which Athena can scan without uncompressing the data. This is also happening on BigQuery, however, you are still charged for processing the uncompressed data. This has a significant cost benefit as you only get charged US$5 per TB processed in both platforms.

Now, there was that query that failed last time when querying a CSV file. I was trying to find out which file extension was the most popular across all GitHub projects. Last time I exhausted the resources of Athena. How would it go this time with the Parquet file? This query took 17.4 seconds in BigQuery, let’s see what happens this time in Athena.

Aghhhh, not again! Slightly different error message but it’s still exhausting resources. Well, maybe next time Athena. Looks like you won’t be the slayer of BigQuery this time!