Amazon Athena is server-less way to query your data that lives on S3 using SQL. It excels with datasets that are anywhere up to multiple petabytes in size. It's interface is a simple web page that you can access from the AWS console. Queries cost $5 per terabyte of data scanned with a 10 MB minimum per query. If a query fails you won't be charged for it but if you cancel a query part way through you'll be charged for the data scanned up until that point.

Data can be stored in CSV, JSON, ORC, Parquet and even Apache web logs format. You can even use compressed CSV files in GZIP format to save on query costs and improve performance over regular, uncompressed CSV files.

Athena can execute queries from either the us-east-1 (North Virginia) or the us-west-2 (Oregon) regions though the S3 data being queried can live in other parts of the world.

It's important to note that Athena is not a general purpose database. Under the hood is Presto, a query execution engine that runs on top of the Hadoop stack. Athena's purpose is to ask questions rather than insert records quickly or update random records with low latency.

That being said, Presto's performance, given it can work on some of the world's largest datasets, is impressive. Presto is used daily by analysts at Facebook on their multi-petabyte data warehouse so the fact that such a powerful tool is available via a simple web interface with no servers to manage is pretty amazing to say the least.

In this blog post I'll see how fast AWS Athena can query 1.1 billion taxi rides. The set of data I'm working with was originally put together from their original data sources with scripts written by Todd W Schneider. In my Billion Taxi Rides in Redshift blog post I exported the data from PostgreSQL into denormalised CSV files and compressed them into 56 gzip files. The files in GZIP form are 104 GB in size and decompressed take up around 500 GB of space. This is the same dataset I've used to benchmark BigQuery, Dataproc, Elasticsearch, EMR, MapD, PostgreSQL and Redshift.

AWS CLI Up and Running All the following commands were run on a fresh install of Ubuntu 14.04.3. To start, I'll install the AWS CLI tool and a few dependencies it needs to run. $ sudo apt update $ sudo apt install \ python-pip \ python-virtualenv $ virtualenv amazon $ source amazon/bin/activate $ pip install awscli I'll then enter my AWS credentials. $ read AWS_ACCESS_KEY_ID $ read AWS_SECRET_ACCESS_KEY $ export AWS_ACCESS_KEY_ID $ export AWS_SECRET_ACCESS_KEY I'll run configure to make sure us-east-1 is my default region. $ aws configure AWS Access Key ID [********************]: AWS Secret Access Key [********************]: Default region name [us-east-1]: us-east-1 Default output format [None]: The following will allow for 100 concurrent requests to run when interacting with S3 and does a good job of saturating my Internet connection. $ aws configure set \ default.s3.max_concurrent_requests \ 100 I've created an S3 bucket in the US Standard region called "taxis-ath". The following will sync the 56 compressed CSV files on my system with that bucket into a folder called "csv". $ aws s3 sync 20M-blocks/ \ s3://taxis-ath/csv/

Setting Up Athena When you first open the Athena web interface there is a step-by-step setup wizard to help you create a table and point it at your data. There is a step where you need to add each of the fields in for the table using buttons and drop downs. Because I have 51 fields and they can be added via SQL I simply created a dummy table to just get past the wizard and get to the query editor where I was able to run the following to setup my trips table. CREATE EXTERNAL TABLE trips_csv ( trip_id INT , vendor_id VARCHAR ( 3 ), pickup_datetime TIMESTAMP , dropoff_datetime TIMESTAMP , store_and_fwd_flag VARCHAR ( 1 ), rate_code_id SMALLINT , pickup_longitude DECIMAL ( 18 , 14 ), pickup_latitude DECIMAL ( 18 , 14 ), dropoff_longitude DECIMAL ( 18 , 14 ), dropoff_latitude DECIMAL ( 18 , 14 ), passenger_count SMALLINT , trip_distance DECIMAL ( 6 , 3 ), fare_amount DECIMAL ( 6 , 2 ), extra DECIMAL ( 6 , 2 ), mta_tax DECIMAL ( 6 , 2 ), tip_amount DECIMAL ( 6 , 2 ), tolls_amount DECIMAL ( 6 , 2 ), ehail_fee DECIMAL ( 6 , 2 ), improvement_surcharge DECIMAL ( 6 , 2 ), total_amount DECIMAL ( 6 , 2 ), payment_type VARCHAR ( 3 ), trip_type SMALLINT , pickup VARCHAR ( 50 ), dropoff VARCHAR ( 50 ), cab_type VARCHAR ( 6 ), precipitation SMALLINT , snow_depth SMALLINT , snowfall SMALLINT , max_temperature SMALLINT , min_temperature SMALLINT , average_wind_speed SMALLINT , pickup_nyct2010_gid SMALLINT , pickup_ctlabel VARCHAR ( 10 ), pickup_borocode SMALLINT , pickup_boroname VARCHAR ( 13 ), pickup_ct2010 VARCHAR ( 6 ), pickup_boroct2010 VARCHAR ( 7 ), pickup_cdeligibil VARCHAR ( 1 ), pickup_ntacode VARCHAR ( 4 ), pickup_ntaname VARCHAR ( 56 ), pickup_puma VARCHAR ( 4 ), dropoff_nyct2010_gid SMALLINT , dropoff_ctlabel VARCHAR ( 10 ), dropoff_borocode SMALLINT , dropoff_boroname VARCHAR ( 13 ), dropoff_ct2010 VARCHAR ( 6 ), dropoff_boroct2010 VARCHAR ( 7 ), dropoff_cdeligibil VARCHAR ( 1 ), dropoff_ntacode VARCHAR ( 4 ), dropoff_ntaname VARCHAR ( 56 ), dropoff_puma VARCHAR ( 4 ) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://taxis-ath/csv/' ;

The CSV Benchmark With the table setup I ran the following queries to see how quickly they could return their results. The times quoted below are the lowest query durations seen. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed". The following completed in 1 minute 53 seconds and scanned 104.26 GB of data. SELECT cab_type , count ( * ) FROM trips_csv GROUP BY cab_type ; The following completed in 1 minute 56 seconds and scanned 104.25 GB of data. SELECT passenger_count , avg ( total_amount ) FROM trips_csv GROUP BY passenger_count ; The following completed in 1 minute 54 seconds and scanned 104.26 GB of data. SELECT passenger_count , year ( pickup_datetime ), count ( * ) FROM trips_csv GROUP BY passenger_count , year ( pickup_datetime ); The following completed in 2 minutes 35 seconds and scanned 104.3 GB of data. SELECT passenger_count , year ( pickup_datetime ) trip_year , round ( trip_distance ), count ( * ) trips FROM trips_csv GROUP BY passenger_count , year ( pickup_datetime ), round ( trip_distance ) ORDER BY trip_year , trips desc ;

Optimising using Columnar Storage Unlike Redshift and BigQuery, Athena doesn't store your data internally so it won't sort, compress and organise the data into columns for you. CSV files don't have summaries of the data they contain in each column and not even by groups of rows. If data is laid out one column at a time then a query can use just its columns of interest rather than every row in the dataset as it hunts down the columns of data it's interested in. If that data is sorted then it can have summaries every x many records giving statistics on those values. This allows a query to skip over whole sections of a dataset. For queries where looking at the value of each cell of information is required, if it reads the data off the disk in a compressed form then it'll not be bottlenecked by the storage device so much. Reading 1 GB of data off a disk will always be faster than reading 10 GB of data off a disk. The above three optimisations are the kinds you get when you store your data in ORC or Parquet format. If you convert your CSV data into one of these formats you will not only lower your costs of each query, you'll see significant speed increases as well. You can think of ORC and Parquet as the ZIP and RAR formats of the Hadoop ecosystem. Normally in a query execution tool like Hive on a Hadoop cluster you can run the following to read data from a CSV-formatted table into an ORC formatted table. INSERT INTO trips_orc SELECT * FROM trips_csv ; Unfortunately, Athena will return an error stating this functionality has been restricted from their interface. Your query has the following error(s): Queries of this type are not supported (Service: AmazonAthena; Status Code: 400; Error Code: InvalidRequestException; Request ID: 1a5d7606-c117-11e6-8d0b-2dd8e091a03e)

Setting up an EMR Cluster To get around the data conversion restrictions in Athena I'll launch an EMR cluster to convert the data. Do note, there are other ways, including using a stand-alone Spark installation, to convert your data from CSV into Parquet and/or ORC format. Using EMR allows me to convert the data with a minimal amount of shell and SQL scripts, is very reliable and can be scaled out to the petabyte level if need be. I'll be using a cluster of m3.xlarge instances in the us-east-1 region running EMR 5.2.0 on Amazon's 2.7.3 Hadoop Distribution with Hive as my only optional install. There will be one master node which will be an on-demand instance as we cannot afford to have it disappear on us and 4 spot instances, 2 core nodes and 2 task nodes, which Hadoop should be able to handle the loss of should one of them disappear due to rising spot prices. The machines have a combined 20 vCPUs, 75 GB of RAM and 400 GB of SSD capacity across 10 drives. The master node costs $0.266 / hour and I've set the maximum price I'm willing to spend on the other 4 nodes to $0.06 / hour each. I ran the following to create the cluster. It was provisioned and bootstrapped in 9 minutes. $ aws emr create-cluster \ --name 'My cluster' \ --region us-east-1 \ --release-label emr-5.2.0 \ --applications \ Name = Hadoop \ Name = Hive \ --instance-groups '[{ "InstanceCount":2, "BidPrice":"0.06", "InstanceGroupType":"CORE", "InstanceType":"m3.xlarge", "Name":"Core - 2" },{ "InstanceCount":2, "BidPrice":"0.06", "InstanceGroupType":"TASK", "InstanceType":"m3.xlarge", "Name":"Task - 3" },{ "InstanceCount":1, "InstanceGroupType":"MASTER", "InstanceType":"m3.xlarge", "Name":"Master - 1" }]' \ --ec2-attributes '{ "KeyName":"emr", "InstanceProfile":"EMR_EC2_DefaultRole", "SubnetId":"subnet-acf1bfda", "EmrManagedSlaveSecurityGroup":"sg-2d321350", "EmrManagedMasterSecurityGroup":"sg-3332134e" }' \ --auto-scaling-role EMR_AutoScaling_DefaultRole \ --service-role EMR_DefaultRole \ --termination-protected \ --enable-debugging \ --scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \ --log-uri 's3n://aws-logs-591231097547-us-east-1/elasticmapreduce/' With the cluster up I can SSH into the master node. I've added a ServerAliveInterval parameter to my SSH connection settings so that my terminal shouldn't time out if there is a period of no conversation between my machine and the master node. $ ssh -i ~/.ssh/emr.pem \ -o ServerAliveInterval = 50 \ hadoop@54.89.167.90 Once connected, I was greeted with the following. __| __|_ ) _| ( / Amazon Linux AMI ___|\___|___| https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/ 6 package(s) needed for security, out of 10 available Run "sudo yum update" to apply all updates. EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R E::::E M::::::M:::M M:::M::::::M R:::R R::::R E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R E::::E M:::::M M:::M M:::::M R:::R R::::R E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR [hadoop@ip-172-30-3-18 ~]$ I'll first enter a screen so I can leave work running and not worry about an SSH connection failure stopping my workload and then I'll launch Hive. $ screen $ hive I'll create a table for my source CSV data. It will use the existing s3://taxis-ath/csv/ folder of data that I uploaded earlier. Once this table has been created the data is accessible right away. CREATE EXTERNAL TABLE trips_csv ( trip_id INT , vendor_id VARCHAR ( 3 ), pickup_datetime TIMESTAMP , dropoff_datetime TIMESTAMP , store_and_fwd_flag VARCHAR ( 1 ), rate_code_id SMALLINT , pickup_longitude DECIMAL ( 18 , 14 ), pickup_latitude DECIMAL ( 18 , 14 ), dropoff_longitude DECIMAL ( 18 , 14 ), dropoff_latitude DECIMAL ( 18 , 14 ), passenger_count SMALLINT , trip_distance DECIMAL ( 6 , 3 ), fare_amount DECIMAL ( 6 , 2 ), extra DECIMAL ( 6 , 2 ), mta_tax DECIMAL ( 6 , 2 ), tip_amount DECIMAL ( 6 , 2 ), tolls_amount DECIMAL ( 6 , 2 ), ehail_fee DECIMAL ( 6 , 2 ), improvement_surcharge DECIMAL ( 6 , 2 ), total_amount DECIMAL ( 6 , 2 ), payment_type VARCHAR ( 3 ), trip_type SMALLINT , pickup VARCHAR ( 50 ), dropoff VARCHAR ( 50 ), cab_type VARCHAR ( 6 ), precipitation SMALLINT , snow_depth SMALLINT , snowfall SMALLINT , max_temperature SMALLINT , min_temperature SMALLINT , average_wind_speed SMALLINT , pickup_nyct2010_gid SMALLINT , pickup_ctlabel VARCHAR ( 10 ), pickup_borocode SMALLINT , pickup_boroname VARCHAR ( 13 ), pickup_ct2010 VARCHAR ( 6 ), pickup_boroct2010 VARCHAR ( 7 ), pickup_cdeligibil VARCHAR ( 1 ), pickup_ntacode VARCHAR ( 4 ), pickup_ntaname VARCHAR ( 56 ), pickup_puma VARCHAR ( 4 ), dropoff_nyct2010_gid SMALLINT , dropoff_ctlabel VARCHAR ( 10 ), dropoff_borocode SMALLINT , dropoff_boroname VARCHAR ( 13 ), dropoff_ct2010 VARCHAR ( 6 ), dropoff_boroct2010 VARCHAR ( 7 ), dropoff_cdeligibil VARCHAR ( 1 ), dropoff_ntacode VARCHAR ( 4 ), dropoff_ntaname VARCHAR ( 56 ), dropoff_puma VARCHAR ( 4 ) ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://taxis-ath/csv/' ; I'll then create a table to store the data in ORC Format. The data for this table will live in s3://taxis-ath/orc/ . CREATE EXTERNAL TABLE trips_orc ( trip_id INT , vendor_id STRING , pickup_datetime TIMESTAMP , dropoff_datetime TIMESTAMP , store_and_fwd_flag STRING , rate_code_id SMALLINT , pickup_longitude DOUBLE , pickup_latitude DOUBLE , dropoff_longitude DOUBLE , dropoff_latitude DOUBLE , passenger_count SMALLINT , trip_distance DOUBLE , fare_amount DOUBLE , extra DOUBLE , mta_tax DOUBLE , tip_amount DOUBLE , tolls_amount DOUBLE , ehail_fee DOUBLE , improvement_surcharge DOUBLE , total_amount DOUBLE , payment_type STRING , trip_type SMALLINT , pickup STRING , dropoff STRING , cab_type STRING , precipitation SMALLINT , snow_depth SMALLINT , snowfall SMALLINT , max_temperature SMALLINT , min_temperature SMALLINT , average_wind_speed SMALLINT , pickup_nyct2010_gid SMALLINT , pickup_ctlabel STRING , pickup_borocode SMALLINT , pickup_boroname STRING , pickup_ct2010 STRING , pickup_boroct2010 STRING , pickup_cdeligibil STRING , pickup_ntacode STRING , pickup_ntaname STRING , pickup_puma STRING , dropoff_nyct2010_gid SMALLINT , dropoff_ctlabel STRING , dropoff_borocode SMALLINT , dropoff_boroname STRING , dropoff_ct2010 STRING , dropoff_boroct2010 STRING , dropoff_cdeligibil STRING , dropoff_ntacode STRING , dropoff_ntaname STRING , dropoff_puma STRING ) STORED AS orc LOCATION 's3://taxis-ath/orc/' ; I'll then create a table to store the data in Parquet Format. The data for this table will live in s3://taxis-ath/parquet/ . CREATE EXTERNAL TABLE trips_parquet ( trip_id INT , vendor_id STRING , pickup_datetime TIMESTAMP , dropoff_datetime TIMESTAMP , store_and_fwd_flag STRING , rate_code_id SMALLINT , pickup_longitude DOUBLE , pickup_latitude DOUBLE , dropoff_longitude DOUBLE , dropoff_latitude DOUBLE , passenger_count SMALLINT , trip_distance DOUBLE , fare_amount DOUBLE , extra DOUBLE , mta_tax DOUBLE , tip_amount DOUBLE , tolls_amount DOUBLE , ehail_fee DOUBLE , improvement_surcharge DOUBLE , total_amount DOUBLE , payment_type STRING , trip_type SMALLINT , pickup STRING , dropoff STRING , cab_type STRING , precipitation SMALLINT , snow_depth SMALLINT , snowfall SMALLINT , max_temperature SMALLINT , min_temperature SMALLINT , average_wind_speed SMALLINT , pickup_nyct2010_gid SMALLINT , pickup_ctlabel STRING , pickup_borocode SMALLINT , pickup_boroname STRING , pickup_ct2010 STRING , pickup_boroct2010 STRING , pickup_cdeligibil STRING , pickup_ntacode STRING , pickup_ntaname STRING , pickup_puma STRING , dropoff_nyct2010_gid SMALLINT , dropoff_ctlabel STRING , dropoff_borocode SMALLINT , dropoff_boroname STRING , dropoff_ct2010 STRING , dropoff_boroct2010 STRING , dropoff_cdeligibil STRING , dropoff_ntacode STRING , dropoff_ntaname STRING , dropoff_puma STRING ) STORED AS parquet LOCATION 's3://taxis-ath/parquet/' ; I've also created both the trips_orc and trips_parquet tables in Athena's query editor as well so once those tables' S3 buckets are populated by Hive their data will be available in Athena's interface as well.

Populating the Tables The following Hive query will populate the ORC-formatted table. The following completed in 2 hours and 47 minutes. INSERT INTO TABLE trips_orc SELECT * FROM trips_csv ; Query ID = hadoop_20161214085649_60eb11e1-cf66-4e44-8d3b-2fff1eab86f1 Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1481705329458_0001) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 56 56 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 9509.19 s ---------------------------------------------------------------------------------------------- Loading data to table default.trips_orc OK Time taken: 10029.06 seconds The following Hive query will populate the Parquet-formatted table. The following completed in 3 hours and 13 minutes (~15% slower than the ORC table). INSERT INTO TABLE trips_parquet SELECT * FROM trips_csv ; Query ID = hadoop_20161214211841_5e63d67d-7f29-4e54-9790-bbb8a864d97b Total jobs = 1 Launching Job 1 out of 1 Status: Running (Executing on YARN cluster with App id application_1481749215350_0001) ---------------------------------------------------------------------------------------------- VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED ---------------------------------------------------------------------------------------------- Map 1 .......... container SUCCEEDED 56 56 0 0 0 0 ---------------------------------------------------------------------------------------------- VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 10315.67 s ---------------------------------------------------------------------------------------------- Loading data to table default.trips_parquet OK Time taken: 11623.85 seconds These tables were populated using stock Hive settings for their formats. There is a lot of room for tuning each format's parameters though I've found EMR's Hive settings have had a lot of attention already by AWS' engineers and are pretty good out of the box. The above queries did run at very different times of the day so I wouldn't consider this a good ORC vs Parquet shoot-out. Also, different tools in the Hadoop ecosystem have different levels of performance optimisation for each of these formats so it's important to know where your data will be consumed and what the query patterns will look like before deciding on which format to go with for your dataset.

The ORC Benchmark With the ORC data loaded I ran the following queries in Athena's query editor. The times quoted below are the lowest query durations seen. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed". The following completed in 7.5 seconds and scanned 52.81 MB of data. SELECT cab_type , count ( * ) FROM trips_orc GROUP BY cab_type ; The following completed in 4.99 seconds and scanned 2.69 GB of data. SELECT passenger_count , avg ( total_amount ) FROM trips_orc GROUP BY passenger_count ; The following completed in 4.85 seconds and scanned 4.87 GB of data. SELECT passenger_count , year ( pickup_datetime ), count ( * ) FROM trips_orc GROUP BY passenger_count , year ( pickup_datetime ); The following completed in 3.98 seconds and scanned 7.26 GB of data. SELECT passenger_count , year ( pickup_datetime ) trip_year , round ( trip_distance ), count ( * ) trips FROM trips_orc GROUP BY passenger_count , year ( pickup_datetime ), round ( trip_distance ) ORDER BY trip_year , trips desc ; The 52.81 MB of scanned data in the first query is very significant. When this query ran on the CSV table there was 104.26 GB of data scanned. You're charged based on amount of data scanned in each query on Athena so running the same query on ORC data made it ~2,022x cheaper. The ORC query was also 15x faster as well. This is why it's important to setup an ETL process to convert a copy of your data from CSV, JSON and whatnot into a columnar format like ORC. I tend to use Airflow for setting up these sorts of automated tasks though there are more AWS-centric ways of tackling this problem as well. All of the above queries were miles quicker and cheaper to run with the data in ORC format compared to compressed CSVs.

The Parquet Benchmark With the ORC benchmark complete I turned to running the same queries on the Parquet-formatted table. The times quoted below are the lowest query durations seen. As with all my benchmarks, I use the lowest query time as a way of indicating "top speed". The following completed in 6.41 seconds and scanned 603.69 KB of data. SELECT cab_type , count ( * ) FROM trips_parquet GROUP BY cab_type ; The following completed in 6.19 seconds and scanned 102.29 GB of data. SELECT passenger_count , avg ( total_amount ) FROM trips_parquet GROUP BY passenger_count ; The following completed in 6.09 seconds and scanned 102.29 GB of data. SELECT passenger_count , year ( pickup_datetime ), count ( * ) FROM trips_parquet GROUP BY passenger_count , year ( pickup_datetime ); The following completed in 6.63 seconds and scanned 102.3 GB of data. SELECT passenger_count , year ( pickup_datetime ) trip_year , round ( trip_distance ), count ( * ) trips FROM trips_parquet GROUP BY passenger_count , year ( pickup_datetime ), round ( trip_distance ) ORDER BY trip_year , trips desc ; The query times generally didn't outperform the ORC times but the amount of data being scanned is interesting. 3 of the 4 queries scanned almost as much data as the CSV queries making them almost the same price on Athena. But the first query was able to take advantage of Parquet's summary statistics and only read 603.69 KB of data. There is a 10 MB minimum to each query on Athena so the cost was only 10,676x cheaper than querying the CSV data and around ~5x cheaper than the same query on the ORC-formatted table. I'm suspecting there are improvements that could be made in the way Presto works with Parquet-formatted data. I've seen in the past that ORC-formatted tables generally performed better but I haven't examined their release notes in detail for a while now and there could be some interesting areas of potential optimisation revealed in them.