Using Parquet On Amazon Athena For AWS Cost Optimization

To show you how you can optimize your AWS Athena queries and save money, we will use the ‘2018 Flight On-Time Performance’ dataset from the Bureau of Transportation Statistics (bts.gov). Along the way, we will also drop a few interesting facts about US airports ✈️ queried from the dataset using Amazon Athena.

All datasets and queries used in this post can be found in our GitHub repo.

Let’s start with some terminology

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Pricing is based on the amount of data scanned by each query.

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.

Since pricing is based on the amount of data scanned, you should always optimize your dataset so each query processes the least data possible, using one or more of the following techniques: compression, partitioning, and columnar file formats. In this post, we will demonstrate the benefits of compression and of a columnar format.

Using compression will reduce the amount of data scanned by Amazon Athena, and also reduce your S3 storage costs. It’s a win-win for your AWS bill. Supported formats: GZIP, LZO, SNAPPY (Parquet) and ZLIB.
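As a quick local illustration (a Python sketch, not Athena itself), compressing a repetitive CSV-like payload with GZip shows the kind of size reduction that translates directly into less data scanned:

```python
import gzip

# Build a small CSV-like payload; flight data is highly repetitive
# (repeated dates, carriers, airport codes), which GZip exploits well.
rows = ["2018-01-01,UA,1234,ORD,LGA,15.0\n" for _ in range(10_000)]
raw = "".join(rows).encode("utf-8")

compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
print(f"ratio: {len(compressed) / len(raw):.2%}")
```

Real-world CSVs compress less dramatically than this fully repetitive sample, but the ~5x reduction in the flight dataset (666MB to 126MB) shows the effect is still substantial.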

Instead of using a row-level approach, a columnar format stores data by columns. This allows Athena to read and process only the columns a query requires and ignore the rest. If you want to learn more about columnar storage, check out [Wikipedia](https://en.wikipedia.org/wiki/RCFile) and/or the “The beauty of column-oriented data” article by Maxim Zaks.
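A toy Python sketch (illustrative only, not how Parquet is actually implemented) makes the difference concrete: with row storage you must touch every record to aggregate one field, while column storage lets you read just that one array.

```python
# Row-oriented: each record stored together, like a CSV line.
rows = [
    ("2018-01-01", "UA", "ORD", 15.0),
    ("2018-01-01", "AA", "LGA", -3.0),
    ("2018-01-02", "DL", "HNL", 0.0),
]

# Column-oriented: one array per column.
columns = {
    "flight_date": ["2018-01-01", "2018-01-01", "2018-01-02"],
    "carrier": ["UA", "AA", "DL"],
    "origin": ["ORD", "LGA", "HNL"],
    "dep_delay": [15.0, -3.0, 0.0],
}

# Row layout: computing avg(dep_delay) forces a pass over every record.
avg_from_rows = sum(r[3] for r in rows) / len(rows)

# Column layout: only the dep_delay array is touched; the other
# columns never need to be read (or, in Athena's terms, scanned).
avg_from_columns = sum(columns["dep_delay"]) / len(columns["dep_delay"])

assert avg_from_rows == avg_from_columns  # → 4.0
```

This is why the Parquet queries below scan ~8MB instead of ~666MB: only the handful of columns referenced by the query are ever read.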

Getting familiar with the data

Using the ‘2018 Flight On-Time Performance’ dataset, we will compare query performance and data scanned across three versions of the same data: raw CSV, GZip-compressed CSV, and Parquet files.

The files downloaded from the Bureau of Transportation Statistics are simple CSV files with 23 columns (such as FlightDate, Airline, Flight #, Origin, Destination, Delay, Cancelled, …).

Same data, 3 versions:

- Raw CSV: No compression, just a plain set of CSV files. Files: 12 CSV files of ~55MB each (one for each month). Total dataset size: ~666MB.

- GZip CSV: The same CSV files compressed using GZip. Files: 12 gzipped CSV files of ~10MB each (one for each month). Total dataset size: ~126MB.

- Compressed Parquet: The CSV files converted to Parquet using Spark (the same conversion could also be done with AWS Glue). Files: 12 Parquet files of ~8MB each using the default compression (Snappy). Total dataset size: ~84MB.

Find the three dataset versions on our GitHub repo.

Creating the various tables

Since the formats and compressions differ, each CREATE statement needs to tell AWS Athena which format and compression to expect.

Raw CSVs

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS flights.raw_data (
  `year` SMALLINT,
  `month` SMALLINT,
  `day_of_month` SMALLINT,
  `flight_date` STRING,
  `op_unique_carrier` STRING,
  `flight_num` STRING,
  `origin` STRING,
  `destination` STRING,
  `crs_dep_time` STRING,
  `dep_time` STRING,
  `dep_delay` DOUBLE,
  `taxi_out` DOUBLE,
  `wheels_off` STRING,
  `arr_delay` DOUBLE,
  `cancelled` DOUBLE,
  `cancellation_code` STRING,
  `diverted` DOUBLE,
  `air_time` DOUBLE,
  `carrier_delay` DOUBLE,
  `weather_delay` DOUBLE,
  `nas_delay` DOUBLE,
  `security_delay` DOUBLE,
  `late_aircraft_delay` DOUBLE
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  ESCAPED BY '\"'
  LINES TERMINATED BY '\n'
LOCATION 's3://INSERT_BUCKET_NAME/raw'
TBLPROPERTIES (
  'skip.header.line.count'='1',
  'serialization.null.format'=''
);
```

Gzipped CSVs

Athena automatically detects the GZip format (based on the “.gz” suffix), so we can reuse the query from above. See the full query here.

Apache Parquet Files using Snappy

Using both STORED AS PARQUET and "parquet.compress"="SNAPPY", the CREATE statement tells Amazon Athena everything it needs to process our Snappy-compressed Parquet data.

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS flights.parquet_snappy_data (
  `year` SMALLINT,
  `month` SMALLINT,
  `day_of_month` SMALLINT,
  `flight_date` STRING,
  `op_unique_carrier` STRING,
  `flight_num` STRING,
  `origin` STRING,
  `destination` STRING,
  `crs_dep_time` STRING,
  `dep_time` STRING,
  `dep_delay` DOUBLE,
  `taxi_out` DOUBLE,
  `wheels_off` STRING,
  `arr_delay` DOUBLE,
  `cancelled` DOUBLE,
  `cancellation_code` STRING,
  `diverted` DOUBLE,
  `air_time` DOUBLE,
  `carrier_delay` DOUBLE,
  `weather_delay` DOUBLE,
  `nas_delay` DOUBLE,
  `security_delay` DOUBLE,
  `late_aircraft_delay` DOUBLE
)
STORED AS PARQUET
LOCATION 's3://INSERT_BUCKET_NAME/parquet'
TBLPROPERTIES ("parquet.compress"="SNAPPY");
```

Picking a couple of queries to play with

Query #1: Top route in the U.S. in 2018 alongside the number of airlines providing flights for it.

Spoiler alert: it’s Chicago O’Hare International to New York LaGuardia, with 6 unique airlines offering the route.

```sql
SELECT
  origin,
  destination,
  count(*) AS total_flights,
  count(DISTINCT op_unique_carrier) AS uniq_airlines
FROM flights.parquet_snappy_data
GROUP BY origin, destination
ORDER BY total_flights DESC
LIMIT 1;
```

Query #2: Best airport to fly out from to avoid any major delay (> 30 minutes)

Spoiler alert: Honolulu International Airport has less than 4% of flights delayed by more than 30 minutes, giving you one more reason to visit Hawaii. At the other end of the ranking, Newark Airport had 16.2% of flights delayed by more than 30 minutes.

```sql
SELECT
  origin,
  percent_delayed AS "% Delayed (> 30)",
  total_flights AS "Total Flights"
FROM (
  SELECT
    origin,
    DENSE_RANK() OVER (ORDER BY percent_delayed DESC) AS worst_rank,
    DENSE_RANK() OVER (ORDER BY percent_delayed ASC) AS top_rank,
    percent_delayed,
    total_flights
  FROM (
    SELECT
      origin,
      sum(CAST(dep_delay > 30 AS DOUBLE)) / count(*) AS percent_delayed,
      count(*) AS total_flights
    FROM flights.parquet_snappy_data
    GROUP BY 1
    HAVING count(*) > 10000
  ) AS t1
) AS t2
WHERE top_rank <= 1 OR worst_rank <= 3;
```

Comparing performances

For this comparison, I ran each query 10 times and averaged the numbers, using the raw CSV dataset as our baseline.

Query #1
- Raw: 665.71 MB scanned in 2.60 seconds. $0.00333 per query.
- GZip: 125.54 MB scanned in 2.08 seconds. $0.00063 per query (-81% savings).
- Parquet: 8.29 MB scanned in 1.26 seconds. $0.000050 per query (-98% savings).

Query #2
- Raw: 665.71 MB scanned in 2.60 seconds. $0.00333 per query.
- GZip: 125.54 MB scanned in 2.08 seconds. $0.00063 per query (-81% savings).
- Parquet: 8.29 MB scanned in 0.81 seconds. $0.000062 per query (-98% savings).
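These per-query costs are consistent with Athena’s published pricing at the time: $5 per TB of data scanned, with a 10 MB minimum charge per query. A small Python sketch (assuming a decimal TB of 10^6 MB, as AWS pricing uses) reproduces the figures:

```python
PRICE_PER_TB = 5.00    # Athena: $5 per TB of data scanned
MIN_MB = 10            # 10 MB minimum charged per query
MB_PER_TB = 1_000_000  # decimal TB

def query_cost(mb_scanned: float) -> float:
    """Estimated Athena cost for a single query."""
    billed_mb = max(mb_scanned, MIN_MB)
    return billed_mb / MB_PER_TB * PRICE_PER_TB

print(f"Raw CSV:  ${query_cost(665.71):.5f}")  # ~$0.00333
print(f"GZip CSV: ${query_cost(125.54):.5f}")  # ~$0.00063
print(f"Parquet:  ${query_cost(8.29):.6f}")    # ~$0.000050 (minimum charge applies)
```

Note that the Parquet queries scan so little data that they hit the 10 MB minimum; the exact billed figures can vary slightly with the bytes actually scanned per run.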



Conclusions

Thanks to Apache Parquet’s columnar format, AWS Athena reads only the columns needed by the query. This cut our query times by more than 50% and reduced the query cost by 98%.

If you have any questions about Amazon Athena, Apache Parquet or CloudForecast, feel free to ping me via email at francois@cloudforecast.io or on Twitter: @francoislagier.

Want to try CloudForecast? Sign up today and get started with a risk-free 30 day free trial. No credit card required.