UPDATE – I have a more modern version of this post with larger data sets available here.

This data analysis project explores what insights can be derived from the Airline On-Time Performance data set collected by the United States Department of Transportation. The data can be downloaded in one-month chunks from the Bureau of Transportation Statistics website as raw CSV files, which Spark can easily load. However, if you download 10+ years of data from the Bureau of Transportation Statistics (meaning you downloaded 120+ one-month CSV files from the site), the files collectively represent 30+ GB of data. For a commercial-scale Spark cluster, 30 GB of text data is a trivial task. However, if you are running Spark on the ODROID XU4 cluster or in local mode on your Mac laptop, 30+ GB of text data is substantial. So, before we can do any analysis of the data set, we need to transform it into a format that will allow us to interact with it quickly and efficiently.

Fortunately, data frames and the Parquet file format fit the bill nicely. Parquet is a compressed columnar file format. Columnar file formats greatly enhance data file interaction speed and compression by organizing data by columns rather than by rows. The two main advantages of a columnar format are that queries deserialize only the data that is actually needed, and compression is frequently much better since columns frequently contain highly repeated values.
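To make the row-versus-column distinction concrete, here is a minimal pure-Python sketch (not Parquet itself, and the sample records are made up) of the same three flight records stored both ways. Fetching one field touches every record in the row layout, but only a single contiguous list in the columnar layout:

```python
# Three hypothetical flight records in a row-oriented layout (like CSV) ...
rows = [
    {"Carrier": "AA", "TailNum": "N123AA", "Distance": 1024.0},
    {"Carrier": "AA", "TailNum": "N124AA", "Distance": 350.0},
    {"Carrier": "DL", "TailNum": "N800DL", "Distance": 1024.0},
]

# ... and the same records in a column-oriented layout (like Parquet).
columns = {
    "Carrier":  ["AA", "AA", "DL"],
    "TailNum":  ["N123AA", "N124AA", "N800DL"],
    "Distance": [1024.0, 350.0, 1024.0],
}

# Row layout: fetching just the carriers still visits every record.
carriers_from_rows = [r["Carrier"] for r in rows]

# Columnar layout: the carriers are already contiguous -- one lookup,
# and the other columns are never touched.
carriers_from_columns = columns["Carrier"]

print(carriers_from_rows == carriers_from_columns)  # same data, cheaper access
```

A real columnar file adds encoding and compression on top of this layout, but the access pattern shown here is the heart of why only the queried columns get deserialized.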

To explain why the first benefit is so impactful, consider a structured data table with the following format:

FlightData
 |-- Year: integer (nullable = true)
 |-- FlightDate: date (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- AirlineID: long (nullable = true)
 |-- Carrier: string (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)

And for the sake of discussion, consider this query against the table:

SELECT Carrier, Year, COUNT( DISTINCT TailNum )
FROM FlightData
GROUP BY Carrier, Year

As you can see, only three fields from the original table matter to this query: Carrier, Year, and TailNum. In a traditional row format, such as CSV, in order for a data engine (such as Spark) to get the relevant fields from each row, it actually has to read the entire row of data. If the data table has many columns and the query is only interested in three, the data engine will be forced to deserialize much more data off the disk than is needed. In any data operation, reading the data off disk is frequently the slowest step, so anything that reduces the amount of data read can speed up the operation significantly. This is exactly the problem a columnar data format like Parquet is designed to solve.

A CSV file is a row-centric format. If you want to interact with a large data table backed by CSV files, it will be slow. You can, however, speed up your interactions with the CSV data by converting it to a columnar format. This will be our first goal with the Airline On-Time Performance data.

One thing to note with the process described below: I am using QFS with Spark to do my analysis. If you prefer to use HDFS with Spark, simply update all file paths and file system commands as appropriate.

Getting Airline On-Time Performance Data

As indicated above, the Airline On-Time Performance data is available at the Bureau of Transportation Statistics website. The challenge with downloading the data is that you can only download one month at a time. Therefore, to download 10 years' worth of data, you would have to adjust the selected month and download 120 times. This is time consuming. But I went ahead and downloaded eleven years' worth of data so you don't have to. You can download it here:

I have also made a smaller, 3-year data set available here:

Note that expanding the 11 year data set will create a folder that is 33 GB in size. If you are doing this on the master node of the ODROID cluster, that is far too large for the eMMC drive. You could expand the file into the MicroSD card found at the /data mount point, but I wouldn’t recommend it as that is half the MicroSD card’s space (at least the 64 GB size I originally specced). I would suggest two workable options: attach a sufficiently sized USB thumb drive to the master node (ideally a USB 3 thumb drive) and use that as a working drive, or download the data to your personal computer or laptop and access the data from the master node through a file sharing method.

I went with the second method. To “mount” my Mac laptop from the cluster’s master node, I used sshfs, which simulates a mounted hard drive through behind-the-scenes SSH and SCP commands. This, of course, required my Mac laptop to have SSH connections turned on. To install sshfs and create a mount point:

sudo apt-get install sshfs
sudo mkdir /mnt/macbookpro
sudo sshfs -o allow_other myaccount@XX.XX.XX.XX:/ /mnt/macbookpro

Update the name of the mount point, IP address of your computer, and your account on that computer as necessary.

Once you have downloaded and uncompressed the data set, the next step is to place the data on the distributed file system. QFS has some nice tools that mirror many of the HDFS tools and enable you to do this easily. The key command is the cptoqfs command. I prefer uploading the files to the file system one at a time. To do that, I wrote this script (update the various file paths for your setup):

#!/bin/bash
for filename in /mnt/macbookpro/path/to/airline-data/*.csv; do
    echo "uploading - $filename"
    cptoqfs -s master -p 20000 -d "$filename" -k /user/michael/data/airline -r 2
done

This will take a couple of hours on the ODROID XU4 cluster, as you are uploading 33 GB of data.

Converting Data to a Partitioned Parquet File

The next step is to convert all those CSV files uploaded to QFS into the Parquet columnar format. This will be challenging on our ODROID XU4 cluster because there is not sufficient RAM across all the nodes to hold all of the CSV files for processing. Furthermore, the cluster can easily run out of disk space, or the computations can become unnecessarily slow, if the means by which we combine the 11 years' worth of CSVs requires a significant amount of shuffling of data between nodes. In general, shuffling data between nodes should be minimized, regardless of your cluster's size.

To minimize the need to shuffle data between nodes, we are going to transform each CSV file directly into a partition within the overall Parquet file. A partition is a subset of the data that all share the same value for a particular key. Parquet files can create partitions through a folder naming strategy. For example, if data in a Parquet file is to be partitioned by the field named year, the Parquet file’s folder structure would look like this:

data.parquet/
|-- year=2005/
|   |-- part-r-00000.parquet
|   |-- part-r-00001.parquet
|-- year=2006/
|   |-- part-r-00002.parquet
|   |-- part-r-00003.parquet
|-- year=2007/
|   |-- part-r-00004.parquet
|   |-- part-r-00005.parquet
.
.
.
|-- year=2016/
|   |-- part-r-00023.parquet
|   |-- part-r-00024.parquet

The advantage of partitioning data in this manner is that a client of the data only needs to read a subset of the data if it is only interested in a subset of the partitioning key values. As an example, consider this SQL query:

SELECT Carrier, Year, COUNT( DISTINCT TailNum )
FROM FlightData
WHERE Year IN (2006, 2007, 2008)
GROUP BY Carrier, Year

The WHERE clause indicates that the query is only interested in the years 2006 through 2008. A data set partitioned by year can take advantage of this fact: only the partitions for the targeted years need to be read when calculating the query's results. As a result, the partitioning greatly speeds up the query by reducing the amount of data that needs to be deserialized from disk.
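Partition pruning can be illustrated outside of Spark with a small pure-Python sketch. Given the year=YYYY folder layout shown above, a query engine parses the key out of each folder name and only descends into folders whose value survives the WHERE filter (the folder list here is constructed for illustration, not read from a real file system):

```python
# Hypothetical partition folders following the year=YYYY naming scheme,
# covering 2005 through 2016 as in the directory tree above.
partitions = ["year=%d" % y for y in range(2005, 2017)]

# The years the example query's WHERE clause asks for.
wanted_years = {2006, 2007, 2008}

def prune(dirs, years):
    # Keep only the folders whose partition-key value passes the filter;
    # everything else is skipped without reading a single data file.
    kept = []
    for d in dirs:
        key, value = d.split("=")
        if key == "year" and int(value) in years:
            kept.append(d)
    return kept

print(prune(partitions, wanted_years))  # only 3 of the 12 folders are read
```

Spark performs this pruning automatically when a filter on the partitioning column appears in the query, which is why the folder-naming convention matters.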

The other property of partitioned Parquet files we are going to take advantage of is that each partition within the overall file can be created and written fairly independently of all other partitions. What this means is that one node in the cluster can write one partition with very little coordination with the other nodes, most notably with very little to no need to shuffle data between nodes. For 11 years of the airline data set there are 132 different CSV files. Since those 132 CSV files were already effectively partitioned, we can minimize the need for shuffling by mapping each CSV file directly into its partition within the Parquet file.

Since each CSV file in the Airline On-Time Performance data set represents exactly one month of data, the natural partitioning to pursue is by month.

The Code

So now that we understand the plan, let's execute on it. The first step is to load each CSV file into a data frame. An important element of doing this is setting the schema for the data frame. Create a notebook in Jupyter dedicated to this data transformation, and enter this into the first cell:

from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import DoubleType, IntegerType, StringType, DecimalType, LongType

air_schema = StructType([
    StructField("Year", IntegerType()), StructField("Quarter", IntegerType()),
    StructField("Month", IntegerType()), StructField("DayofMonth", IntegerType()),
    StructField("DayOfWeek", IntegerType()), StructField("FlightDate", StringType()),
    StructField("UniqueCarrier", StringType()), StructField("AirlineID", LongType()),
    StructField("Carrier", StringType()), StructField("TailNum", StringType()),
    StructField("FlightNum", IntegerType()), StructField("OriginAirportID", IntegerType()),
    StructField("OriginAirportSeqID", IntegerType()), StructField("OriginCityMarketID", IntegerType()),
    StructField("Origin", StringType()), StructField("OriginCityName", StringType()),
    StructField("OriginState", StringType()), StructField("OriginStateFips", IntegerType()),
    StructField("OriginStateName", StringType()), StructField("OriginWac", IntegerType()),
    StructField("DestAirportID", IntegerType()), StructField("DestAirportSeqID", IntegerType()),
    StructField("DestCityMarketID", IntegerType()), StructField("Dest", StringType()),
    StructField("DestCityName", StringType()), StructField("DestState", StringType()),
    StructField("DestStateFips", IntegerType()), StructField("DestStateName", StringType()),
    StructField("DestWac", IntegerType()), StructField("CRSDepTime", StringType()),
    StructField("DepTime", StringType()), StructField("DepDelay", DoubleType()),
    StructField("DepDelayMinutes", DoubleType()), StructField("DepDel15", DoubleType()),
    StructField("DepartureDelayGroups", IntegerType()), StructField("DepTimeBlk", StringType()),
    StructField("TaxiOut", DoubleType()), StructField("WheelsOff", StringType()),
    StructField("WheelsOn", StringType()), StructField("TaxiIn", DoubleType()),
    StructField("CRSArrTime", StringType()), StructField("ArrTime", StringType()),
    StructField("ArrDelay", DoubleType()), StructField("ArrDelayMinutes", DoubleType()),
    StructField("ArrDel15", DoubleType()), StructField("ArrivalDelayGroups", IntegerType()),
    StructField("ArrTimeBlk", StringType()), StructField("Cancelled", DoubleType()),
    StructField("CancellationCode", StringType()), StructField("Diverted", DoubleType()),
    StructField("CRSElapsedTime", DoubleType()), StructField("ActualElapsedTime", DoubleType()),
    StructField("AirTime", DoubleType()), StructField("Flights", DoubleType()),
    StructField("Distance", DoubleType()), StructField("DistanceGroup", IntegerType()),
    StructField("CarrierDelay", DoubleType()), StructField("WeatherDelay", DoubleType()),
    StructField("NASDelay", DoubleType()), StructField("SecurityDelay", DoubleType()),
    StructField("LateAircraftDelay", DoubleType()), StructField("FirstDepTime", StringType()),
    StructField("TotalAddGTime", StringType()), StructField("LongestAddGTime", StringType()),
    StructField("DivAirportLandings", StringType()), StructField("DivReachedDest", StringType()),
    StructField("DivActualElapsedTime", StringType()), StructField("DivArrDelay", StringType()),
    StructField("DivDistance", StringType()),
    StructField("Div1Airport", StringType()), StructField("Div1AirportID", StringType()),
    StructField("Div1AirportSeqID", StringType()), StructField("Div1WheelsOn", StringType()),
    StructField("Div1TotalGTime", StringType()), StructField("Div1LongestGTime", StringType()),
    StructField("Div1WheelsOff", StringType()), StructField("Div1TailNum", StringType()),
    StructField("Div2Airport", StringType()), StructField("Div2AirportID", StringType()),
    StructField("Div2AirportSeqID", StringType()), StructField("Div2WheelsOn", StringType()),
    StructField("Div2TotalGTime", StringType()), StructField("Div2LongestGTime", StringType()),
    StructField("Div2WheelsOff", StringType()), StructField("Div2TailNum", StringType()),
    StructField("Div3Airport", StringType()), StructField("Div3AirportID", StringType()),
    StructField("Div3AirportSeqID", StringType()), StructField("Div3WheelsOn", StringType()),
    StructField("Div3TotalGTime", StringType()), StructField("Div3LongestGTime", StringType()),
    StructField("Div3WheelsOff", StringType()), StructField("Div3TailNum", StringType()),
    StructField("Div4Airport", StringType()), StructField("Div4AirportID", StringType()),
    StructField("Div4AirportSeqID", StringType()), StructField("Div4WheelsOn", StringType()),
    StructField("Div4TotalGTime", StringType()), StructField("Div4LongestGTime", StringType()),
    StructField("Div4WheelsOff", StringType()), StructField("Div4TailNum", StringType()),
    StructField("Div5Airport", StringType()), StructField("Div5AirportID", StringType()),
    StructField("Div5AirportSeqID", StringType()), StructField("Div5WheelsOn", StringType()),
    StructField("Div5TotalGTime", StringType()), StructField("Div5LongestGTime", StringType()),
    StructField("Div5WheelsOff", StringType()), StructField("Div5TailNum", StringType())
])

That’s a lot of lines, but it’s a complete schema for the Airline On-Time Performance data set. In order to leverage this schema to create one data frame for each CSV file, the next cell should be:

import itertools

year_list = ['2005','2006','2007','2008','2009','2010','2011','2012','2013','2014','2015']
month_list = ['1','2','3','4','5','6','7','8','9','10','11','12']

air_df_dict = {}

print('Gathering files ...')
for (year_str, month_str) in list(itertools.product(year_list, month_list)):
    year_month_str = '%s_%s'%(year_str, month_str)
    print('%s, '%(year_month_str), end="")
    air_df_dict[year_month_str] = spark.read.csv(
        'qfs://master:20000/user/michael/data/airline/On_Time_On_Time_Performance_%s.csv'%(year_month_str),
        header=True,
        schema=air_schema,
        escape='"')
print('Done!')

What this cell does is iterate through every possible year-month combination for our data set, and load the corresponding CSV into a data frame, which we save into a dictionary keyed by the year-month. However, these data frames are not in the final form I want. There are a number of columns I am not interested in, and I would like the date field to be an actual date object.

from datetime import datetime
from pyspark.sql.functions import col, udf, unix_timestamp
from pyspark.sql.types import DateType
from pyspark import StorageLevel

convertToDate = udf(lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType())

airline_data_parts = []

# Should really coalesce to 1 here, but that strains the ODROID XU4 cluster too
# much.
print('Processing ', end="")
for year_month_str, air_df in air_df_dict.items():
    print('%s, '%(year_month_str), end="")
    airline_data = air_df.select(
        "Year","Quarter","Month","DayofMonth","DayOfWeek","FlightDate","UniqueCarrier","AirlineID",
        "Carrier","TailNum","FlightNum","OriginAirportID","OriginAirportSeqID","OriginCityMarketID",
        "Origin","OriginCityName","OriginState","OriginStateFips","OriginStateName","OriginWac",
        "DestAirportID","DestAirportSeqID","DestCityMarketID","Dest","DestCityName","DestState",
        "DestStateFips","DestStateName","DestWac","CRSDepTime","DepTime","DepDelay","DepDelayMinutes",
        "DepDel15","DepartureDelayGroups","DepTimeBlk","TaxiOut","WheelsOff","WheelsOn","TaxiIn","CRSArrTime",
        "ArrTime","ArrDelay","ArrDelayMinutes","ArrDel15","ArrivalDelayGroups","ArrTimeBlk","Cancelled",
        "CancellationCode","Diverted","CRSElapsedTime","ActualElapsedTime","AirTime","Flights","Distance",
        "DistanceGroup","CarrierDelay","WeatherDelay","NASDelay","SecurityDelay","LateAircraftDelay"
    ).withColumn(
        'FlightDate', convertToDate(col('FlightDate'))
    )
    airline_data_parts.append(airline_data)
print('Done!')

Finally, we need to combine these data frames into one partitioned Parquet file. The way to do this is to use the union() method on the data frame object, which tells Spark to treat two data frames (with the same schema) as one. This method doesn't necessarily shuffle any data around; it simply logically combines the partitions of the two data frames. Since we have 132 files to union, this has to be done incrementally. Once we have combined all the data frames into one logical set, we write it to a Parquet file partitioned by Year and Month. Note that this is a two-level partitioning scheme. Since the sourcing CSV data is effectively already partitioned by year and month, what this operation effectively does is pipe each CSV file through a data frame transformation and then into its own partition in a larger, combined data frame. No shuffling to redistribute data occurs.

master_data = airline_data_parts[0]
print('Unionizing data frames 0, ', end="")
for i in range(1, len(airline_data_parts)):
    print('%d, '%(i), end="")
    master_data = master_data.union(airline_data_parts[i])
print(" Done!")

print('Starting export to HDFS...')
master_data.write.partitionBy(
    "Year","Month"
).parquet(
    'qfs://master:20000/user/michael/data/airline_data',
    mode='overwrite'
)
print('Done!')

On my ODROID XU4 cluster, this conversion process took a little under 3 hours. However, the one-time cost of the conversion significantly reduces the time spent on analysis later. While we are certainly jumping through some hoops to allow the small XU4 cluster to handle some relatively large data sets, I would assert that the methods used here are just as applicable at scale. You always want to minimize the shuffling of data; things just go faster when this is done. Therein lies why I enjoy working out these problems on a small cluster: it forces me to think through how the data is going to get transformed, which in turn helps me understand how to do it better at scale.

The last step is to convert the two metadata files that pertain to airlines and airports into Parquet files to be used later. These files are included with either of the data sets above.

from pyspark.sql import Row

def mapAirlineIdRow(r):
    airline_id = int(r.Code)
    airline_name_parts = r.Description.split(':')
    airline_name = airline_name_parts[0].strip()
    iata_carrier = airline_name_parts[1].strip()
    out = Row(
        AirlineID=airline_id,
        AirlineName=airline_name,
        Carrier=iata_carrier
    )
    return out

airline_id_csv = spark.read.csv(
    'qfs://master:20000/user/michael/data/airline/airline-id-lookup-table.csv',
    header=True,
    escape='"'
)
airline_id_df = airline_id_csv.rdd.map(mapAirlineIdRow).toDF().coalesce(1)
airline_id_df.write.parquet(
    'qfs://master:20000/user/michael/data/airline_id_table',
    mode='overwrite'
)
airline_id_df.take(1)

airport_schema = StructType([
    StructField("Code", StringType()),
    StructField("Description", StringType()),
])

def mapAirportIdRow(r):
    airport_id = r.Code
    airport_city = ''
    airport_name = ''
    airport_name_parts = r.Description.split(':')
    if len(airport_name_parts) == 2:
        airport_city = airport_name_parts[0].strip()
        airport_name = airport_name_parts[1].strip()
    elif len(airport_name_parts) == 1:
        airport_city = airport_name_parts[0]
        airport_name = r.Code
    out = Row(
        AirportID=airport_id,
        City=airport_city,
        Name=airport_name
    )
    return out

airport_id_csv = spark.read.csv(
    'qfs://master:20000/user/michael/data/airline/airport-information.csv',
    header=True,
    escape='"',
    schema=airport_schema
)
airport_id_df = airport_id_csv.rdd.map(mapAirportIdRow).toDF().coalesce(1)
airport_id_df.write.parquet(
    'qfs://master:20000/user/michael/data/airport_id_table',
    mode='overwrite'
)
airport_id_df.take(1)
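The Description parsing in the two map functions above assumes lookup-table strings of the colon-delimited "name: code" form. The split-and-strip logic can be exercised in plain Python; the sample strings below are illustrative, not taken from the actual lookup files:

```python
# Hypothetical lookup-table descriptions in the colon-delimited form the
# map functions expect.
airline_description = "United Air Lines Inc.: UA"
airport_description = "Denver, CO: Denver International"

def split_description(description):
    # Mirrors the split(':') + strip() logic of mapAirlineIdRow/mapAirportIdRow.
    return [part.strip() for part in description.split(':')]

print(split_description(airline_description))  # name first, carrier code second
print(split_description(airport_description))  # city first, airport name second
```

This is also why mapAirportIdRow checks the number of parts: a description with no colon yields a single-element list, and the function falls back to using the airport code as the name.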

All this code can be found in my Github repository here.

Next I will be walking through some analyses of the data set.