I have a huge interest in Data Lakes, especially when it comes to the query engines that are capable of querying cloud object stores, such as Spark, Presto, Hive, and Drill, among others. With that being said, when Google Cloud announced that BigQuery has beta support for querying Parquet and ORC files in Google Cloud Storage, it really piqued my curiosity. Thus, I had no choice but to find a large dataset in Parquet format and try to query it with BigQuery. Sounds easy enough, right?

To get started, I needed to find a large Hive Partitioned dataset to use. After some quick digging and searching, I wasn't able to find one, so the only logical thing to do was to create my own. One of my favorite features of BigQuery is that it has tons of public datasets available for this sort of thing. Since I spend most of my days working in NYC, I have always found the NYC taxi data particularly interesting, so I thought, why not start there? For reference, the name of the dataset is bigquery-public-data:new_york_taxi_trips. This dataset contains taxi rides partitioned by taxi company and year. For the purposes of this post, I will be using the tlc_yellow_trips_2018 table because it is the most recent and has nearly 18 GB of raw data.
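If you want to size the table up yourself before moving anything, the bq command line tool can print its schema, row count, and total bytes. A quick, optional check (the exact numbers will drift as the public dataset is refreshed):

bq show bigquery-public-data:new_york_taxi_trips.tlc_yellow_trips_2018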

Creating Hive Partitioned Data in GCS using Spark and BigQuery

With an interesting table in mind, the next step is to create a Hive Partitioned version of it on Google Cloud Storage in the Parquet format. There are countless ways to handle this; again, for the purposes of this post, I decided to use a simple Spark Shell script running on a Cloud DataProc cluster.

The first step is to spin up a Cloud DataProc cluster using the gcloud command line:

gcloud dataproc clusters create cluster-taxidata-extractor \
    --region us-central1 \
    --subnet default \
    --zone us-central1-a \
    --master-machine-type n1-standard-4 \
    --master-boot-disk-size 500 \
    --num-workers 2 \
    --worker-machine-type n1-standard-4 \
    --worker-boot-disk-size 500 \
    --image-version 1.3-deb9 \
    --project ${PROJECT_ID}

The above command will spin up a 3-node cluster, with each node having 4 vCPUs and 15 GB of memory, which provides YARN with 8 cores and 24 GB of memory across the two workers. This seems like more than enough horsepower for the task. Once the cluster is operational, SSH in with the following command:

gcloud compute ssh ${HOSTNAME} --project=${PROJECT_ID} --zone=${ZONE}

Once in, we need to spin up the Spark Shell, but with one caveat: the spark-bigquery-connector has to be added to its environment. This is necessary in order to leverage the latest and greatest BigQuery Storage API. The final command is:

spark-shell --jars gs://spark-lib/bigquery/spark-bigquery-latest.jar
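Before settling on a partition key, it helps to poke at the schema straight from the shell. Here is a minimal, optional sketch assuming the same public table (the trips name is just for illustration; if pasting into spark-shell, use :paste so the chained calls are read as one expression):

val trips = spark.read.format("bigquery")
  .option("table", "bigquery-public-data:new_york_taxi_trips.tlc_yellow_trips_2018")
  .load()

trips.printSchema()  // note: pickup_datetime and dropoff_datetime come back as string, not timestamp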

Once in, it is easy to start exploring the data to figure out the best way to partition it. After a bit of review, I decided the most logical way to partition it is by ride date. There are two fields that can be used to achieve this: either pickup_datetime or dropoff_datetime. I decided to use pickup_datetime, taking into account that some rides may start on one day and end on another (e.g., 11:45 PM to 12:30 AM); these rides will be counted on the day they originated. There is one wrinkle in this decision, namely, the spark-bigquery-connector doesn't have a native type to cast BigQuery's DATETIME into, so it simply casts it to a STRING, which is not very useful as a partition key. Thus, custom code is needed to perform the cast from a DATETIME to a DATE. The final code looks something like:

spark.read.format("bigquery")

.option("table", "bigquery-public-data:new_york_taxi_trips.tlc_yellow_trips_2018")

.load()

.withColumn("trip_date", (col("pickup_datetime").cast("date")))

.write.partitionBy("trip_date")

.format("parquet")

.save("gs://${PROJET_ID}/new-york-taxi-trips/yellow/2018")

The above code is pretty self-explanatory, but let's quickly walk through it anyway. First, we load the table from BigQuery into a DataFrame; next comes the cast mentioned above, followed by the partitioning information and file format; finally, we save it. After hitting return, I went to grab a coffee. By the time I got back to my desk, I had 18 GB of taxi ride data partitioned by trip_date sitting in my GCS bucket. That was easy!

For reference, the files should look something like:

❯ gsutil ls gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018

gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/_SUCCESS
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-01/
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-01/UUID.snappy.parquet
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-01/UUID.snappy.parquet
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-02/
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-02/UUID.snappy.parquet
gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018/trip_date=2018-01-02/UUID.snappy.parquet
...
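As an optional sanity check while the cluster is still up, the partitioned output can be read back in the same Spark Shell to confirm that trip_date was picked up as a proper partition column. A minimal sketch, assuming the same bucket path as above (the partitioned name is just for illustration):

val partitioned = spark.read.parquet("gs://${PROJECT_ID}/new-york-taxi-trips/yellow/2018")

partitioned.printSchema()  // trip_date should now be inferred as a date column from the directory names
partitioned.groupBy("trip_date").count().orderBy("trip_date").show(5)  // row counts for the first few days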

Pro Tip: This would be a great time to shut down the DataProc cluster since it is no longer needed.
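Something like the following should do it, assuming the same cluster name and region used above:

gcloud dataproc clusters delete cluster-taxidata-extractor \
    --region us-central1 \
    --project ${PROJECT_ID}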