Over the past year, we’ve been working hard to develop a new data pipeline to streamline our data flows and analytics across a growing set of applications and services. One of the key components of this infrastructure is the data warehouse: our source of truth for answering important questions about the business, which people access through a combination of Looker for data exploration and Excel for reporting.

Initially, our data warehouse was a Postgres server running on AWS. A single instance hosted extracts of several production datasets that were sourced from multiple database servers via batch dumps and imports. At the time, this setup enabled us to get up and running because our engineers were familiar with it. Since Postgres was already the primary database used by our applications, it was easy to leverage many of the same tools used for administration and debugging.

However, as data volume and complexity grew rapidly, the warehouse became increasingly difficult to scale. Our data warehousing queries and tooling assumed a single database instance, which limited us to vertical scaling. We could buy time by choosing ever-larger server classes and purchasing more provisioned IOPS for storage throughput, but only for so long. Without the ability to shard tables or leverage any kind of parallelism, many of our queries took hours to complete and quickly became bottlenecks for important, time-sensitive analysis. We needed a faster, more scalable, longer-term solution that could adapt to the needs of the business and its data-hungry analysts.

Ultimately, we chose Google’s BigQuery for its simplicity as a fully managed solution, its cloud-friendly cost structure, and its proven capability as an interactive analytics store. Notably, BigQuery is a columnar store that performs a full table scan for every query, without using indices. At first glance this might seem counterintuitive, but BigQuery achieves interactive speeds by pushing each query down a massively parallel distributed execution tree spanning thousands of machines and then aggregating the results, usually on the order of seconds.

So far, we’ve been very impressed with BigQuery. Many of our queries that used to take several hours now finish in less than a minute. In addition, BigQuery’s support for standard SQL:2011, built-in date partitioning, and flexible schemas has helped minimize switching costs. Moreover, it has enabled our teams to focus on data quality and on building better abstractions for analytics.

Loading Data into BigQuery

At a high level, our data flow looks something like this:

[Diagram: Postgres follower → Airflow ETL → Google Cloud Storage → BigQuery, with event data flowing in parallel from Kafka into BigQuery]

With a dynamic business like ours, it’s important for our analytics teams to be able to view and analyze changing dimensions over time, such as drop-off curves for registration and order rate. This is often difficult to achieve because our Postgres application instances store only the current state of our entity data models and don’t retain a log of state changes over time. To work around this limitation, we take full snapshots of our entity models at regular intervals, preserving the change data necessary for time-oriented analyses. Prior to BigQuery, maintaining these snapshots would have been prohibitively expensive in terms of storage on a single server. With BigQuery, we never have to worry about running low on server disk space or manually deleting old data, since storage is very cheap.

To load our Postgres data into BigQuery, we use Airflow to run a series of ETL jobs every 12 hours that extract raw CSV dumps from our follower instances and write them to Google Cloud Storage (GCS) as GZIP-compressed files. During this process, we also dump the current schema of each table as a JSON file, which is required when loading data into BigQuery. Once the data and schema files are available in GCS, they are imported into BigQuery using the Jobs API, which asynchronously copies resources directly from GCS.

Using Airflow, we implement this data flow programmatically as a directed acyclic graph (DAG) of tasks, with each task responsible for a point-to-point data transfer. In the diagram above, a single Airflow task dumps the data and schema information from our Postgres follower into GCS, and a subsequent task loads this data from GCS into BigQuery.
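As a rough sketch of what such a DAG looks like, using the operators from Airflow’s Google provider package (the connection IDs, bucket, dataset, and table names here are illustrative, not our actual configuration):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="postgres_to_bigquery_orders",
    start_date=datetime(2017, 1, 1),
    schedule_interval=timedelta(hours=12),  # one snapshot every 12 hours
    catchup=False,
) as dag:
    # Task 1: dump the table and its JSON schema from the Postgres follower into GCS.
    dump_to_gcs = PostgresToGCSOperator(
        task_id="dump_orders_to_gcs",
        postgres_conn_id="postgres_follower",
        sql="SELECT * FROM orders",
        bucket="example-warehouse-bucket",
        filename="orders/{{ ds_nodash }}/orders_{}.csv.gz",  # {} is filled in if the dump is split into chunks
        schema_filename="orders/{{ ds_nodash }}/schema.json",
        export_format="csv",
        gzip=True,
    )

    # Task 2: load the dump from GCS into the matching date partition in BigQuery.
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_orders_to_bigquery",
        bucket="example-warehouse-bucket",
        source_objects=["orders/{{ ds_nodash }}/orders_*.csv.gz"],
        schema_object="orders/{{ ds_nodash }}/schema.json",
        destination_project_dataset_table="warehouse.orders${{ ds_nodash }}",
        source_format="CSV",
        skip_leading_rows=1,  # the CSV dumps include a header row
        write_disposition="WRITE_TRUNCATE",  # re-running a day overwrites only that day's partition
    )

    dump_to_gcs >> load_to_bq

Because each run writes to a date-stamped GCS path and truncates only its own partition, re-running a day is idempotent, which keeps backfills safe.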

Another important data source we collect is event data, which we use to analyze user behavior at a more granular level. In contrast to entities, events capture actions performed by entities at specific points in time and are often denormalized with nested and repeated data. Our applications currently publish event data to Kafka, which can route it to any number of downstream consumers. Our streaming integration is greatly simplified by a BigQuery sink connector built on the Kafka Connect framework, which we’ll discuss in part two of this series.

Partitioning Tables

A crucial differentiator from Postgres is BigQuery’s built-in support for date-based partitioning. Dividing a table into smaller day-based partitions reduces the number of bytes a query processes by restricting it to specific dates. Before this feature was available, you had to manually shard a single table into multiple daily tables; queries became more complex because data was spread across many tables, and performance degraded as the number of referenced tables grew. Now, data is stored in one table and divided into segments through a pseudo column named _PARTITIONTIME, which contains the day each row was loaded into the table.

Date partitioning must be enabled when a table is created, which we usually handle ahead of time by invoking the following command:

bq mk --time_partitioning_type=DAY dataset.table

Once a table has been set up this way, writing date-partitioned data is as simple as appending the formatted date string to the destination table ID:

{dataset}.{table}$YYYYMMDD
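For example, loading a snapshot into the partition for December 1, 2016 might look like the following (the bucket and file names are illustrative; the quotes keep the shell from expanding the $):

bq load --source_format=CSV 'dataset.table$20161201' gs://bucket/orders-20161201.csv.gz ./schema.json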

This also makes it easy to run queries directly against individual partitions by adding the partition suffix. For example, we can run the following query:

SELECT … FROM table$20161201

which is equivalent to

SELECT … FROM table WHERE _PARTITIONTIME = TIMESTAMP("2016-12-01")

In addition to being able to look up partitions directly by time, we can use date partitioning to join information across time, augmenting historical data with current data. A real-world example is the way we track the movement of scheduled boxes across facilities as they approach their actual ship date. Between the time an order is scheduled and the time it is received, any number of changes can occur: boxes move between facilities, and users skip or un-skip their deliveries. We can conveniently query these partitions to generate a time-based view that displays box counts for each day, as shown in the following example:
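The exact query matters less than the shape of it; a sketch with hypothetical table and column names might look like:

SELECT
  DATE(_PARTITIONTIME) AS snapshot_date,
  facility,
  COUNT(*) AS box_count  -- boxes scheduled at each facility as of that day's snapshot
FROM dataset.scheduled_boxes
WHERE _PARTITIONTIME BETWEEN TIMESTAMP("2016-12-01") AND TIMESTAMP("2016-12-07")
GROUP BY snapshot_date, facility
ORDER BY snapshot_date, facility

Each partition contributes one day’s snapshot, so the result shows how box counts shift between facilities as the ship date approaches.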

Date partitioning has also greatly simplified our load process by allowing us to collect snapshot data without worrying about the mechanics of sharding. Partitioning by time lets us take advantage of long-term pricing as well: partitions that haven’t been modified for 90 days are billed at roughly 50% lower storage rates, without sacrificing query performance.

Managing Schema Updates

One of the biggest challenges for our ETL pipeline is staying resilient to schema updates that occur upstream. BigQuery is able to handle schema updates in cases where new data fields are added or existing required fields are relaxed. The schema of the destination table can be updated dynamically by passing the following configuration options when the BigQuery load job is invoked:

ALLOW_FIELD_ADDITION: Allows adding a nullable field to the schema.

ALLOW_FIELD_RELAXATION: Allows relaxing a required field in the original schema to nullable.

When new fields are added to or removed from upstream tables, BigQuery marks the affected columns as nullable so that existing data conforms to the new schema. This offers us a great deal of flexibility because we don’t have to perform any schema or data migrations ourselves.
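With the BigQuery Python client, for instance, passing these options to a load job looks roughly like this (the bucket and table names are again illustrative):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Allow the load job to widen the destination schema when upstream changes.
    schema_update_options=[
        bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
        bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
    ],
)

load_job = client.load_table_from_uri(
    "gs://example-warehouse-bucket/orders/20161201/orders_0.csv.gz",
    "warehouse.orders$20161201",
    job_config=job_config,
)
load_job.result()  # block until the load completes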

Not all schema changes can be handled gracefully; column renames or semantic changes to the underlying data model, for example, have no automated solution. These changes are best handled through coordinated migration strategies or database views. In a future post, we’ll discuss some strategies we’ve adopted to help mitigate the downstream impact of schema changes.
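As a quick illustration of the view approach: suppose an upstream model renames a column. A view can re-expose it under the old name so that downstream queries keep working (all names here are hypothetical):

CREATE OR REPLACE VIEW warehouse.orders AS
SELECT
  * EXCEPT (facility_code),
  facility_code AS facility  -- preserve the old column name for existing consumers
FROM warehouse.orders_raw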

Final Thoughts

Overall, we’ve been incredibly happy with BigQuery. It allows us to effortlessly handle large amounts of data and to scale query processing in ways we couldn’t with other solutions. As our business and data keep growing, we’re confident that BigQuery will continue to deliver a high-quality interactive analytics experience, so our teams can spend more time digging for insights rather than waiting hours for answers.

Stay tuned for part two of this series, where we’ll further explore how we think about, process, and ingest streaming event data into BigQuery to enable advanced analytics for personalization and beyond.