Data drives so much of what we do here at Grubhub. Every order, every dish you search for, every favorited restaurant tells us something about our diners and what they want to eat. But there are two types of data: the source data we gather and serve when a customer orders from one of our apps, and the compiled data our analysts use to improve our apps.

These two sets of data contain the same information but differ in form and scope. When you open up Grubhub.com looking to order pad thai, you see your favorite restaurants, search results for restaurants that serve pad thai, and your previous orders that may or may not have been pad thai. When an analyst looks at the data, they want to see when people order pad thai the most, which restaurants are most popular for pad thai, and which diners regularly order pad thai.

As you can see, the source data serves a single diner's request at a time and needs to serve it fast. Analytics data gathers data from many sources: not only different production systems, but also CRM systems, customer care systems, third-party analytics, and more. Pulling everything into one spot is very powerful, but it can be resource-heavy and slow.

To get from our source data to analytics data without affecting performance for our diners requires a data pipeline.

Building a data pipeline

For the last few years, the standard business practice has been to set up an ETL pipeline. The steps are:

Extract: take data from a source system.

Transform: modify the data in some way. This can be renaming fields, combining data, calculating metrics, reshaping the data to a different granularity, etc.

Load: save the data to a new (target) system.
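
To make the three steps concrete, here is a minimal sketch of an ETL job in Python. Everything in it is illustrative: the table names, fields, and SQLite connections are stand-ins, not our actual systems.

```python
import sqlite3                      # stand-in for your production and warehouse databases
from datetime import datetime, timezone

def extract(source_conn):
    """Extract: pull raw order rows out of the source system."""
    return source_conn.execute(
        "SELECT order_id, diner_id, total_cents, created_at FROM orders"
    ).fetchall()

def transform(rows):
    """Transform: rename fields, normalize timestamps, calculate a metric."""
    out = []
    for order_id, diner_id, total_cents, created_at in rows:
        out.append({
            "order_id": order_id,
            "diner_id": diner_id,
            "total_usd": total_cents / 100.0,             # calculated metric
            "ordered_at_utc": datetime.fromtimestamp(     # assumes epoch seconds
                created_at, tz=timezone.utc
            ).isoformat(),
        })
    return out

def load(warehouse_conn, rows):
    """Load: save the transformed rows into the target (warehouse) table."""
    warehouse_conn.executemany(
        "INSERT INTO fact_orders (order_id, diner_id, total_usd, ordered_at_utc) "
        "VALUES (:order_id, :diner_id, :total_usd, :ordered_at_utc)",
        rows,
    )
    warehouse_conn.commit()

if __name__ == "__main__":
    source = sqlite3.connect("production.db")      # placeholder source system
    warehouse = sqlite3.connect("warehouse.db")    # placeholder target system
    load(warehouse, transform(extract(source)))
```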

Not all companies need to build this kind of pipeline. If you are a young company or startup and still have questions about your product/market fit, do not spend any time setting up a sophisticated pipeline. All you need is a set of SQL scripts that run as a cron job against the production data during a low-traffic period, and a spreadsheet.
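
That early-stage setup can be as small as one script on a nightly cron schedule. The query, file names, and schedule below are purely illustrative:

```python
#!/usr/bin/env python3
# Run nightly from cron during a low-traffic window, e.g.:
#   0 3 * * *  /usr/local/bin/nightly_report.py
# Queries the production data and writes a CSV you can open in a spreadsheet.
import csv
import sqlite3  # stand-in for whatever production database you actually run

QUERY = """
    SELECT date(created_at) AS order_date, COUNT(*) AS orders
    FROM orders
    GROUP BY date(created_at)
"""

def main():
    conn = sqlite3.connect("production_replica.db")   # placeholder connection
    rows = conn.execute(QUERY).fetchall()
    with open("daily_orders.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["order_date", "orders"])
        writer.writerows(rows)

if __name__ == "__main__":
    main()
```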

But once your product has found a market, you’re going to want to grow your share of that market. It’s at this growth stage where laying down some more dedicated infrastructure for data pipelining and analytics starts to make sense. Your customers give you data every day; analytics helps you use that data to drive improvements to your product.

How your pipelines and analytics are constructed is going to vary with your use case and scale. However, there are some patterns to follow.

The start of your pipeline will be your customer-facing source data. We mostly use Cassandra, as it performs fast in high-availability environments while allowing seamless multi-region replication. But it doesn't offer the flexible indexing of a relational database, which means we tailor each table to each customer-facing query. Even if you use an indexed database like MySQL, you don't want these databases constantly bombarded with requests, potentially affecting customer experiences.
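
To show what tailoring a table to a query looks like in Cassandra, here is an illustrative sketch using the DataStax Python driver. The keyspace, table, host, and diner ID are all made up:

```python
import uuid
from cassandra.cluster import Cluster

# Placeholder contact point and keyspace.
session = Cluster(["cassandra-host-1"]).connect("diners")

# The table is designed around one question: "show this diner's recent orders."
# diner_id is the partition key, ordered_at the clustering column.
session.execute("""
    CREATE TABLE IF NOT EXISTS orders_by_diner (
        diner_id    uuid,
        ordered_at  timestamp,
        order_id    uuid,
        restaurant  text,
        PRIMARY KEY (diner_id, ordered_at)
    ) WITH CLUSTERING ORDER BY (ordered_at DESC)
""")

# The customer-facing read is a fast single-partition lookup, but this table
# answers only this one query; other access patterns get their own tables.
diner_id = uuid.uuid4()   # placeholder diner
rows = session.execute(
    "SELECT order_id, restaurant, ordered_at FROM orders_by_diner "
    "WHERE diner_id = %s LIMIT 10",
    (diner_id,),
)
```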

That’s where the first step, extract, comes in. Most companies set up some sort of scheduler to execute various pieces of code, called jobs. During slow periods, these jobs run and extract all of the data gathered during the day. Again, this data comes from production systems, CRM systems, third-party data providers, customer care interactions, and more.

Once you’ve extracted that data, another set of jobs comes through and transforms it using any amount of business logic. Personally, I think transform is the most critical step in the entire process. If you take enough care to be consistent in naming, provide accurate timestamps (in the right timezones!), filter out junk data, obfuscate sensitive information, calculate baseline metrics, and provide enough numbers for new metrics to be calculated, the discoverability and usability of the data in the warehouse increases tenfold. In turn, this increases the efficiency of everybody analyzing the data in the warehouse.
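
Here is a hedged sketch of what those transform rules can look like for a single record. The field names, junk filter, and hashing salt are invented for illustration:

```python
import hashlib
from datetime import datetime, timezone

SALT = "replace-with-a-real-secret"   # illustrative only; manage secrets properly

def transform_order(raw):
    """Apply the business logic described above to one raw order record."""
    # Filter out junk data, e.g. internal test orders.
    if raw.get("status") == "TEST" or raw.get("total_cents") is None:
        return None

    return {
        # Consistent naming: every table calls this field diner_id.
        "diner_id": raw["user_id"],
        # Obfuscate sensitive information before it lands in the warehouse.
        "diner_email_hash": hashlib.sha256((SALT + raw["email"]).encode()).hexdigest(),
        # Accurate timestamps, normalized to UTC (assumes epoch seconds).
        "ordered_at_utc": datetime.fromtimestamp(raw["created_at"], tz=timezone.utc).isoformat(),
        # Baseline metrics, plus the raw numbers needed to derive new ones.
        "subtotal_usd": raw["subtotal_cents"] / 100.0,
        "tip_usd": raw["tip_cents"] / 100.0,
        "total_usd": raw["total_cents"] / 100.0,
    }
```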

Finally, a job loads this transformed data into a data warehouse that your analysts can access when they need to. Because this warehouse doesn't face customers, it can afford to run slower queries.
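
If your warehouse is something like Amazon Redshift, the load step often amounts to staging a file on S3 and issuing a COPY. The connection details, bucket, and IAM role below are placeholders:

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# All connection details and S3 paths here are placeholders.
conn = psycopg2.connect(
    host="my-warehouse.example.com", port=5439,
    dbname="analytics", user="loader", password="REPLACE_ME",
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY fact_orders
        FROM 's3://example-bucket/fact_orders/2019-01-01.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
        CSV IGNOREHEADER 1;
    """)
```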

There are plenty of enterprise solutions doing ETL like this. The various cloud providers have offerings to help set this up, plus there are plenty of good open source tools that can be installed and customized to your environment.

The limits of traditional ETL

This pattern worked well for us for many years, and continues to work for hundreds of companies. It pulls all data into one place and provides excellent separation between the transactional system and the analytical system. However, there are some pitfalls to be aware of.

First and foremost, every piece of data that needs to flow into the data warehouse will have some sort of code that goes along with it. This can result in the following issues:

Long wait times and backlogs to get data from source into the data warehouse.

Changes at the source could break jobs, which impacts the business and can lead to developer burnout. Constantly fixing breakages is no fun.

Difficulties in QAing data accuracy against un-transformed raw data.

Typically this is done in batch mode, which means all the data is processed at once on some regular basis (typically daily, but it could be hourly or weekly). As the amount of data increases, this puts strain on the systems and can cause performance failures. When that happens, your data pipeline can be more trouble than it's worth.

You can address some of these pitfalls in a number of ways. The first is to get rid of the complex workflow engine that transforms the source data. Throw together a few SQL scripts, and/or use one of the various SaaS providers to pull data in from the source and put it in the warehouse. Transforms are then done at query time or built into the different views you set up.
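
In that setup the transform lives in the warehouse itself, for instance as a view that analysts query. A sketch, with invented schema and column names:

```python
# Transform-on-read: the raw table is loaded as-is, and the business logic
# lives in a warehouse view instead of a transform job.
CREATE_ORDERS_VIEW = """
    CREATE OR REPLACE VIEW analytics.orders AS
    SELECT
        user_id             AS diner_id,
        total_cents / 100.0 AS total_usd,
        created_at          AS ordered_at
    FROM raw.orders
    WHERE status != 'TEST';
"""

def create_views(warehouse_conn):
    """Re-create the analyst-facing views on any DB-API warehouse connection."""
    with warehouse_conn.cursor() as cur:
        cur.execute(CREATE_ORDERS_VIEW)
    warehouse_conn.commit()
```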

This eliminates the resource-intensive transformation portion of the ETL pipeline. Your data moves directly to the data warehouse, where it can be accessed as needed by analyst queries. No backlogs, no broken transform jobs, no QA nightmares. Just the data, all in one place.

For a company in the middle of a growth stage, this is a completely appropriate milestone. You may still have plenty of unknowns in the data and business model that make the transform queries change rapidly. Your source data may be changing just as rapidly. Your data pipeline needs to be flexible enough to adapt to this change. Using third-party services or some very generic scripts to pull in the data helps ensure the reliability of the pipeline.

However, as your business crystallizes, your metrics get more complex and the amount of data you have starts growing exponentially. You'll want to codify key metrics and data points in a standard fashion to maintain reliability across your growing organization. Just loading all of your data into a warehouse and hoping that the queries built on top of it remain consistent won't cut it. You'll need some way to manage the data before analysis.

Injecting a scheduler workflow before the data warehouse (as in the ETL flow above) could compromise pieces you have already built. So to get transform processes into your pipeline, you can add the scheduler and workflow engine after the raw data already sits inside a data store.
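
A minimal sketch of that pattern: a scheduled job runs after the raw data has already landed and writes curated, analyst-facing tables in a separate schema. The schemas, tables, and SQL are illustrative:

```python
# Runs on the scheduler (nightly, hourly, etc.) after raw data is already in
# the data store; it only reads raw.* tables and writes curated.* tables.
CURATE_ORDERS = """
    INSERT INTO curated.fact_orders (diner_id, total_usd, ordered_at_utc)
    SELECT user_id, total_cents / 100.0, created_at
    FROM raw.orders
    WHERE status != 'TEST'
      AND created_at >= %(since)s;
"""

def run_transform(warehouse_conn, since):
    """Incrementally curate raw orders that arrived since the last run."""
    with warehouse_conn.cursor() as cur:
        cur.execute(CURATE_ORDERS, {"since": since})
    warehouse_conn.commit()
```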

This kind of architecture has a lot of benefits:

Data governance becomes easier, as access controls can be different between the raw data and the transformed data.

The transformations allow for active development and pre-calculation of important business metrics.

Each piece in the pipeline is much easier to scale.

This pipeline isn't for everybody: there is definitely some complexity and cost associated with this architecture. You should have already reached some amount of scale before laying this out. However, it's a good pattern to keep in mind as you are setting up your pipeline, as your early decisions will make this kind of architecture either easy or difficult to set up later.

You can make these pipeline transitions much smoother by creating an explicit interface, usually an API contract, between your production application and your data. This lets you develop the application and the data pipeline in parallel, each against its own business needs. These contracts can make getting data in and out of your warehouse much smoother as your business needs grow and mature. Realtime/streaming data also becomes easier to adopt when your business is ready for it.
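
One lightweight way to express such a contract is a versioned event schema that both the application and the pipeline code against. Here is a sketch using a Python dataclass; the event name and fields are invented, and in practice the contract might live in Avro, Protobuf, or JSON Schema instead:

```python
from dataclasses import dataclass, asdict
from datetime import datetime

@dataclass(frozen=True)
class OrderPlacedV1:
    """Versioned contract between the production app and the data pipeline."""
    schema_version: int
    order_id: str
    diner_id: str
    restaurant_id: str
    total_cents: int
    placed_at_utc: datetime

def emit(event: OrderPlacedV1) -> dict:
    """Serialize the event; only fields in the contract ever leave the app."""
    return asdict(event)
```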

Without this explicit interface, your production schema becomes the de facto interface. At that point, you’re stuck with it without some serious rework. Development work can be slow as every change on both sides of the pipeline needs to account for the existing schema. The pipeline itself can become very brittle, as every change to application data can break the pipeline. As companies grow, more engineers means more opportunities to forget to communicate changes.

How we pipeline at Grubhub

At Grubhub, we currently have a pretty typical ETL pipeline with a homegrown ETL scheduler that populates Amazon Redshift. But we found this pipeline becoming cumbersome to maintain as we reached a certain scale, and it became difficult to track where transformations were happening. So we are transitioning to something that looks more like the last architecture presented above.

Our new architecture extracts data into a data lake on Amazon S3. Then our scheduler, Azkaban, transforms the data for analysis: it populates both the raw data lake and our data warehouse, for which we use Hive. We have an excellent team of data engineers who work closely with our business users to come up with a defined set of transformations to present as our standard business objects.

This allows us to decouple source ingestion from business-level transformations: one set of engineers concentrates on optimizing the flow from our production systems into Hive tables, while another works on optimizing the data for the business.
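
As an illustration of that decoupling (the bucket, schemas, and table names are made up), the ingestion side might expose raw S3 data as an external Hive table, while the business-facing side builds curated tables on top of it:

```python
# Ingestion side: register the files landed in the S3 data lake as a raw,
# external Hive table without changing them.
RAW_ORDERS_DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS raw.orders (
        order_id    STRING,
        diner_id    STRING,
        total_cents BIGINT,
        created_at  TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 's3://example-data-lake/raw/orders/';
"""

# Business side: a separate job builds the curated, analyst-facing table.
CURATED_ORDERS_DDL = """
    CREATE TABLE IF NOT EXISTS analytics.fact_orders
    STORED AS ORC AS
    SELECT
        diner_id,
        total_cents / 100.0 AS total_usd,
        created_at          AS ordered_at
    FROM raw.orders;
"""
```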

This also means that as services are spun up on our microservice architecture, access to the source data is available almost immediately. This allows for insights into the data before engineers transform it, mitigating a downside of traditional ETL, where business users would have to wait days or even weeks before measuring a new piece of the business.

In closing

ETL pipelines are constructed so we can get faster insights into the business from our data. This is true no matter the scale. The architecture of the pipeline will depend on the scale of the business as well as how far along the company is toward product/market fit. It's valuable to know all of these architectures so you can pick what is appropriate for you today, while also putting down building blocks for what may be coming in the future.

We're still learning and growing our pipeline here at Grubhub as well, and we've definitely learned from some mistakes along the way. We'd love to hear how you've architected your pipeline and to answer any questions you have about how you might approach yours today.