Prologue

About a year ago, a small team at REA – mostly data engineers and data scientists – was entrusted with designing a new enterprise Data Warehouse for REA in BigQuery. Some of us still wore battle scars from previous endeavours operating data warehouses. Even so, as with any new venture, this one filled us with a fresh thrill and energised us with the prospect of getting everything “just right” this time in our brand new stack. The rest of this story cuts to the present and offers an anthology of technical and strategic decisions, narrated in a retrospective tone.

Present Day

A logical view of our current Data Warehouse in BigQuery:

Context

Since we are dealing with ambiguous (for some, even contentious) industry jargon, let us begin by establishing exactly what these terms mean in our context:

Definitions

Data Lake

A single store of all enterprise data in its rawest form. The essence of a data lake is to serve as a staging area where data publishers stream or sink data before any complex ETL (Extraction, Transformation, Load) is applied. This data is usually raw consumer events/logs from IoT devices, IP logs etc. The storage systems for data lakes have evolved over the last few years. The most common candidates are distributed file systems like S3/GCS or HDFS-based storage, and more recently BigQuery’s native storage engine. The data in a data lake typically comes with absolutely no guarantees with respect to schemas, uniqueness constraints, or other bespoke cleaning heuristics.

Data Mart

A data mart is usually a subset of the data warehouse and is often oriented to a specific business unit. The data in the data lake is raw and untamed, which makes it unwieldy and less appealing for general consumption. Data marts exist to solve this problem. They facilitate broader access within a business unit by abstracting away local cleaning rules and aggregations into a TL (Transformation and Load) process. The end result is analogous to a shop front where analysts are able to consume data on the various domains they care about. The important distinction here is that the cleaning rules and aggregations applied are local to the business unit.

Data Warehouse

A Data Warehouse is a collection of Data Marts. There are two competing schools of thought on how a data warehouse should evolve. The Inmon approach (proposed by Bill Inmon) suggests that a data warehouse should be designed first, as a foundation on which subsequent use cases from different business units can be built out – i.e. get all your foundation data created before building the next layer of data marts. This approach runs contrary to how we would normally run an agile project, putting a larger emphasis on upfront design.

The second big idea is almost the exact opposite – proposed by Ralph Kimball – called the Kimball approach. In Kimball’s view, organisations would first create data marts based on local needs and requirements and then combine and adapt them for global use across the organisation. Thankfully, both of them have a solid agreement on what the term “data warehouse” actually means: a Data Warehouse is a collection of Data Marts.

Things we reflected on

Reflection 1: Do we need a Data Lake?

Premise

In a classical data warehouse architecture, as the raw data arrives, it is immediately cleaned and organised into carefully designed schemas. These cleaning rules are usually opaque to data consumers (analysts). Modern data architectures, however, often feature a data lake, which offers data consumers a choice to access and use untouched raw data for analytical purposes. The backdrop of this discussion also included an earlier decision that we had made to use BigQuery as our platform of choice for building the foundations of our data warehouse.

Conclusion

We ran an “interested parties review” internally with data consumers and publishers and the choice was unanimous: we want a data lake. From the data consumers’ perspective, they were thrilled to have a choice to access both raw and clean data. On the other hand, data publishers liked the idea of a “staging area” for data before it gets processed into other parts of the data warehouse – which in some ways, simplified their data architecture by decoupling E (Extraction) from TL (Transformation and Load).

Data storage costs also play a big role when deciding whether or not to have a data lake. BigQuery’s pricing model with relatively low storage costs and automatic life-cycle policies helped us cement that decision.
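To give a feel for the storage economics (the per-GB prices below are illustrative assumptions in line with BigQuery’s published pricing model of cheaper “long-term” storage for untouched tables, not authoritative figures):

```python
# Rough monthly storage cost estimate for a data lake kept in BigQuery.
# Prices are illustrative assumptions: active storage ~$0.02/GB/month,
# long-term storage (tables untouched for ~90 days) ~$0.01/GB/month.
ACTIVE_PRICE_PER_GB = 0.020
LONG_TERM_PRICE_PER_GB = 0.010

def monthly_storage_cost(active_gb: float, long_term_gb: float) -> float:
    """Estimate the monthly storage cost in USD."""
    return active_gb * ACTIVE_PRICE_PER_GB + long_term_gb * LONG_TERM_PRICE_PER_GB

# e.g. 2 TB of recent raw events plus 10 TB of cold history:
cost = monthly_storage_cost(active_gb=2_000, long_term_gb=10_000)  # 140.0 USD/month
```

At that scale the whole raw history costs less per month than a single analyst’s hour, which is what made “keep everything” an easy call for us.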

Reflection 2: Gated Data Lake

Premise

Right from the start, implementing an orthodox data lake was never an appealing prospect to me as data lakes are primarily driven by data publishers. Data lakes are often perceived by publishers as a convenient destination to sink some data in a hurry. For me, this conjures up a time-lapse image of something that starts off as a blue pristine lake and gradually fades into a cyber-punk grey swamp, complete with half-sunken oil drums.

Conclusion

We wanted data publishers to own some responsibility for what they publish. We introduced an idea called the “Gated Data Lake” (GDL) – a similar concept to a Data Lake but with some rules of engagement. The rules are as follows:

1. All data published into the GDL is stored in BigQuery tables.
2. All data published to the GDL is immutable.
3. All tables in the GDL have semantically versioned schemas.
4. All table schemas, along with comprehensive documentation of the data, are in source control (accessible REA-wide).
5. There is always one dedicated team that acts as a gate-keeper for the above rules.

By taking away some flexibility from the data publishers, we get the benefit of a data lake that data consumers can rely on. The data is still as raw as possible; no cleaning rules have been applied to it. This makes it appealing to data consumers who want to side-step global cleaning rules, either because they disagree with them or because they simply need to get to the raw source of truth. Data consumers were also reassured by the promise that the shape of the data would not change under them without notice.

Data publishers, on the other hand, pay a small price each time they change the shape of the data (in the form of a pull request with some documentation). But since schema changes can be broadcast in advance, publishers can be assured that they have not inadvertently broken an analyst’s report downstream.
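The semantic-versioning rule lends itself to automation. Here is a minimal sketch of the kind of check a gate-keeper could run against a schema pull request (the field names, rules, and helper functions are invented for illustration; this is not our actual tooling):

```python
# Schemas are modelled as {field_name: type}; versions as (major, minor, patch).
# Semver convention assumed: breaking change -> major bump,
# additive change -> minor bump, everything else -> patch.

def required_bump(old_schema: dict, new_schema: dict) -> str:
    """Classify a schema change by the smallest version bump it requires."""
    breaking = any(
        name not in new_schema or new_schema[name] != ftype
        for name, ftype in old_schema.items()
    )
    if breaking:
        return "major"
    if any(name not in old_schema for name in new_schema):
        return "minor"
    return "patch"

def bump_is_valid(old_version: tuple, new_version: tuple, bump: str) -> bool:
    """Check the declared new version is at least as big as the required bump."""
    major, minor, _ = old_version
    if bump == "major":
        return new_version[0] > major
    if bump == "minor":
        return new_version[:2] > (major, minor)
    return new_version > old_version

old = {"listing_id": "STRING", "price": "INT64"}
# Dropping a column is a breaking change, so v1.2.0 -> v2.0.0 is required:
assert required_bump(old, {"listing_id": "STRING"}) == "major"
assert bump_is_valid((1, 2, 0), (2, 0, 0), "major")
```

Wiring a check like this into CI turns the “gate” from a social convention into something a pull request cannot merge past.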

Reflection 3: Star Schema or Snowflake Schema?

Premise

There are two popular schema designs for building data marts – Star and Snowflake.

As is common with most design choices, they make different trade-offs and are optimised for different needs. Star is easy to work with (less normalisation) but sacrifices some data integrity and storage efficiency. Snowflake preserves data integrity and optimises storage very well but is harder to query (dimensions are normalised out into many smaller tables).

Conclusion

Some of the classical data warehousing paradigms just do not cross over well to BigQuery. BigQuery was designed by Google with specific trade-offs in mind. For example, trying to implement a traditional third-normal-form table structure in BigQuery would often lead to worse performance and is generally not recommended (joins across many large tables are expensive). With features such as nested and repeated fields, and low storage costs, BigQuery makes a compelling case for keeping tables as denormalised as possible.
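To make the nested-and-repeated idea concrete, here is a small sketch (the field names are invented for illustration): a denormalised row carries its child records inline, and fanning them out works like BigQuery’s UNNEST rather than a join against a separate dimension table.

```python
# A denormalised "listing" row with a repeated nested field, in the spirit
# of BigQuery's STRUCT/ARRAY columns. Field names are illustrative only.
listing = {
    "listing_id": "L-42",
    "suburb": "Richmond",
    "inspections": [  # repeated nested records, stored inline with the parent
        {"date": "2019-05-04", "attendees": 12},
        {"date": "2019-05-11", "attendees": 9},
    ],
}

def unnest(rows, repeated_field):
    """Flatten a repeated field, pairing each child with its parent's
    scalar columns - roughly what UNNEST does in BigQuery standard SQL."""
    for row in rows:
        parent = {k: v for k, v in row.items() if k != repeated_field}
        for child in row[repeated_field]:
            yield {**parent, **child}

flat = list(unnest([listing], "inspections"))
# Two flat rows, each carrying listing_id and suburb alongside the
# inspection's own date and attendees - no join required.
```

Because the children live inside the parent row, the query engine never has to shuffle data to match keys, which is exactly the trade-off BigQuery’s storage engine is built around.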

As discussed earlier, with our decision to go with the “Gated Data Lake”, the raw data that comes in is immutable; without the need to update or delete rows, normalisation of the data becomes less and less attractive. It is fair to say the jury is still out on the type of schema design we will end up using, but Star is the direction we are leaning towards.

Reflection 4: Chicken or egg? Data Marts first or Data Warehouse first?

Premise

As highlighted in the Data Warehouse definition, there are two approaches to evolving a data warehouse, each with plenty of trade-offs and benefits.

Conclusion

Conway’s law played a huge part in this decision. As an organisation, we are decentralised with a strong affinity for an “autonomy with cross-leverage” model. Building a data warehouse first and then adapting it for various business units would just not scale with how we operate. At the same time, we were mindful of the duplication and divergence that autonomous data marts are notorious for. The final decision was to lean on the Kimball approach: building out individual data marts as new use cases emerge. In parallel, using BigQuery’s federated access model, we serve up interesting data assets across the organisation in an effort to reduce duplication and promote convergence.

Epilogue

While I am about to publish this article with seemingly logical conclusions, there still are a couple of dangling plot lines that need resolving:

How do we allow a degree of autonomy without risking a “tragedy of the commons” as we evolve data marts? How do we foster a data-sharing culture that cuts across departmental boundaries with a minimum of friction?

Keep an eye out for the sequel.

Acknowledgements

This work would not have been possible without the tireless hard work and contributions of people across the organisation. Here are a few of the names that have had the most direct impact (in random order):