On the surface, it may seem like a lot has changed in recent years with regard to data collection, storage and warehousing. The rise of NoSQL, “Big Data”, graph and streaming technologies may appear to have changed the landscape, but some fundamentals remain.

In my current role, we use Amazon Redshift for our data warehousing. However, whether we built a traditional data warehouse using Oracle or a data lake in Hadoop, the core architecture would remain the same.

The core architecture boils down to some preprocessing plus three separate areas (schemas, if you’re using Redshift) called Staging, Master and Reporting. In this post I’ll talk through each in detail.
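
To make the layout concrete, here’s a minimal sketch of creating those three areas as Redshift schemas from Python. It assumes a psycopg2 connection (Redshift speaks the Postgres wire protocol), and all the connection details are hypothetical:

```python
import psycopg2  # Redshift is compatible with the Postgres wire protocol

# Hypothetical connection details; Redshift listens on port 5439 by default.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    dbname="warehouse", user="admin", password="...", port=5439,
)
with conn, conn.cursor() as cur:
    # One schema per area of the core architecture.
    for schema in ("staging", "master", "reporting"):
        cur.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
```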

Preprocessing

Unfortunately, not all data is created equal, but it’s data nonetheless and therefore carries value.

In order to deal with the complexities of external data, some preprocessing is almost inevitable, especially when collecting from a number of different sources. The main goal of the preprocessing step is to get the data into a consistent format that can be loaded into the data warehouse.

This includes, but isn’t limited to, the following (sketched in code after the list):

Converting Excel spreadsheets to CSV

Parsing JSON data (we tend to put each object on its own row in a single column and let Redshift parse it, but you could parse it upfront too)

Cleaning up bad or erroneous data files
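
As a rough illustration, here’s what the first two items might look like in Python. The file names are hypothetical, and it assumes pandas (with openpyxl for Excel support) is installed:

```python
import json
import pandas as pd  # assumes pandas and openpyxl are installed

# Convert an Excel sheet to CSV (hypothetical file names).
df = pd.read_excel("monthly_report.xlsx", sheet_name=0)
df.to_csv("monthly_report.csv", index=False)

# Rewrite a JSON array as one object per line, so each object lands
# on its own row in a single column when loaded into the warehouse.
with open("events.json") as src, open("events.jsonl", "w") as dst:
    for obj in json.load(src):
        dst.write(json.dumps(obj) + "\n")
```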

Once the files are ready, you will need a central place to put them for loading into the data warehouse.

One example could be putting all files into an Amazon S3 bucket. It’s versatile, cheap and integrates with many technologies. If you are using Redshift for your data warehouse, S3 integrates particularly well, since Redshift can load directly from it.
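
For instance, here’s a minimal sketch of pushing a preprocessed file up to S3 with boto3. The bucket name and key layout are assumptions, not a prescription:

```python
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
# Hypothetical layout: one prefix per source system and load date.
s3.upload_file(
    "events.jsonl",                 # local file from preprocessing
    "my-warehouse-landing",         # hypothetical bucket name
    "crm/2024-01-01/events.jsonl",  # hypothetical object key
)
```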

Staging

The staging area is the bread and butter of any data warehouse.

A good data warehouse takes data from many different sources. Each data source comes with its own nuances, styles and naming conventions.

The staging area is the place to bring all this in, most likely from wherever you put it after preprocessing (but not always), and store it transiently until it’s processed further down the line.

Think of the loading area at an actual warehouse: the place where freight is unloaded isn’t the final destination or the final form of the materials or products. It’s just a holding area.


It simply means that, for the first time, all the data sits within the bounds of the warehouse, ready for further processing and modelling.

My personal opinion is that data in the staging area should be as close to raw as possible (again, you have to make some changes when preprocessing, but that shouldn’t change what the raw data tells you). You may even want to keep the original table and column names. This makes it easier to trace back when investigating or reporting issues in the source.
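
To make that concrete, here’s a minimal sketch of a staging table that keeps the source’s own column names, loaded with Redshift’s COPY command. The table, file, bucket and IAM role are all hypothetical:

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",  # hypothetical cluster
    dbname="warehouse", user="loader", password="...", port=5439,
)
with conn, conn.cursor() as cur:
    # Column names mirror the source file exactly, plus a load date
    # so we know when each row arrived.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS staging.crm_contacts (
            "CustomerID"   VARCHAR(50),
            "EmailAddress" VARCHAR(255),
            "SignupDate"   VARCHAR(50),
            load_date      DATE DEFAULT CURRENT_DATE
        )
    """)
    # Load straight from the S3 landing area; load_date takes its default.
    cur.execute("""
        COPY staging.crm_contacts ("CustomerID", "EmailAddress", "SignupDate")
        FROM 's3://my-warehouse-landing/crm/2024-01-01/contacts.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
        FORMAT AS CSV
        IGNOREHEADER 1
    """)
```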

The staging area should also be seen as transient.

You should retain data in the staging area for a set period of time, after which it should be purged. For example, you might keep a rolling window of one month’s worth of data in case of failed loads or other investigations.
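
Here’s a sketch of such a purge, reusing the hypothetical staging table from above (the scheduling itself would live in whatever orchestration tool you use):

```python
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",  # hypothetical cluster
    dbname="warehouse", user="loader", password="...", port=5439,
)
with conn, conn.cursor() as cur:
    # Keep a rolling one-month window; anything older gets purged.
    cur.execute("""
        DELETE FROM staging.crm_contacts
        WHERE load_date < DATEADD(month, -1, CURRENT_DATE)
    """)
```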