The data ops team should provide tools to lower the barrier for all your employees, not only the Java or Python developers.

ETL/ELT should be easy for everyone, whether via SQL instead of Java or through a data integration tool with a drag-and-drop interface, such as [CDAP] or [Matillion].

Make sure it is easy to adopt machine learning. Nowadays, you can achieve great results with AutoML products or by writing ML directly in SQL with [BigQuery ML]. Of course, if your ML experts want to train and deploy custom-made models, that should be possible too; have a look at [Kubeflow].
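To give an idea of how low the barrier can be, here is a minimal sketch of training a model in plain SQL through the BigQuery Python client. The dataset, table and column names are made up for the example:

```python
# Illustrative only: train a churn model in SQL with BigQuery ML.
# The table `my_dataset.user_features` and its columns are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

train_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT country, subscription_plan, days_active, churned
FROM `my_dataset.user_features`
"""

# An analyst could run the exact same SQL from the BigQuery console.
client.query(train_model_sql).result()
```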

References:

Democratizing data analysis, visualization, and machine learning in a secure way is a top priority [for the Data Platform team at Twitter].

Watch [this talk from Spotify] about “lowering the friction”, revealing that 25% (!!) of their employees now use the data warehouse.

4. About this semantic layer

Having a semantic layer seems like a good idea at first so that the kids know where to go to find their favorite book in the library.

In a legacy data warehouse, the data would typically be modeled as a Data Vault or Kimball star schema to support the semantic layer. But these methods are now [25–30 years old] and if the rise of NoSQL already disrupted data modeling once, big data and cloud might be another nail in the coffin for normalized data models.

The truth is that it is really hard to maintain an overarching semantic layer that makes sense for everyone in the company and remains easy for everyone to query. Over time, there is even a risk that this semantic layer slows down your analytics because of its lack of flexibility.

To illustrate this, let’s take an example and define a semantic layer representing an app like Spotify (see below).

Given this simplified snowflake schema, here is an example request: give me the top 10 artists played in Sweden in 2018.

The data model should allow for such a query without having to join 5 or 6 tables in a complex and error-prone query that will end up scanning your entire dataset. The queries you make should define your data model.

Wouldn’t it be easier, for someone who is not an expert at writing SQL queries, to find the top 10 artists from a generated <top_played_artist> table instead?
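To make the difference concrete, here is roughly what the two queries could look like. The table and column names are invented for the sake of the example:

```python
# Illustrative only: hypothetical table and column names.

# Against the normalized snowflake schema: several joins before aggregating.
normalized_query = """
SELECT ar.artist_name, COUNT(*) AS plays
FROM plays p
JOIN tracks t    ON p.track_id = t.track_id
JOIN albums al   ON t.album_id = al.album_id
JOIN artists ar  ON al.artist_id = ar.artist_id
JOIN users u     ON p.user_id = u.user_id
JOIN countries c ON u.country_id = c.country_id
WHERE c.country_code = 'SE'
  AND p.played_at BETWEEN '2018-01-01' AND '2018-12-31'
GROUP BY ar.artist_name
ORDER BY plays DESC
LIMIT 10
"""

# Against a generated, denormalized table: trivial for a SQL beginner.
denormalized_query = """
SELECT artist_name, play_count
FROM top_played_artist
WHERE country = 'SE' AND year = 2018
ORDER BY play_count DESC
LIMIT 10
"""
```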

To be truly data driven, you have to free the data and let different teams copy-paste it, build pipelines and slice it as they please to solve their problems, instead of forcing them to format it according to a generic semantic layer.

This is where data contracts come into play.

Note that I am not completely dismissing the idea of keeping a semantic layer. It can be useful if you don’t know what you are looking for. For data exploration in a BI dashboard, a semantic layer that brings different data points together can make sense. But it should be defined as a logical layer, for example with [LookML], and not enforce any specific schema on the physical tables.

Additional reading:

[Is Kimball dimensional modeling still relevant in a modern data warehouse?]

5. Data contracts

You can say that the [API Mandate] is what changed Amazon and gave birth to AWS. I believe that if you want to build a data driven company, you should look at adopting similar principles and treat data almost like an API endpoint. Your teams developing ETL/ELT pipelines should start communicating using data contracts, measured by [SLOs].

Going back to our top 10 artists example above, here is a typical process to solve this problem using data contracts:

1. First, let’s not reinvent the wheel. Maybe there is a data endpoint (= shared dataset) already created by another team that solves our problem. Look for it in your [Data Catalog].

2. If no existing dataset is found, go talk to the team that owns the data you need to calculate the top 10 artists. If multiple teams have this data, find the source of truth. Two options:

a) Looking at the company data lineage, the source of truth could be the origin (most upstream dataset). Often, it is a data dump or [CDC] from an OLTP system.

Needless to say: don’t connect your data pipeline directly to their OLTP system.

b) Looking at the data governance metadata tags, there could be a (downstream) dataset promoted as the source of truth.

This is called MDM (Master Data Management) and I recommend [this excellent read] on the topic if you are moving to BigQuery.

3. Once you find the team to talk to, agree on a data contract and make them expose a data endpoint for you, exactly like you would build a loosely coupled microservice architecture between the two teams.

If team 2 above is not versed in data engineering, maybe team 1 can calculate the top 10 artists instead and expose the result in a dataset, or maybe you agree that another team 3 does that for the rest of the company.

This data contract is about agreeing on a data schema and service-level objectives: availability, response time, quality, etc. By talking to each other, the two teams know that they have created this link, and if an update is required, they will have to agree on a new contract. This contract ensures that this implementation does not turn into a deprecated dependency in a few months when new engineers are hired.

You might have spotted that we denormalized our snowflake schema. We are trading a bit of consistency and storage space in order to make our queries simpler. There is no longer a complex semantic layer that has to accommodate both teams. The queries can now be written by a SQL beginner.

It is a best practice to version your schemas and upload them to a company schema registry that you reference in the data contract documentation.
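As a sketch of what such a contract could contain, something as simple as the structure below already goes a long way. The format, field names, versions and SLO targets here are all made up for illustration, not a standard:

```python
# Illustrative data contract for the shared top_played_artist dataset.
top_played_artist_contract = {
    "dataset": "music.top_played_artist",
    "owner": "team-1",
    "schema_version": "1.2.0",  # referenced in the company schema registry
    "schema": [
        {"name": "artist_id",   "type": "STRING", "mode": "REQUIRED"},
        {"name": "artist_name", "type": "STRING", "mode": "REQUIRED"},
        {"name": "country",     "type": "STRING", "mode": "REQUIRED"},
        {"name": "year",        "type": "INT64",  "mode": "REQUIRED"},
        {"name": "play_count",  "type": "INT64",  "mode": "REQUIRED"},
    ],
    "slos": {
        "availability": "daily partition present by 08:00 UTC",
        "freshness_hours": 24,
        "completeness": "play_count reconciles with upstream counters within 1%",
    },
}
```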

6. Put up guardrails and care about the cost later

You shouldn’t shy away from duplicating data using the data contract methodology described above. Storage in the cloud is cheap: you only pay for what you use, and you can adjust the knobs if cost becomes an issue.

Trying to optimize storage space with a star schema* in the cloud is like caring about which TCP variant you use when you connect to the Internet: it was fun in the 90s. Instead, spend your engineering time on [over-the-top] use cases. Now that you are no longer limited by your previous data center capacity, start ingesting more logs, more data. It is perfectly fine to pay $1000 in storage if you gain 100x that amount in additional revenue thanks to data driven decisions.

* A star schema used to be a good idea to compress the fact table before columnar storage was invented :)

7. Data university

Convert some of your software/backend engineers to data engineers and run an internal [Data University] programme!

With a framework like Apache Beam, a software developer who knows Java or Python should be able to quickly learn how to create their first data pipeline.
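As an illustration, here is a minimal Beam pipeline in Python that counts plays per artist. The file paths and CSV layout are assumptions made for the example:

```python
# A possible first Beam pipeline: count how many times each artist was played.
import apache_beam as beam

def extract_artist(line: str) -> str:
    # Assumed CSV layout: user_id,artist_name,track_id,played_at (no header row)
    return line.split(",")[1]

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadEvents" >> beam.io.ReadFromText("plays.csv")
        | "ExtractArtist" >> beam.Map(extract_artist)
        | "CountPerArtist" >> beam.combiners.Count.PerElement()
        | "Format" >> beam.MapTuple(lambda artist, plays: f"{artist},{plays}")
        | "WriteResults" >> beam.io.WriteToText("plays_per_artist")
    )
```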

8. Data quality is not optional

[Incorrect data is worse than no data]. To ensure data quality, you can:

Foster a quality-first mindset and trust your data engineers to test what they build. This can work better than hiring a dedicated team of QA testers. Gamification can encourage the right behavior. For example: introduce different levels of test certification for a data pipeline, with a reward system when you level up. Read about [TC4D] at Spotify.

Break down your business process into multiple workflows. Now that you let the different product teams develop their own data pipelines, they will need some kind of orchestration tool like [Airflow]. In case something goes wrong and data needs to be reprocessed, Airflow can retry or backfill a given workflow.

It is often crucial to be able to trace back a calculation to something materialized, and Airflow can also draw the lineage of the different transformation steps (upstream inputs & downstream outputs).
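As a sketch, a daily Airflow DAG with retries and explicit upstream/downstream dependencies could look like this. The DAG id, task names and callables are placeholders, not a recommended layout:

```python
# A sketch of a daily Airflow DAG (Airflow 2.4+ syntax) with retries; the task
# bodies are placeholders for whatever your pipeline actually does.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

def load_plays(**context):
    ...  # e.g. ingest the daily partition for context["ds"]

def build_top_played_artist(**context):
    ...  # e.g. aggregate plays into the shared top_played_artist dataset

with DAG(
    dag_id="top_played_artist",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",  # one run per daily partition makes backfills straightforward
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    load = PythonOperator(task_id="load_plays", python_callable=load_plays)
    aggregate = PythonOperator(
        task_id="build_top_played_artist", python_callable=build_top_played_artist
    )

    # The dependency below is also what the lineage view is drawn from.
    load >> aggregate
```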

Another alternative to look at is [dbt] or [dataform] if your pipelines are all written in SQL (or you can combine them with Airflow).

Create a test environment that stores a duplicate of the production data so you can run acceptance tests. Implement DevOps pillars such as CI/CD (run what you build), code reviews, [infrastructure as code]…

Verify that the tools you choose have some kind of testing framework included. For example: how do you write tests if all your pipelines are written in SQL or are implemented in a graphical user interface?

Here is an example of a [simple upstream health check] written in SQL.
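The linked check is specific to that article; as a rough sketch of the idea (the table and column names are made up, and BigQuery is assumed), a downstream job can simply refuse to run if yesterday's upstream partition looks empty:

```python
# Sketch of a simple upstream health check: fail fast if yesterday's partition is empty.
from google.cloud import bigquery

check_sql = """
SELECT COUNT(*) AS row_count
FROM `music.plays`
WHERE DATE(played_at) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
"""

client = bigquery.Client()
rows = list(client.query(check_sql).result())
if rows[0].row_count == 0:
    raise ValueError("Upstream partition is empty; refusing to run downstream steps")
```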

Set up monitoring in Stackdriver or Grafana and send alerts when SLOs are not met. Here are a few things you can monitor:

- Is the data delivered on time? For example: at 8am every day in a daily partition.

- Is the data complete? Use [data counters].

- Is the data correct? It must be formatted according to the schema agreed in the data contract.

- Is the data consistent? Run a few end-to-end tests to check that data across different systems means the same thing; otherwise the boat might sink.

Try to avoid mutations (DML) and their side effects, keep your raw files (ELT instead of ETL), partition all your tables, and keep your tasks “pure”, as described in the [functional data engineering] guide.
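To make the “pure tasks” idea concrete, here is a sketch (project, dataset and table names are invented, and a date-partitioned BigQuery table is assumed): instead of mutating rows in place, each run overwrites exactly one date partition, so retrying or backfilling a given day always produces the same result.

```python
# Sketch of an idempotent, side-effect-free task: overwrite one date partition per run.
from google.cloud import bigquery

def rebuild_daily_partition(run_date: str) -> None:
    client = bigquery.Client()
    # Target the partition for run_date, e.g. top_played_artist_daily$20180601
    destination = bigquery.DatasetReference("my-project", "music").table(
        "top_played_artist_daily$" + run_date.replace("-", "")
    )
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    )
    sql = f"""
        SELECT artist_name, country, COUNT(*) AS play_count, DATE('{run_date}') AS day
        FROM `music.plays`
        WHERE DATE(played_at) = '{run_date}'
        GROUP BY artist_name, country
    """
    client.query(sql, job_config=job_config).result()
```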

Reference: [How Spotify solved data quality]

9. Establish some controls

By letting the kids into the library, you have essentially created chaos :)

Data governance is an important piece of the puzzle to bring a bit of order to this chaos.

It is thus important to set some rules; a golden path written by your newly-formed data ops team. Empower your data engineers to be your guides in this transformation.

With data contracts all over the place, it is like with microservices: it is a big mess until you bring in something like [Istio].

If you store personal information and need to anonymize it, how do you do that once you have opened the gates to your data warehouse and data is all over the place? This might be a good solution: [crypto delete].
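I won’t repeat the linked approach here, but the general idea behind crypto-shredding can be sketched in a few lines. This is a toy example, not production code: encrypt personal fields with a per-user key, and “delete” the user by destroying that key, no matter how many copies of the ciphertext exist downstream.

```python
# Toy sketch of crypto-shredding: deleting the key makes every copy unreadable.
from cryptography.fernet import Fernet

user_keys = {}  # in reality: a tightly controlled key store, not an in-memory dict

def encrypt_pii(user_id: str, value: str) -> bytes:
    key = user_keys.setdefault(user_id, Fernet.generate_key())
    return Fernet(key).encrypt(value.encode())

def crypto_delete(user_id: str) -> None:
    # No need to hunt down every dataset that copied the encrypted value.
    user_keys.pop(user_id, None)

token = encrypt_pii("user-42", "jane.doe@example.com")
crypto_delete("user-42")  # from now on, token can never be decrypted again
```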

Some final advice on this topic: enable audit logs, set a retention period on your data, use [the principle of least privilege] and implement [ITGCs].

10. Discover new data

Do you know what Google Photos and Spotify have in common? They are both really great for discovering new photos and music!

A few years ago, when the first digital photographs and MP3 files started to appear, the apps we used back then mirrored our previous non-digital habits of crafting photo albums and creating mixtapes.

That’s because [as Elon Musk explains], it is mentally easier to reason by analogy rather than from first principles.

When moving to the cloud, you have to take the opportunity to remove all your existing biases and known constraints, until you are at a point where nothing more can be deduced. You are back to first principles, and you can start building a new and better solution back up. That mindset is what Google Photos and Spotify eventually applied to their product development, bringing new digital experiences to life: instead of classifying photos, we now upload everything to the cloud because we take 500 photos of our kids per day and don’t have the time to sort them. And instead of listening to one or two playlists, we let Spotify suggest songs based on the mood we are in.

Access to unlimited photos and music transformed our habits. We could say that the same shift is happening with data. Companies are storing more and more data, running A/B tests, canary releasing and comparing key metrics along the way to validate their choices.

Back in the day, you had to carefully classify and delete redundant data, otherwise your data warehouse would fill up too quickly. Nowadays it matters less, and just like Google Photos and Spotify, you should reinvent the way you work with data and focus on discovery.