Ways to democratize data in your organization

The aim of this section is to discuss several ways, none of them mutually exclusive, to democratize data in an organization. Some may be useful to your organization and some may not be a good fit, but hopefully they help you formulate your own solutions!


Team structures

The decentralized Data Team

Embedding Data Engineers in each team allows teams to be self-sufficient. The technical barrier to exploring data and creating data pipelines might still exist, but each team now has the skills needed to be independent.

These Data Engineers also become familiar with the team’s mission and gain domain expertise, which enables them to better understand requirements.

The Data Platform Team

A team focused on data infrastructure and tooling can help reduce the technical skills needed to perform common tasks. This tooling can be “off-the-shelf”, open source or developed by the team.

Examples of “data platforming” projects:

A Web Service to allow anyone to ingest arbitrary files into the data-lake and define schemas.

Hosting Airflow and providing it as a service to other teams.

Abstracting the complexities of defining cluster resources for Spark by providing templates for EMR (e.g. “I want a small cluster” instead of “I want a cluster with 1 r5.2xlarge master node on demand, 5 r5.4xlarge Spot fleet…”).
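As a rough illustration of that last idea, here is what such a template layer could look like in Python with boto3. The size names, instance choices, and release label are illustrative assumptions, not a real configuration:

```python
import boto3

# Hypothetical "t-shirt size" templates for EMR clusters; the sizes and
# instance choices are illustrative, not an actual production configuration.
CLUSTER_TEMPLATES = {
    "small": [
        {"InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": "r5.2xlarge", "InstanceCount": 1},
        {"InstanceRole": "CORE", "Market": "SPOT",
         "InstanceType": "r5.4xlarge", "InstanceCount": 5},
    ],
    # "medium", "large", ... would follow the same shape.
}

def launch_cluster(size: str, name: str):
    """Launch an EMR cluster from a named template instead of raw instance specs."""
    emr = boto3.client("emr")
    return emr.run_job_flow(
        Name=name,
        ReleaseLabel="emr-5.30.0",  # illustrative release
        Instances={"InstanceGroups": CLUSTER_TEMPLATES[size]},
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )

# A data consumer only has to reason about "small" vs "large";
# the platform team keeps the instance-level details in one place:
# launch_cluster("small", "ad-hoc-analysis")
```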

Metadata Hubs

A “Metadata Hub” typically takes the form of an internal search engine that catalogues datasets and exposes the information a potential data consumer needs in order to use a table. At a minimum, it should answer the following questions:

What the table contents are (including column-level information).

Who the table owner is (team).

How healthy the table is (comment on the table, or QC score).

What the data lineage is (e.g. a link to an Airflow DAG, or Atlas).

What the update strategy on the table is (e.g. daily snapshot, weekly incremental update).
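To make this concrete, here is a sketch of the minimal record such a hub could store per table. The field names are illustrative assumptions, not the schema of any particular Metadata Hub:

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Illustrative sketch of a minimal catalogue entry; the field names are
# assumptions, not the schema of Amundsen, DataHub or Metacat.
@dataclass
class TableMetadata:
    name: str                          # e.g. "analytics.daily_transactions"
    columns: Dict[str, str]            # column name -> description
    owner_team: str                    # who to contact about the table
    qc_score: Optional[float] = None   # table health, e.g. share of passing checks
    lineage_url: Optional[str] = None  # e.g. a link to an Airflow DAG or Atlas
    update_strategy: str = "daily snapshot"  # or "weekly incremental update", ...
```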

Implementations of Metadata Hubs

Many companies have implemented a Metadata Hub solution:

Amundsen, DataHub and Metacat are open source and available on GitHub. Leveraging these open source tools can make it easier for Data Engineering teams with limited resources to support a Metadata Hub.


Data Quality checks (QC)

Ensuring high data quality when the creation of tables is distributed can be challenging. The QC processes also need to be distributed, and the people creating tables should be able to QC the tables they own!

At Earnest, we have two in-house QC tools plus a machine-learning-enabled anomaly detection tool, which distribute emails and information about the health of tables. However, we found it hard to scale to the analysts’ demand for more QC checks, as each new check required engineering time.

We have started to explore Great Expectations (GE) and we hope to scale by enabling analysts to take part in the QC effort.
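As a rough sketch of why GE lowers that barrier, here is what a couple of checks look like with its Pandas-backed API (as of the 0.x releases); the file and column names are made up:

```python
import great_expectations as ge

# Load a table extract as a Great Expectations dataset
# (file and column names are illustrative).
df = ge.read_csv("daily_transactions.csv")

# Analysts declare checks in plain Python, with no pipeline code to write:
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

# Run every expectation declared above and print the summary,
# which includes a success flag per expectation.
results = df.validate()
print(results)
```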


The Data Engineering Role

Theoretically, if every potential data consumer in the company were given the tools they need to write ETL, it would be one of the biggest enablers of data democratization. So, naturally, this is something I am very interested in (and you should be too!).

“Engineers should not write ETL. For the love of everything sacred and holy in the profession, this should not be a dedicated or specialized role. There is nothing more soul sucking than writing, maintaining, modifying, and supporting ETL to produce data that you yourself never get to use or consume.” (source)

Data Engineers should help simplify the process of writing ETL, and the organization should train data consumers to use the tools provided by the Data Engineers.

A company needs to find a middle-ground between:

Data Engineers creating completely abstracted tools that data consumers with no technical knowledge can use, but which require a significant engineering effort to create.

Data Engineers providing tools that require so much technical knowledge that training data consumers to use them is not viable.

Data Build Tool (DBT) is a good example of such a tool for writing transformation logic. It enforces a structure on SQL-based projects and provides templating of SQL using Jinja (Python). Assuming the data consumers already know SQL, training them on the DBT specifics would allow them to be autonomous in writing maintainable transformations.
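To illustrate the underlying idea (this is not DBT’s actual compilation pipeline), here is SQL templated with the jinja2 library directly; the table, column, and variable names are made up:

```python
from jinja2 import Template

# A DBT-style model: SQL with Jinja placeholders
# (table, column, and variable names are illustrative).
model_sql = Template("""
select
    user_id,
    sum(amount) as total_spend
from {{ source_table }}
where event_date >= '{{ start_date }}'
group by user_id
""")

# Rendering resolves the placeholders into plain, runnable SQL.
print(model_sql.render(source_table="raw.transactions", start_date="2020-01-01"))
```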


The role of data engineers when they are not writing data pipelines

Data Engineers provide and support the individual components that make up pipelines. For example, at Earnest, we run most of our tasks containerized on Airflow, so as Data Engineers we maintain multiple containerized CLI tools for tasks like translating Hive schemas to Redshift schemas, running Spark jobs on Livy, and running our custom QC tasks.
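As a hedged sketch of this pattern, a pipeline built from such components could look like the following Airflow DAG using the stock DockerOperator; the image names and commands are hypothetical, not our actual tools:

```python
from datetime import datetime

from airflow import DAG
# In Airflow 2.x this import lives at airflow.providers.docker.operators.docker
from airflow.operators.docker_operator import DockerOperator

# Illustrative DAG where each task wraps a containerized CLI tool;
# the image names and commands are hypothetical.
with DAG(
    dag_id="example_containerized_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
) as dag:

    translate_schema = DockerOperator(
        task_id="hive_to_redshift_schema",
        image="internal/schema-translator:latest",
        command="translate --source hive --target redshift",
    )

    run_qc = DockerOperator(
        task_id="run_qc_checks",
        image="internal/qc-runner:latest",
        command="qc --table analytics.daily_transactions",
    )

    translate_schema >> run_qc  # QC runs after the schema translation
```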

Data Engineers also provide tools (e.g. a Metadata Hub, Airflow, Great Expectations, Snowflake) that they support, extend, and build abstractions on top of, to increase the productivity of data consumers.

Once Data Engineers are freed up from crafting individual pipelines, they can look into tools that benefit the pipelines as a whole. LinkedIn’s Dr-Elephant, a tool to detect common optimization opportunities on Spark and Hadoop jobs, is a good example of such a tool.

Additionally, Data Engineers need to keep enabling data accessibility by making sure their tools can be used by the data consumers directly, requiring minimal or no intervention from engineering.

“Data engineers can focus on pipeline idempotency, integration of new sources inside the data lake, on data lineage and tooling.” (source)

In conclusion

Data democratization is about increasing access to data, making sure that different teams with different skills are equally equipped to ship insights and data products.

The key to successful data democratization might very well be making sure your different processes scale with the increased demand for data.

Your objective should be to identify and remove the bottlenecks.

Summary of problems and suggested solutions in this post:

Technical barriers to exploring data and building pipelines → embed Data Engineers in each team, or build a Data Platform team that abstracts the tooling.

Datasets that are hard to discover, assess, and trust → a Metadata Hub (e.g. Amundsen, DataHub, Metacat).

QC processes that do not scale with analyst demand → distributed data quality tooling such as Great Expectations.

Data Engineers as an ETL bottleneck → train data consumers on tools like DBT, and refocus engineers on components, tooling, and pipeline-wide optimizations.