The core innovations that created the discipline of software engineering are:

1. The ability to compile a set of inputs to executable outputs
2. Version control systems to keep track of the inputs

Before these systems, back in the 1960s, software development was a craft in which a single craftsman had to deliver an entire working system. These innovations enabled new organizational structures and processes to be applied to the creation of software, and programming became an engineering discipline. This is not to say that the art of programming is not extremely important; it’s just not the topic of this article.

The first step in moving from craft to engineering was the ability to express programs in higher-level languages through compilers. This made programs easier to understand for the people writing them, and easier to share across multiple people on a team, because a program could be broken down into multiple files. Additionally, as compilers got more advanced, they added automated improvements to the code by passing it through many intermediate representations.

By adding a consistent version control system across all of the changes that went into producing the system, the art of coding became “measurable” over time (in the sense of the quote often attributed to Peter Drucker: “you cannot manage what you cannot measure”). From there, all sorts of incremental innovations, like automated tests, static analysis for code quality, refactoring, and continuous integration, were added to define additional measures. Most importantly, teams could file and track bugs against specific versions of the code and make guarantees about specific aspects of the software they were delivering. Obviously there have been many other innovations to improve software development, but it is hard to think of ones that aren’t dependent in some way on compilers and version control.

Everything-as-code: Applying Software Engineering’s core innovations elsewhere

In recent years, these core innovations have been applied to new areas, leading to a movement aptly titled everything-as-code. While I wasn’t personally there, I can only assume that software developers met the first version control systems (like SCCS) back in the ’70s with a skeptical eye. In much the same way, many new areas consumed by the everything-as-code movement have garnered similar skepticism, with some even claiming that their discipline could never be reduced to code. Then, within a few years, everything within the discipline was reduced to code, and this led to many-fold improvements over the “legacy” way of doing things.

Turning code into infrastructure using a “compiler” layer of virtualization and configuration management

The first area of expansion was infrastructure provisioning. In this example, the code is a set of config files and scripts specifying the infrastructure configuration across environments, and the compilation happens within a cloud platform, where the config is read and executed alongside scripts against the cloud service APIs to create and configure virtual infrastructure. While it may seem like the Infrastructure as Code movement swept through all infrastructure teams overnight, a ton of amazing innovations (virtual machines, software-defined networks, resource management APIs, etc.) went into making the “compilation” step possible. This likely started with proprietary solutions from firms like VMware and Chef, but it became widely adopted when public cloud providers made the core functionality free to use on their platforms. Before this shift, infrastructure teams managed their environments to ensure consistency and quality because the environments were hard to recreate. This led to layers of governance designed to apply control at various checkpoints in the development process. Today, DevOps teams engineer their environments, and the controls can be built into the “compiler”. This has created an orders-of-magnitude improvement in the ability to deploy changes, going from months or weeks to hours or minutes.
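The “compilation” step described above can be sketched in a few lines. This is a toy illustration, not any real provider’s API: `CloudClient` is a hypothetical stand-in for a cloud SDK, and the declarative config plays the role of the source code.

```python
# Toy sketch of infrastructure-as-code "compilation": a declarative
# config (the source) is walked and turned into provisioning calls.
# CloudClient is a hypothetical stand-in for a real provider SDK.

CONFIG = {
    "network": {"cidr": "10.0.0.0/16"},
    "servers": [
        {"name": "web-1", "size": "small"},
        {"name": "web-2", "size": "small"},
    ],
}

class CloudClient:
    """Stub provider API; a real one would issue HTTP calls."""
    def __init__(self):
        self.resources = []

    def create(self, kind, spec):
        # Record the resource instead of actually provisioning it.
        self.resources.append((kind, spec))
        return spec

def compile_infrastructure(config, client):
    # The "compile" step: read the config and emit provisioning calls.
    client.create("network", config["network"])
    for server in config["servers"]:
        client.create("server", server)
    return client.resources

client = CloudClient()
resources = compile_infrastructure(CONFIG, client)
```

Because the environment is fully described by `CONFIG`, recreating it from scratch is just a matter of re-running the compile step, which is what makes the shift from managing environments to engineering them possible.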

This enables a complete rethink of the possibilities for improving infrastructure. Teams started to codify each of the stages for creating their system from scratch, making compilation, unit testing, analysis, infrastructure setup, deployment, and functional and load testing a fully automated process (Continuous Delivery). Additionally, teams started testing that the system was secure both before and after deployment (DevSecOps). As each new component moves into version control, the evolution of that component becomes measurable over time, which will inevitably lead to continuous improvement, because we can now make guarantees about specific aspects of the environments we deliver.
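The codified stages above can be sketched as an ordered, fail-fast pipeline. The stage names follow the list in the text; the stage bodies here are placeholders for real build, test, and deploy tooling.

```python
# Sketch of a continuous-delivery pipeline: each stage is code, so the
# whole path from source to deployed system is automated and repeatable.
# The lambda bodies are placeholders for real tooling.

def run_pipeline(stages):
    results = []
    for name, stage in stages:
        ok = stage()
        results.append((name, ok))
        if not ok:  # fail fast, like a broken build stopping a release
            break
    return results

STAGES = [
    ("compile", lambda: True),
    ("unit-test", lambda: True),
    ("static-analysis", lambda: True),
    ("provision-infra", lambda: True),
    ("deploy", lambda: True),
    ("functional-test", lambda: True),
    ("load-test", lambda: True),
]

results = run_pipeline(STAGES)
```

Because every stage is code under version control, each run is repeatable and each stage’s pass/fail history becomes one of the measures the text describes.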

Getting to the point: the same thing will happen to data governance

The next field to be consumed by this phenomenon will be data governance / data management. I’m not sure what the name will be (DataOps, Data as Code, and DevDataOps all seem a bit off), but its effects will likely be even more impactful than DevOps/infrastructure as code.

Data pipelines as compilers

“With Machine Learning, your data writes the code.” — Kris Skrinak, ML Segment Lead at AWS

The rapid rise of Machine Learning has provided a new way to build complex software (typically for classifying or predicting things, but it’s going to do more over time). This mindset shift to thinking of the data as the code will be a key first step to converting data governance to an engineering discipline. Said another way:

“Data pipelines are simply compilers that use data as the source code.”
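In that spirit, here is a toy illustration in pure Python (no ML library): the “compiler” fits a decision threshold from labeled data and emits an executable classifier function, so changing the data changes the resulting behavior without editing any program text. The midpoint rule is an assumption chosen only for simplicity.

```python
# Toy "data compiler": the labeled data is the source; the output is an
# executable classifier. Re-running with different data produces
# different behavior without changing any program text.

def compile_classifier(examples):
    """examples: list of (value, label) pairs, labels 0 or 1."""
    positives = [v for v, label in examples if label == 1]
    negatives = [v for v, label in examples if label == 0]
    # Simplistic rule: split the classes at the midpoint between them.
    threshold = (min(positives) + max(negatives)) / 2
    return lambda value: 1 if value >= threshold else 0

data = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]
predict = compile_classifier(data)
```

Real data pipelines replace the midpoint rule with feature engineering and model training, but the shape is the same: data in, executable out.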

There are three things that are different, but also more complex, about these “data compilers” compared to those for software or infrastructure:

1. Data teams own both the data processing code and the underlying data. But if the data is now the source code, it’s as if each data team is writing its own compiler to build something executable from the data.

2. With data, we have been specifying the structure of the data manually through metadata, because this helps the teams writing the “data compiler” understand what to do at each step. Software and infrastructure compilers typically infer the structure of their inputs.

3. We still don’t really understand how data writes code. This is why we have data scientists experiment to figure out the logic of the compilers, and data engineers come in later to build the optimizers.

The current set of data management technology platforms (Collibra, Waterline, Tamr, etc.) are built to enable this workflow, and they’re doing a pretty good job. However, the workflow they support still makes the definition of data governance a manual process handled in review meetings, which holds back the type of improvements we saw after the advent of DevOps & Infrastructure as Code.

The missing link: Data Version Control

Applying data version control. Credit to the DVC Project: https://dvc.org/

Because data is generated “in the real world,” not by the data team, data teams have focused on controlling the metadata that describes it. This is why we draw the line between data governance (trying to manage something you can’t directly control) and data engineering (where we are actually engineering the data compilers rather than the data itself). Currently, data governance teams apply manual checks at various points to control the consistency and quality of the data. The introduction of version tracking to the data would allow data governance and engineering teams to engineer the data together: filing bugs against data versions, applying quality control checks to the data compilers, and so on. This would allow data teams to make guarantees about the system components that the data delivers, which history has shown will inevitably lead to orders-of-magnitude improvements in the reliability and efficiency of data-driven systems.
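Mechanically, “versioning the data” can be sketched as a content-addressed store, the same idea Git and tools like DVC build on: each dataset snapshot gets an immutable version ID derived from its contents, so a bug report or quality check can reference an exact version. This is an illustrative sketch, not any particular tool’s implementation.

```python
import hashlib
import json

# Sketch of content-addressed data versioning: each snapshot of a
# dataset is identified by a hash of its contents, so bug reports,
# test results, and lineage records can point at an exact version.

class DataStore:
    def __init__(self):
        self.versions = {}

    def commit(self, dataset):
        # Serialize deterministically, then hash to get the version ID.
        blob = json.dumps(dataset, sort_keys=True).encode()
        version_id = hashlib.sha256(blob).hexdigest()[:12]
        self.versions[version_id] = dataset
        return version_id

    def checkout(self, version_id):
        return self.versions[version_id]

store = DataStore()
v1 = store.commit([{"user": 1, "score": 0.9}])
v2 = store.commit([{"user": 1, "score": 0.9}, {"user": 2, "score": 0.4}])
```

Note that committing the same data always yields the same version ID, which is what lets two teams agree they are talking about the same snapshot.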

The data version control tipping point has arrived

Platforms like Palantir Foundry already treat the management of data in much the same way as developers treat the versioning of code. Within these platforms, datasets can be versioned, branched, and acted upon by versioned code to create new datasets. This enables data driven testing, where the data itself is tested in much the same way that the code modifying it might be tested by a unit test. As data flows through the system, its lineage is tracked automatically, as are the data products produced at each stage of each data pipeline. Each of these transformations can be considered a compile step, converting the input data into an intermediate representation, before machine learning algorithms convert the final intermediate representation (which data teams usually call the feature-engineered dataset) into an executable form to make predictions. If you have $10M–$40M lying around and are willing to go all in with a vendor, the integration of all of this in Foundry is pretty impressive (disclaimer: I don’t have a ton of hands-on experience with Foundry; these statements are based on demos I’ve seen of real implementations at clients).

The Databricks Delta Lake open source project enables data version control for data lakes

For the rest of us, there are now open source alternatives. The Data Version Control (DVC) project is one option, focused on data scientist users. For big data workloads, Databricks has taken the first step in open sourcing a true version control system for data lakes with the release of its Delta Lake project. These projects are brand new, so branching, tagging, lineage tracking, bug filing, etc. haven’t all been added yet, but I’m pretty sure the community will add them over the next year or so.

The next step is to rebuild data governance

The arrival of technology for versioning and compiling data puts the onus on data teams to start rethinking how their processes can take advantage of this new capability. Those who actively leverage the capability to make guarantees will likely create a massive competitive advantage for their organizations. The first step will be killing off the checkpoint-based governance process. Instead, the data governance, science, and engineering teams will work closely together to enable continuous governance of data as it is compiled by data pipelines into something executable. Somewhere behind that will be the integration of the components compiled from data alongside the pure software and infrastructure as a single unit, although I don’t think the technology to enable this exists yet. The rest will emerge over time (and in other posts), enabling a culture of governance that reduces major issues while accelerating the time to value for machine learning initiatives. I know it sounds crazy to say, but this is an exciting time to be in data governance.