Why we do machine learning engineering with YAML, not notebooks

Notebooks are great for designing models, not deploying them.

Most data scientists spend the majority of their working hours in a notebook. As a result, most production machine learning platforms prioritize notebook support. If you try out a new production ML platform, chances are its onboarding tutorial will begin with a .ipynb file.

When we built Cortex, our production machine learning platform, we spent a lot of time considering the correct interface for defining production ML pipelines. Ultimately, we decided not to support notebooks, opting instead for YAML config files.

Notebooks were designed for experimentation

Notebooks are the modern incarnation of literate programming, a paradigm introduced in the ‘80s that sought to write code that reflected the programmer’s thoughts—not the computer’s processing—by combining code with natural language.

In all literate programming tools, the emphasis is on presentation, which is a big reason why notebooks are so useful.

For many data scientists, the finished product of a work session is a business analysis. They need to show team members—who oftentimes aren’t technical—how their data became a specific recommendation or insight.

A notebook, where paragraphs of formatted text can sit between cells of code and where charts can be displayed directly beneath the code that generates them, is an ideal format for this presentation.

Even better, notebooks are interactive. Want to see what the chart looks like with a second dataset? Just add a new cell. Want to test a different model? Tweak one line of code and rerun the cell.

However, the same qualities that make notebooks great for exploring and explaining data make them a poor fit for production.

Why we use YAML for production machine learning

When I say production machine learning, I’m referring to machine learning that manifests as a product feature, such as Uber’s ETA prediction or Gmail’s Smart Compose.

The priorities in building a production machine learning pipeline—the series of steps that take you from raw data to product—are not fundamentally different from those of general software engineering. Specifically, they are:

1. Your pipeline should be reproducible

Reproducibility is an issue with notebooks. Because of the hidden state and the potential for arbitrary execution order, generating a result in a notebook isn’t always as simple as clicking “Run All.” Just having another engineer reproduce your results—let alone having your code run automatically as part of a pipeline—is a significant challenge.
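Here is a minimal sketch of the hidden-state problem, using plain Python to imitate a notebook kernel. Every cell runs against one shared namespace, so the result depends on which cells were run and how many times. The cell contents and the `run_cells` helper are invented for illustration.

```python
def run_cells(cells, order):
    """Execute code cells in the given order against one shared namespace."""
    namespace = {}
    for index in order:
        exec(cells[index], namespace)
    return namespace


cells = [
    "scale = 2",               # cell 0: set a parameter
    "scale = scale * 10",      # cell 1: adjust it in place
    "prediction = 5 + scale",  # cell 2: use it
]

# "Run All" executes every cell once, top to bottom.
top_to_bottom = run_cells(cells, [0, 1, 2])["prediction"]

# An interactive session where cell 1 happened to be re-run once.
interactive = run_cells(cells, [0, 1, 1, 2])["prediction"]

print(top_to_bottom, interactive)  # 25 205
```

The saved notebook looks identical in both cases. Only the execution history differs, and that history is not part of what a colleague sees when they open the file.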

Instead of trying to streamline a notebook’s various imports and function calls into a more easily reproducible script, why not use something simple and declarative like YAML?

For example, this cortex.yaml file defines the deployment stage of a pipeline:

The code to be executed, predictor.py, is clear, as are its configuration variables. It’s simple, readable, and will produce predictable results.
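The file itself is not reproduced here. As a rough illustration of why a declarative definition is predictable, the sketch below treats a stage as plain data: a parsed mapping is validated and frozen into an immutable object. The `StageSpec` class, the `load_stage` helper, and the field names are assumptions made up for this example, not Cortex's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpec:
    """Hypothetical parsed form of one pipeline stage (not Cortex's real schema)."""
    name: str
    predictor: str      # path to the code to execute
    config: tuple = ()  # sorted (key, value) pairs, kept immutable


def load_stage(raw):
    """Validate a parsed config mapping and freeze it into a StageSpec."""
    missing = {"name", "predictor"} - set(raw)
    if missing:
        raise ValueError(f"missing required keys: {sorted(missing)}")
    return StageSpec(
        name=raw["name"],
        predictor=raw["predictor"],
        config=tuple(sorted(raw.get("config", {}).items())),
    )


# What a YAML parser might hand back for a small, made-up stage definition.
raw = {"name": "classifier", "predictor": "predictor.py", "config": {"threshold": 0.5}}

first = load_stage(raw)
second = load_stage(dict(raw))

# Same declaration in, same specification out: no session state to drift.
print(first == second)
```

Because the definition is only data, loading it twice cannot produce two different stages, which is the property the notebook lacks.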

Now, there are some projects focused on parameterizing notebooks so that they can be treated as pure functions, but it’s always felt like an unnecessary “square peg in a round hole” effort to me.

2. Collaborating on your pipeline should be easy

Version control is at the heart of any modern engineering org. The ability for multiple engineers to asynchronously contribute to a codebase is crucial—and with notebooks, it’s very hard.

Git works by tracking the plaintext differences between file versions. With code, this results in a very readable experience, where you can easily visualize what is changing and how it impacts the software:

Notebook files, however, are essentially giant JSON documents that contain the base-64 encoding of images and binary data. For a complex notebook, it would be extremely hard for anyone to read through a plaintext diff and draw meaningful conclusions—a lot of it would just be rearranged JSON and unintelligible blocks of base-64.
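To make the diff problem concrete, here is a small sketch using only Python's standard library. It builds two notebook-shaped JSON documents that differ by one changed number in the code, then counts how many lines a line-based diff flags. The document layout is a simplified approximation of the .ipynb format, and the base-64 strings are made-up placeholders rather than real image data.

```python
import difflib
import json


def notebook_lines(source, execution_count, image_b64):
    """Serialize a minimal notebook-shaped document (real files carry more metadata)."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "source": [source],
                "outputs": [
                    {"output_type": "display_data", "data": {"image/png": image_b64}}
                ],
            }
        ],
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    return json.dumps(notebook, indent=1).splitlines()


def changed_lines(before, after):
    """Count the lines a line-based diff marks as added or removed."""
    return sum(
        1 for line in difflib.ndiff(before, after) if line.startswith(("+ ", "- "))
    )


# The only intentional edit: epochs=10 becomes epochs=20.
script_before = ["model = train(data, epochs=10)"]
script_after = ["model = train(data, epochs=20)"]

# Re-running the cell also bumps the execution counter and regenerates the chart.
notebook_before = notebook_lines(script_before[0], 3, "iVBORw0KGgoAAAANSUhEUgAAAAEAAAAB")
notebook_after = notebook_lines(script_after[0], 7, "iVBORw0KGgoAAAANSUhEUgAAAAIAAAAC")

script_changes = changed_lines(script_before, script_after)
notebook_changes = changed_lines(notebook_before, notebook_after)
print(f"script diff: {script_changes} lines, notebook diff: {notebook_changes} lines")
```

The same one-number edit touches more lines in the notebook version, and in a real notebook the image payload is typically a single line thousands of characters long.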

When you combine this with the frailty of complicated notebooks, where cells often need to be run in an arbitrary but precise order to generate the right result, it makes collaboration tricky.

For example, imagine you had an ETA prediction feature, and your pipeline relied on a complicated notebook to export a trained model. No one would be able to work on the notebook, as any small tweak might lead to invisible but cascading changes, such that your model performs poorly.

Trying to reverse engineer what changes caused the performance drop would be hopeless, both because of the unreadable nature of notebook diffs and because of the reproducibility problems mentioned earlier. Your pipeline would, in essence, have a “don’t touch it or it will break” sign on it.

With YAML, however, this problem is solved. There is no hidden state or arbitrary execution order in a YAML file, and any changes you make to it can easily be tracked by Git:

If one of those changes breaks your model, it’s both reversible and investigable.
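As a sketch of what that tracking looks like, the snippet below uses Python's `difflib` to produce the same kind of unified diff Git shows for a text file. The YAML keys and values are hypothetical and only stand in for a real configuration.

```python
import difflib

# Two versions of a small, made-up config file.
before = """\
name: classifier
predictor: predictor.py
config:
  threshold: 0.5
""".splitlines()

after = """\
name: classifier
predictor: predictor.py
config:
  threshold: 0.7
""".splitlines()

diff = list(difflib.unified_diff(before, after, "before.yaml", "after.yaml", lineterm=""))
print("\n".join(diff))

# Keep only the added and removed lines, skipping the two file headers.
edits = [
    line
    for line in diff
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
```

Here `edits` holds exactly one removed line and one added line, so a reviewer can see which value changed and revert it if the model degrades.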

As with the last example, there are some projects dedicated to making diffing and merging notebooks easier, but it seems like a lot of effort to emulate YAML’s default nature.

3. All code in your pipeline should be testable

Connected to both of the above points, most modern engineering orgs (hopefully) have a process for testing code. Typically, it looks something like this:

Engineers write tests before pushing any code.

PRs are automatically reviewed by CI/CD tooling.

A final manual review is given by another engineer.

As a result, every change to the codebase is made with a high level of confidence that it will not break things.

With notebooks, this is difficult.

Python unit testing libraries, like unittest, can be used within a notebook, but standard CI/CD tooling has trouble dealing with notebooks for the same reasons that notebook diffs are hard to read.

As a result, it’s hard to ship a new notebook to production with a high level of confidence that it won’t break anything—and if something does break, good luck figuring out why.

Applying CI/CD to YAML files and the code they reference, on the other hand, is straightforward. DevOps teams have been doing it for years.
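For instance, once the prediction logic lives in an ordinary Python file, a test suite for it can be as plain as the sketch below. The `predict` function is a hypothetical stand-in for whatever predictor.py actually contains, and the threshold parameter is an assumed configuration value.

```python
import unittest


def predict(payload, threshold=0.5):
    """Hypothetical stand-in for the logic in predictor.py: label a score."""
    score = payload["score"]
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return "positive" if score >= threshold else "negative"


class PredictTest(unittest.TestCase):
    def test_labels_scores_on_either_side_of_threshold(self):
        self.assertEqual(predict({"score": 0.9}), "positive")
        self.assertEqual(predict({"score": 0.1}), "negative")

    def test_threshold_can_be_configured(self):
        self.assertEqual(predict({"score": 0.6}, threshold=0.7), "negative")

    def test_rejects_out_of_range_scores(self):
        with self.assertRaises(ValueError):
            predict({"score": 1.5})


# Run the suite programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PredictTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```

In a CI job the same tests would more commonly be discovered and run with `python -m unittest`, with the job failing on a non-zero exit code.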

Production machine learning is an engineering discipline

We built Cortex specifically because we wanted to build things like Spotify’s “Made For You” playlist or Gmail’s Smart Compose. Our focus was not on designing new models, but on building a pipeline to turn models into products.

To do that, we needed to build an interface that allowed users to specify which code should be executed at what time, with which configuration.

YAML and notebooks are both tools for that purpose, in a sense. A notebook, at a very basic level, is just a bunch of JSON that references blocks of code and the order in which they should be executed.

But notebooks prioritize presentation and interactivity at the expense of reproducibility. YAML is the other side of that coin, ignoring presentation in favor of simplicity and reproducibility—making it much better for production.