What do you call a machine-learning-focused data scientist who tests, versions, and documents their code like a professional software engineer?

Unfortunately, I don’t know, because I’ve yet to meet one.

First, a bit of background. Continuous Integration (CI) was introduced by Grady Booch in 1991 and later adopted by proponents of Extreme Programming. CI enabled Extreme Programming teams to work faster by minimizing the time needed to integrate the contributions of multiple developers into one codebase. Today CI is the prevailing standard in software engineering.

Following the CI methodology, developers write automated tests that verify the code performs the functions it is supposed to. As more features are added, these automated tests give teams the confidence to move fast and to spot errors while they are still easy to fix.

Creating Machine Learning (ML) models involves a lot of coding, so my expectation was that ML teams also use modern software engineering practices like CI. I started asking other professional ML engineering teams what they used, and found that most of them used CI sparingly or not at all.

While most ML teams admitted to not using CI, most knew, at some level, that they should. They felt that the existing CI paradigm wasn't really working for them in ML.

I agree that CI principles do not directly translate into the world of ML. Data scientists and ML engineers are not writing code to a predefined specification, so it feels unnatural to write unit tests.

However, ML code is still code, and spotting bugs early and moving fast is still an important requirement.

In my mind, applying CI to ML should have two key goals:

To ensure the key bits of code are working correctly (reproducibility)

To see the progress we are making in our predictions (performance)

I call this adaptation of CI goals to ML continuous evaluation (CE).

The Continuous Evaluation framework

We used this framework quite effectively in one of our ML projects. We wrote a git hook that executed checks for reproducibility and performance on every commit.

Reproducibility. On every commit, we would re-run our key Jupyter notebooks to make sure that their output did not change (this was especially useful for our data preprocessing).

Performance. We wrote a daemon that monitored our shared models directory for new model files. The daemon automatically ran a battery of tests on each new model and recorded the results in both a Jupyter notebook and a CSV file of key metrics.

Finally, we created a simple dashboard interface to view the results for every commit and model file. We instituted a rule that everyone on the team link these results in their pull requests.

Using the CE framework transformed how engineers wrote their code, naturally drawing them to write wrappers around models so that each model file could be tested in the same way, with all the relevant meta-information (parameters, model type, etc.) stored in a way that is easily accessible in our dashboard.

The dashboard is accessible to the whole team: we now know which models we have, what they do, and that they are not broken.

The result was a clear increase in productivity: it became much easier to catch bugs when we merged code from various developers, to reuse code, and to communicate our findings both within the tech team and to business stakeholders.

Most importantly, we achieved this without stepping on developers' toes: they could still work as they liked locally, on their own machines, using whatever frameworks or methodologies they preferred.

What are your experiences with implementing software engineering practices into ML teams? Is there an opportunity for a general framework or is every project too different?

Feel free to share your thoughts and comments below!