During my professional life, I have always focused on the end goal of things, and that goal has usually been bringing more money to my clients. Maybe it is because I read “The Goal” and was inspired by the same simple approach that also shaped “The Phoenix Project”. Anyhow, I have always known that whether we are creating a machine learning model, implementing a chatbot, setting up a data pipeline, or managing a team, we have to bring the result to production ASAP!

Working as the Head of Solutions Engineering at LINKIT, I often visit clients that struggle with the same issue: they have plenty of Data Scientists, but the few models that are relevant take a long time to get into production. Why? And how can we help?

First, we had to identify the main issue. It was clear that the vast majority of Data Scientists focus too much on the inner cogs of algorithms (not that those are unimportant, don’t get me wrong) and forget the most important thing: having the right data, at the right time, in the right location.

Unless you work for an AI startup or a giant like Uber, Facebook, or Google, your problems can probably be modeled by the algorithms already implemented in Keras, PyTorch, Theano, TensorFlow, MXNet, … A lot of a company’s value can be captured by building a scalable and reliable data pipeline that ends in a Generalized Linear Model or a Random Forest. What I tell everyone on my team is that you are not paid just to code or to find new algorithms; you are paid to generate money with your expertise.

I have seen companies with dozens of Data Scientists building different models, doing fantastic work that ends in amazing run-once models that take 3–12 months to see the light of a production deployment. Also, the fact that machine learning development (in general) revolves around hyperparameter tuning and data pipelines does not mean that we need to reinvent the wheel or look for an entirely new way to do this. The market and the DevOps movement lay a strong foundation: a culture change that supports experimentation, continuous evaluation, sharing, abstraction layers, observability, and working on products and services.

“The Data Engineering and Data Scientist debate, to me, is like professional racing. You should not hire 30 drivers and one engineer; the ratio should be inverted. You should have 2 drivers in a team to drive the 2 cars while 30 engineers make sure that the car is suitable for the driver to shine. Data Scientists are the drivers of the cars that the Data Engineers build!”

Photo by JD Hancock, showing the importance of Data in development… :-P

Deterministic functions vs. ML Development

Traditional software development relies on if/else branches and deterministic functions, which cover the vast majority of cases in the industry. The tools we built to support the ecosystem, the way we debug and test… all of it was designed with those use cases in mind. A different perspective must arise for ML applications!

As a direct result of the DevOps culture, data science practitioners must also absorb many of the industry’s gains from the last years: deployable artifacts, observability, a sharing culture, experimentation and failure at its core, and working on products and services. Another point, one that earns me bitter looks from developers, is that you need to experience the “hell of operations”: being on call because of your model, your API. This will improve the quality of the code and products that emerge from these experiments.

Photo by JD Hancock showing how you can make a mistake and call AI & ML the same thing… :-P

AI, ML, and the hype confusion

My definition of AI is not the truth, nor the only one; far from it. I usually point this out at the beginning of talks to establish common ground with the audience, because I have noticed that ML, AI, data science, and deep learning are sometimes used interchangeably.

Artificial Intelligence is making computers capable of doing things that, when done by a human, would be thought to require intelligence.

Machine learning is one of the ways of building the intelligence part: making sure that these machines can find patterns without being explicitly programmed to do so.

To make it less vague, especially on the ML part, we can imagine the following catastrophe. If you wanted to create a rule to identify cats in pictures, you could try to do it with traditional if/else clauses and rules:

Digital pictures are composed of RGB values at every pixel, each channel varying from 0 to 255.

When the 4th pixel has an RGB value of (24, 42, 255) and the 5th pixel is (28, 21, 214) and blah blah blah: it is a cat!

Can you imagine all the if/else possibilities you would have to write? Practically infinite! Also, when you get a new picture that you have never seen before, would any of your clauses catch it? This approach makes the world extremely binary and brittle!

Machine learning gives us algorithms that estimate the probability that a picture contains a cat. These algorithms are based on statistical learning (a process in which you try to find a predictive function based on the data you have) and can be trained on data.
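To make the contrast concrete, here is a deliberately tiny sketch. The “images” (3 pixels each), the brightness assumption, and both classifiers are invented for illustration only: one hand-written if/else rule like the one above, versus a one-parameter statistical learner that picks its threshold from the data.

```python
import random

random.seed(0)

# Toy stand-in for images: 3 pixel intensities in 0-255. We assume,
# purely for illustration, that cat pictures are brighter on average.
def make_image(is_cat):
    base = 180 if is_cat else 60
    return [min(255, max(0, base + random.randint(-40, 40))) for _ in range(3)]

train = [(make_image(y), y) for y in [1, 0] * 50]
test = [(make_image(y), y) for y in [1, 0] * 20]

def accuracy(clf, data):
    return sum(clf(img) == y for img, y in data) / len(data)

# Rule-based attempt, like the if/else clause in the text: it only
# fires on one exact pixel pattern, so it generalizes to nothing.
def rule_based(img):
    return 1 if img == [24, 42, 255] else 0

# "Statistical learning" in miniature: scan for the brightness
# threshold that best separates the training examples.
def train_threshold(data):
    best_t, best_acc = 0, 0.0
    for t in range(256):
        acc = accuracy(lambda img, t=t: int(sum(img) / 3 > t), data)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = train_threshold(train)
learned = lambda img: int(sum(img) / 3 > t)

print(f"rule-based accuracy: {accuracy(rule_based, test):.2f}")
print(f"learned (threshold={t}) accuracy: {accuracy(learned, test):.2f}")
```

On data it has never seen, the hand-written rule matches nothing, while the learned threshold generalizes, which is the whole point of letting the data pick the parameters.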

Photo by JD Hancock showing that you can be ahead and overcome some ML lifecycle challenges… :-P

ML Lifecycle challenges

The ML lifecycle has the same problems that we already solved for traditional development, but seen from a different perspective: version control, packaging, deployment, collaboration, and serving. The main issue is that we are trying to force the solutions we used before in software development onto this ecosystem.

In the last years, we have seen a significant increase in the number of products (especially open source) trying to solve ML’s development lifecycle: Spark running on top of Kubernetes, Kubeflow, MLflow, and cloud providers offering tools for training and serving models. We are witnessing the maturation of this ecosystem, and it’s a growing environment.

What I tell everyone is: there is no silver bullet, and the tools are there. Cultural change and optimizing the way you work to be more productive are, by far, the hardest things to do!

Photo by JD Hancock — ML bugs are not what we are used to… :-P

The ML bugs

Concerning bugs, it is important to keep an eye on the three kinds specific to machine learning: Bias, Drift, and Fragility.

Bias comes from the prejudice present in the datasets used to build the feature and can have catastrophic results, mainly when used in black-box-like models (hello, Deep Learning!). Amazon scrapping its ‘sexist AI’ tool is one example of an ML model that passed all the tests in a company known for its high-quality engineering. While the tool tried to filter and recruit software engineers, the data was biased because we are part of an industry where women are a largely underrepresented group. That meant the algorithm disfavored women applicants, since it had few such cases in its training set. The same can happen in mortgage scoring systems and many other businesses. Weapons of Math Destruction, a book from 2016, raised many of these problems with algorithms making crucial decisions about hiring, classifying people, and more; it is a great read!

Drift occurs after models are built, working well, and deployed. You may consider the job over and that nothing else is needed, right? Unfortunately not. The model must be recalibrated and resynced according to usage and incoming data to keep its accuracy. Otherwise, it will drift and get worse over time.
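A minimal sketch of what watching for drift can look like, with made-up numbers: compare a feature’s distribution at training time against what the deployed model actually sees, and alert when they diverge.

```python
import random
import statistics

random.seed(1)

# Illustrative only: the feature, the distributions, and the threshold
# are all invented for this sketch.
train_feature = [random.gauss(50, 10) for _ in range(1000)]  # at training time
live_feature = [random.gauss(65, 10) for _ in range(1000)]   # in production

def drift_score(reference, current):
    """Shift of the mean, in units of the reference standard deviation."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std

score = drift_score(train_feature, live_feature)
ALERT_THRESHOLD = 1.0  # hypothetical; tune per feature
if score > ALERT_THRESHOLD:
    print(f"drift detected (score={score:.2f}): time to recalibrate")
```

Real setups usually use richer distribution tests than a mean shift, but even a check this simple turns “the model silently got worse” into an alert you can act on.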

Fragility may be associated with bias, but it is more related to changes outside of the team’s reach. A change in a definition, data that becomes unavailable, a null value that should not be there… how does your model cope with these issues? How fragile is it?
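One cheap defense against that kind of fragility is a guard at the model’s boundary. This is a sketch with invented field names: reject malformed records before they reach the model, instead of letting a stray null crash the service or silently skew the scores.

```python
# Hypothetical input schema for a scoring model; field names are made up.
EXPECTED_FIELDS = {"age": float, "income": float}

def validate(record):
    """Return a cleaned record, or raise before it ever reaches the model."""
    clean = {}
    for field, ftype in EXPECTED_FIELDS.items():
        value = record.get(field)
        if value is None:
            raise ValueError(f"missing or null field: {field}")
        clean[field] = ftype(value)  # also catches wrong types early
    return clean

print(validate({"age": 34, "income": 52000.0}))
try:
    validate({"age": 34, "income": None})  # the null that "should not be there"
except ValueError as err:
    print(f"rejected before scoring: {err}")
```

Failing loudly at the boundary is usually better than a model that quietly produces garbage for inputs it was never trained on.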

The worst part is that the majority of these ML bugs are hard to identify before production. For Bias there are plenty of techniques, but they are rarely applied because they are seen as a source of slowness. That is why monitoring and observability, two other pillars of DevOps, play a gigantic role in machine learning components.

You must measure proxies that capture the business value your ML components are supposed to impact. For example, if you created a recommendation engine and are applying an A/B test strategy for the roll-out, look at the variation in spend between the two groups. Or maybe you have shipped an image tagging component: are people using the features around it? You cannot track ML components directly, but you can analyze proxy measures for them. These kinds of metrics, and a focus on measuring, can help you detect the ML bugs early on: bias, drift, and fragility.

Photo by JD Hancock — bridge the gap between ML models and IT Ops

The distance between Data Scientists and IT Operations

The same problem that affected (and still affects) the business world and “gave birth” to the DevOps movement is also one of the biggest problems that Data Science teams face every day: a distance between the business and the actual industrialization/operationalization of what is built.

This gap is the result of three things: slowness (ideas taking a gigantic amount of time to flow to production), lots of handovers (X talks to the client, A writes the user story, B builds, C validates, D approves, E deploys, F operates, G fixes bugs, H rebuilds, …), and siloed teams working on projects instead of products (the Accelerate book is a great read on this!). This distance becomes more explicit and more evident as organizations start to change the way they approach the software development and delivery lifecycle. Why? Because we can see a lot of organizational wins in adopting new practices and tools, and in doing the hardest thing: changing the culture.

Improving sharing and collaboration is not easy; it is the hardest thing any organization can do: change the culture. In the case of ML engineers and data scientists, several cultural aspects have an impact, but the most compelling one I have seen is related to the background of the professionals.

The majority of them have a very academic background, meaning that they are used to spending long periods working on one problem until it is good enough to be accepted for publication. The bar for “good enough” there is extremely high, not just on the metrics but also on the design of the experiments, mathematical rigor, and so on. In a business context, this matters, but less so. It is OK to publish a model with 40% accuracy and have it in a deployable state. It is better to have that ready, and to consider putting it in production today, than to wait months for something “good enough”. Maybe in three months it will no longer be a problem worth solving. Moving fast with flexible possibilities is the best way to go.

Photo by JD Hancock — move faster and have a CI/CD/CE pipeline for your ML models

What should you do to gain (more) value (faster) from ML?

Training data scientists and generating value from ML techniques to build AI applications are extremely hard. To make it viable and fun, and to attract these professionals, we must change the culture around it. It is hard to design a path to this “optimal culture”, as every company has its own ways, and interactions are hard. Some cultural characteristics I have seen that support a short time-to-market and can help ML projects generate value include: