by Roman Seyffarth

Judging by the many 5-minute tutorials for bringing a trained model into production, such a move should be an easy task. However, there are many different libraries and products popping up lately, indicating that everyone – including tech giants – has different opinions on how to build production-ready machine learning (ML) pipelines that support today’s fast release cycles. So, not that easy after all? It’s actually quite hard for reasons I will point out. While you could invest in an all-in-one solution, it may be difficult to justify the costs in early adoption stages.

I invite you to join me as I go back to the drawing board and think about a sane approach to planning an ML pipeline that fits your organization’s needs. This blog post is part one of a series about bringing models to production. Today, we will look at the goals you may want to achieve with an ML pipeline, different technical approaches, and an example architecture. In upcoming posts, we will focus more on hands-on technical implementations. Chances are that you will pick up valuable orientation along the way for your own transition!

Why a machine learning pipeline is important

As machine learning receives more and more coverage in technical literature, an increasing number of companies are beginning to experiment with it and evaluate possible applications in their business domains. While these proofs of concept might yield promising results, there often remains confusion about how to integrate the resulting models into existing systems and processes.

However, reaching production – even only with an MVP – is the best way to gain internal or external attention and funding. To avoid reinventing the wheel for every new application, an efficient way is to build a central pipeline that defines a clear path for making trained models available to use, while still allowing flexible experimentation during model development. Such a pipeline handles many recurring tasks and manages compute resources so that data scientists and engineers are able to focus on the specifics of their current application instead of thinking about “housekeeping”. This most likely leads to faster release cycles.

Another challenge is the knowledge gap often present in organizations: data scientists lack engineering knowledge whereas engineers lack insight into model creation. A pipeline helps to clarify the process and defines clear areas of responsibility and artifact formats.

Desirable goals for a machine learning pipeline

It’s tempting to dive right into one of the many all-in-one solutions that exist out there. While an all-in-one solution can be a viable choice, I encourage you to first think about your organization’s needs and the skills of your ML team. Here are some important aspects to consider:

Lifecycle Coverage – Ideally the pipeline spans all the way from initial experiments to the deployment of a model. This makes onboarding and collaboration easier and each model follows a clear path.

Freedom During Experimentation – Standardized procedures (such as a pipeline) tend to not only introduce order and coherence but also annoy people if they are too restrictive. Make sure to involve your data scientists to determine an appropriate level of flexibility for experimenting and creating models. Settling on a specific ML library (e. g. TensorFlow) is more restrictive than settling on one programming language (e. g. Python). Using containers is even less restrictive. However, the more restrictive the choice, the better the pipeline can be optimized for the specifics of that library or language.

Tracking Experiments – Storing the model code, hyperparameters, and result metrics of every experiment is important to be able to discuss results, decide in which direction to go next or reproduce experiments. The simplest format for this could be a shared spreadsheet, but more sophisticated options are available – though they often come with additional requirements for the rest of the pipeline. As an example, here is a screenshot from MLflow’s experiment tracking server UI:

Automation – Even when a lot of steps are performed manually in the beginning, make sure that all steps can be automated later on via APIs or certain tools. Repeated manual tasks are boring and error-prone.

Model and Code Versioning – Versioning the model code is essential for reproducing training runs. The pipeline should also be capable of managing versioned artifacts for at least every model that is supposed to be deployed in production. This is necessary for rollbacks and A/B tests in production.

Model Testing – Besides general accuracy metrics acquired during model training, testing a packaged model’s general serving ability as well as typical predictions can improve the release/deployment quality.

Scalability – If you invest in building an ML pipeline, you may want to eliminate the need to rebuild it just because it can’t handle increasing volumes of data or the demands of a growing team of data scientists. Make sure that the pipeline is scalable by choosing building blocks that are scalable themselves.

Security – The pipeline should adhere to current security standards regarding disk and transfer encryption as well as access authorization, especially when the training data and model are considered to be sensitive.

Monitoring – After a model has been deployed for serving, its usage and performance should be monitored to ensure proper operation, enable dynamic scaling based on load, or even use this data to improve future model versions.

Designing a pipeline

With our goals in place, we are now able to design a pipeline that satisfies them. In the following I will assume that we want to deploy our models as servables, exposing an API. For other deployment targets (e. g. mobile on-device predictions), the build process would need to be modified accordingly. Also, I will not get into details on data sources for model training, since they are not directly related to model deployment. Here is a tech-agnostic version of the pipeline which highlights model-related data flow using yellow arrows:

Just like we wanted, the pipeline spans all the way from model training experiments to serving deployed models in production. Data scientists are able to experiment using interactive notebooks and write training jobs (1) while their code is being versioned in a source code repository (2). Ideally, they are able to execute their training jobs inside a compute cluster (3) for speeding up demanding tasks like automatic hyperparameter optimization.

All completed training jobs push the trained model along with performance metrics, used hyperparameters, dataset information, and the code revision/commit hash into a separate store (4). This store should make searching and comparing different models as easy as possible. If an organization has distinct scientist and engineer roles, this store could function as a clear interface between them.
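To make the idea of such a store more concrete, here is a minimal sketch using only the Python standard library. It is not how any particular tool works: the directory layout, the function names `record_run` and `best_run`, and all values (parameters, metrics, dataset name, commit hashes) are made up for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def record_run(store_dir, model_bytes, params, metrics, dataset_id, commit_hash):
    """Write a trained model and its metadata into a file-based store."""
    # Derive the run ID from the model content and the commit hash, so the
    # same model built from the same code always maps to the same ID.
    digest = hashlib.sha256(model_bytes + commit_hash.encode("utf-8")).hexdigest()
    run_id = digest[:12]
    run_dir = Path(store_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / "model.bin").write_bytes(model_bytes)
    metadata = {
        "run_id": run_id,
        "params": params,
        "metrics": metrics,
        "dataset": dataset_id,
        "commit": commit_hash,
    }
    (run_dir / "run.json").write_text(json.dumps(metadata, indent=2, sort_keys=True))
    return run_id


def best_run(store_dir, metric):
    """Return the metadata of the run with the highest value for `metric`."""
    runs = [json.loads(p.read_text()) for p in Path(store_dir).glob("*/run.json")]
    return max(runs, key=lambda run: run["metrics"][metric])


# Illustrative values only; a real training job would pass its own results.
store = tempfile.mkdtemp()
record_run(store, b"model-a", {"lr": 0.1}, {"accuracy": 0.81}, "dataset-v1", "abc123")
record_run(store, b"model-b", {"lr": 0.01}, {"accuracy": 0.87}, "dataset-v1", "def456")
print(best_run(store, "accuracy")["params"])  # {'lr': 0.01}
```

Keeping the commit hash and dataset identifier next to the metrics is what allows a run to be reproduced later, and a plain JSON file per run is already enough to search and compare a small number of experiments.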

Once candidates from the model store are ready to be moved to production, a build server (5) tests the model for typical or production-critical cases and packages it as a deployable and uniquely versioned artifact (e. g. executable binary or container image). The artifact gets stored in an immutable repository/registry (6). Then, the build server may also deploy a model automatically by installing the artifact on one or multiple target machines or triggering containers to be run based on the newly created image (7). The entire build configuration of each model can be placed in the same source code repository as the training code.
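The test-and-package step could look roughly like the following sketch. The `predict` function is a hypothetical stand-in for a real trained model, and the test cases, the artifact name, and the tag format are assumptions chosen for illustration, not a convention of any build server.

```python
import hashlib


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = [0.5, -0.25]
    return sum(w * x for w, x in zip(weights, features))


# Production-critical cases the build must pass: (input, expected, tolerance).
SMOKE_CASES = [
    ([2.0, 4.0], 0.0, 1e-9),
    ([4.0, 0.0], 2.0, 1e-9),
]


def smoke_test(model_fn, cases):
    """Return all cases where the prediction misses the expected value."""
    failures = []
    for features, expected, tolerance in cases:
        actual = model_fn(features)
        if abs(actual - expected) > tolerance:
            failures.append((features, expected, actual))
    return failures


def artifact_tag(name, model_bytes, commit_hash):
    """Build a unique, reproducible tag from the commit and model content."""
    digest = hashlib.sha256(model_bytes).hexdigest()[:8]
    return f"{name}:{commit_hash[:7]}-{digest}"


failures = smoke_test(predict, SMOKE_CASES)
if failures:
    # A failing build must never produce a deployable artifact.
    raise SystemExit(f"Refusing to package model, failed cases: {failures}")
tag = artifact_tag("example-model", b"serialized-model", "abc1234def")
print(tag)
```

Deriving the tag from both the commit hash and the model content means that retraining with the same code still yields a distinguishable artifact, which is what makes rollbacks unambiguous.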

By gathering feedback from users of the model, utilizing A/B testing, and collecting monitoring metrics, we can continuously improve the model and manage the lifecycle of individual versions (8).
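One common way to run an A/B test between two model versions is to assign each user to a version deterministically, so that the same user always hits the same model. The following is a minimal sketch of that idea; the function name, the version names, and the 90/10 split are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, variants):
    """Map a user to a model version, deterministically and by weight."""
    # Hash the user ID to a stable number in [0, 1).
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 16**8
    cumulative = 0.0
    for version, weight in variants:
        cumulative += weight
        if bucket < cumulative:
            return version
    return variants[-1][0]  # guard against floating-point rounding


# Send roughly 10% of users to the new version.
VARIANTS = [("model-v1", 0.9), ("model-v2", 0.1)]
counts = {"model-v1": 0, "model-v2": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", VARIANTS)] += 1
print(counts)
```

Because the assignment depends only on the user ID, it needs no shared state and works across any number of serving instances.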

If you look back on our goals, we have tackled a lot of them with this pipeline. We are versioning model code, build server configuration and released models. We are also tracking all experiments while still retaining freedom during experimentation. Scalability, automatability, and security hugely depend on the properties and capabilities of chosen technologies. However, scaling models in serving is often a trivial task, since they usually do not contain any runtime state. Running multiple instances of the same model with a load balancer in front is usually enough.

Choosing the right building blocks

After all these generic and theoretical plans, let’s talk tech. The next step is to cover every aspect and step of this pipeline using technical solutions. This is no easy task since they have to be scalable to your current and future needs, interoperate with each other and lie within your budget. Take your time, evaluate different combinations, and get a very good understanding of what knowledge already exists in your organization and which technologies your team is motivated to work with.

I will not talk about different solutions to host source code repos or run build servers. However, you can see valid choices for that in my example later on. Let’s first focus on the most popular ML-specific components.

Open Source platforms

Platform solutions span multiple stages of the ML pipeline, which results in good coherence, but each one dictates a certain way to work.



MLflow – Offers an experiment store, model serving, and supports Python-based training using all major frameworks. Makes heavy use of Anaconda. Worth a look if you are using Python and a variety of frameworks. In my opinion the most versatile open source platform solution with a very good UI to browse experiment data.

KubeFlow – Offers Jupyter notebooks, training, hyperparameter tuning, experiment store, and model serving – all hosted on Kubernetes. Focused on, but not limited to TensorFlow. Worth a look if you are using TensorFlow and Kubernetes.

Apache PredictionIO – Offers training, hyperparameter tuning, experiment store, and model serving. Works with Spark-based models. Includes an “event server” for collecting events as training data. Worth a look if you use Spark and find the concept of an included event server a good fit for your use cases.

Cloud platforms

Every major cloud provider offers a managed ML platform among their services that spans from notebooks to serving models. Using them can be beneficial when most training data already lies in the cloud or training the models requires a lot of computing power while the organization lacks suitable resources.



Google ML Engine – Offers TPUs, which can be even more powerful than GPUs when training complex models. Lacks broad framework support (full support only for TensorFlow and scikit-learn) and a comfortable experiment store. With the “AutoML” products, models can be trained for certain problems without requiring ML knowledge or programming.

Amazon SageMaker – In my opinion the most versatile cloud solution with a wide variety of supported frameworks. Offers additional services such as “Ground Truth” for labeling datasets, as well as pre-built algorithms.

Azure ML Service – Framework support and workflow comparable to Amazon’s SageMaker. “Machine Learning Studio” enables you to build models using a graphical editor.

Fully custom and Open Source

Of course there is the option to build a completely custom solution. This can be beneficial if platform solutions are too much overhead or not flexible enough. There may also be a lot of open-source expertise or affinity in the organization.



scikit-learn – A lightweight and easy-to-use library containing a variety of different ML algorithms, metric evaluations and visualizations.

TensorFlow + TFX – The most popular framework for neural networks while also supporting other algorithms and custom compute graphs. Can operate in a highly distributed setting. TFX adds functionality for production use, such as a generic model server and consistent feature preprocessing during training and serving.

Keras – A library that offers a high-level, user-friendly API for creating neural networks. Needs either TensorFlow, Theano or CNTK under the hood as its “backend”. Keras will be the official standard high-level API for TensorFlow 2.0+ and comes already packaged with it.

PyTorch – A distributed framework for neural networks with capabilities similar to TensorFlow’s, but slightly easier to use and learn. Because it is newer and less popular, its ecosystem and resources are not as extensive.

Apache Spark – A JVM-based (Scala/Java) big data processing framework with a Python API and a collection of ML algorithms (MLlib). Can be executed locally, in a Spark cluster or on Hadoop. Has especially powerful preprocessing capabilities.

Sacred – A framework-independent tool for storing experiment metrics. Typically uses MongoDB as data store. The web UI Omniboard makes stored experiments browsable and searchable.

Ray Tune – A distributed hyperparameter tuning library with a variety of different optimization techniques. Does not depend on a specific ML framework.

hyperopt – Lightweight, framework-independent hyperparameter tuning library.

The list above is by no means complete, but it surely contains the popular choices in the respective categories and is a good way to begin evaluating. Of course, writing components on your own is always an option: I’ve seen multiple times that organizations have built their own GUI to abstract away complexity.

Example: Open Source ML pipeline using Python

Here is a possible pipeline that consists entirely of either free or open source components and settles on Python as the common denominator for all models.

Each model project has its own repository in a self-hosted GitLab. Since we decided that all models will be created using Python, we have a convention that the repository has to contain a Pipfile (created using pipenv) describing the desired Python environment. Our data scientists work on the model code by experimenting using Jupyter notebooks and running training jobs locally. The training jobs use Sacred to write information about each training run (Git hash, parameters, metrics, etc.) into a MongoDB database. The resulting model gets transferred to a central Ceph file store and referenced in Sacred’s run information. The experiments can be searched and compared via the web interface of a hosted Omniboard instance. Remote execution highly depends on the framework, but we have a GPU-enabled Kubernetes cluster that can be used for accelerated and distributed training.

Our data engineers configure build pipelines in our Jenkins-based build environment that create Docker images from a given model in our model store. The image wraps the model and exposes it via an API. The build pipelines are usually different for each model, so their configuration files and all files required for building the Docker image are checked into the code repository as well. Jenkins also tests the image, pushes it to a self-hosted Docker registry and deploys it in our production Kubernetes cluster.
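What “wrapping the model and exposing it via an API” means can be sketched with nothing but the Python standard library. This is an illustration under assumptions, not our actual serving code: `predict` is a hypothetical stand-in for the packaged model, and the `/predict` path and the JSON payload shape are made up. A production image would more likely use a proper web framework or a dedicated model server.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for the packaged model."""
    return sum(features) / len(features)


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = {"prediction": predict(payload["features"])}
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 lets the OS pick a free port; a real image would use a fixed one.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Call the API once, the way a client or an image test would.
connection = http.client.HTTPConnection("127.0.0.1", server.server_port)
connection.request(
    "POST",
    "/predict",
    body=json.dumps({"features": [1.0, 2.0, 3.0]}),
    headers={"Content-Type": "application/json"},
)
result = json.loads(connection.getresponse().read())
connection.close()
server.shutdown()
server.server_close()
print(result)  # {'prediction': 2.0}
```

The handler keeps no state between requests, which is why several instances of the same image can run behind a load balancer without further coordination.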

Key takeaways

We have seen that building an ML pipeline is not an easy task. There are countless building blocks to choose from in a field that is constantly changing. Hopefully, I was able to give you a good overview and a foundation to start planning a pipeline for your organization. I want to leave you with some key takeaways:



Involve as many people as possible in the planning phase – both scientists and engineers – since their adoption of the pipeline is crucial.

Existing expertise could be a factor when choosing technologies. Also, find out which technologies your team is motivated to work with.

Make sure that your pipelines and the components involved are scalable enough to handle your organization’s ML demands for the foreseeable future.

A well-crafted ML pipeline enables fast iterations on models and brings them into production. This can be a huge advantage if you have the need for fast release cycles and the amount of data and feedback to support it.

One particular technical challenge often faced and highlighted is how to ensure that the same data transformations that are applied during training are also applied on prediction input. This includes constants that were computed during training (e. g. for normalization). Consider this early on in your planning phase. If you are using TensorFlow, for example, take a look at TensorFlow Transform which addresses this challenge.
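The challenge of keeping transformations identical between training and prediction can be illustrated with a small sketch. It does not use TensorFlow Transform. It only shows the underlying idea in plain Python, with made-up data and hypothetical function names: compute the constants once during training, ship them with the model, and reuse the same transformation function when serving.

```python
import json


def fit_normalizer(rows):
    """Compute per-column mean and standard deviation on the training data."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = []
    for col, mean in zip(columns, means):
        variance = sum((x - mean) ** 2 for x in col) / len(col)
        stds.append(variance ** 0.5 or 1.0)  # fall back to 1.0 for constant columns
    return {"means": means, "stds": stds}


def normalize(row, constants):
    """Apply the stored constants; used unchanged in training and serving."""
    return [
        (x - mean) / std
        for x, mean, std in zip(row, constants["means"], constants["stds"])
    ]


# Training time: fit the constants once and ship them with the model.
training_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
constants_json = json.dumps(fit_normalizer(training_rows))

# Serving time: load the shipped constants instead of recomputing them.
constants = json.loads(constants_json)
print(normalize([2.0, 20.0], constants))  # [0.0, 0.0]
```

The important design choice is that the serving side never sees the training data: it only receives the serialized constants, so the two code paths cannot drift apart.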

More on the topic

Thank you for staying with me all the way! Stay tuned for more blog articles of this series, where we will dive deeper into technical details by showing hands-on examples. If you are interested in German e-learning content on machine learning, make sure to visit codecentric.ai.