Guest post by Ankit Bahuguna & Faheem Nadeem, Software Engineers, Cliqz GmbH

The premise of Kubeflow is that ML products are complex distributed systems involving multiple components working together. Credits: “Hidden Technical Debt in Machine Learning Systems”, Sculley et al.

CEOs and CTOs are being challenged by customers, analysts and investors to define how Artificial Intelligence and Machine Learning will impact their revenues and costs. Leading research and development organizations are quickly migrating to open source machine learning frameworks, especially those that take advantage of the operational and infrastructure efficiencies provided by containers, microservices and Kubernetes. This trend is reflected in a recent 451 Research survey, which found that over 70% of the enterprise organizations surveyed are using Kubernetes. GitHub hosts over 95M projects, and Kubernetes and TensorFlow are frequently among the top 10 in terms of contributors, discussions, forks and reviews. With the ever-increasing availability of data and compute power, machine learning is turning out to be a powerful tool for solving a wide range of problems and achieving state-of-the-art results. In such interesting times, Kubeflow has quickly grown into one of the most promising ML toolkits in the cloud native open source world.

We at Cliqz (a privacy-focused web browser with built-in web search, operational in ~7 countries) are also solving some of the most complex problems around user privacy and web search using self-managed Kubernetes (kops) on AWS. We started our cloud native journey in January 2017 and have been building web search solutions on Kubernetes ever since. Our Search-Recency system has been in production since December 2017, giving us near-real-time index updates and therefore the most recent and up-to-date search results. To solve this problem at scale, we rely heavily on Machine Learning, Natural Language Processing, Deep Learning and core Information Retrieval techniques, which led us to explore Kubeflow. We are currently evaluating Kubeflow as a general alternative to our custom ML workflow. Below we present some initial assessments and highlight how Kubeflow might work well for one’s k8s infrastructure:

Know Thy Users

It’s important to look closely at the target audience that would be interested in Kubeflow. Most organizations with an established infrastructure might be reluctant to even move to Kubernetes. For example, it took most teams a good amount of time to migrate to Terraform-based deployments, and because of that investment, switching to Kubernetes is sometimes not welcomed. Where the cloud native strategy already favors Kubernetes, Kubeflow becomes a good candidate for deploying and working with ML components. This brings to light the following types of teams who could potentially be interested:

- A team (within an organization) starting out on its cloud native journey with K8s, which might want to leverage the consistency Kubeflow offers for ML workloads in new projects.
- A very early stage startup that has started out with K8s as its base.
- Teams interested in ML at scale that want to ease the deployment of multiple existing services and reduce resource management overhead by switching to Kubeflow and K8s.
- Research teams and institutes that want to minimize the complexity of managing infrastructure for a data scientist or researcher, and instead provide a clean and consistent interface that sets things up in a few clicks.
- Teams interested in on-premise or multi-cloud deployments, where no existing service provides a consistent experience.

Consistency in Infrastructure

One of the greatest advantages of a k8s-based deployment is the consistency and the features offered out of the box. Oftentimes each new service ends up implementing the same fundamental requisites: monitoring, health checks, replication, etc. Kubeflow provides a native way to extend the same features to an organization’s ML needs. This is particularly useful for augmenting existing services without rewriting deployments from scratch. Having Kubeflow in the organization means one can focus more on the problem at hand and worry less about how to set things up and manage them over time.
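To make the point about health checks and replication concrete, here is a minimal sketch of how those requisites are declared once in a standard Kubernetes Deployment rather than re-implemented per service. The helper `deployment_manifest`, the service name, the image reference and the `/healthz` path are all hypothetical and chosen only for illustration; they are not part of Kubeflow or of our setup.

```python
import json


def deployment_manifest(name, image, replicas=2, port=8080, health_path="/healthz"):
    """Build a Kubernetes Deployment manifest as a plain dict.

    Hypothetical helper: the field names follow the apps/v1 Deployment
    schema, while the default values are illustrative assumptions.
    """
    labels = {"app": name}
    # The same HTTP check is used for liveness and readiness in this sketch.
    probe = {"httpGet": {"path": health_path, "port": port}}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,  # replication handled by Kubernetes
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            "livenessProbe": probe,   # restart if unhealthy
                            "readinessProbe": probe,  # gate traffic until ready
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical service name and image, for illustration only.
    manifest = deployment_manifest("model-server", "example.org/model-server:latest", replicas=3)
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with kubectl like any other manifest; the point is only that replication and health checking become declarative settings shared by every service.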

Multiple Use Cases

- A team researching a problem that can be solved with an ML technique can focus on the problem and not on the infrastructure. A Jupyter Notebook pinned to a GPU instance or a cluster abstracts this away cleanly. Several researchers can work on shared notebooks and use the same data backends instead of copying the data over to individual instances.
- The road to production for ML projects is simplified. The end-to-end solution offered by Kubeflow helps productionize an ML model quickly. A team of researchers can finish testing a model for accuracy in a Jupyter Notebook, build a continuous data pipeline with Argo to keep the model updated, and then test production workloads using Serving / Seldon.
- Katib can be a central solution for hyperparameter tuning across several applications. Hyperparameter optimization is one of the most underappreciated yet most important aspects of machine learning. Katib provides the groundwork to extend it to multiple applications and to keep a shared view of the tuning along with historical data. For example, Hyperopt is a Python library for such optimization, but its scope is largely limited to the project that uses it. In an organization where multiple teams and services are backed by ML, the common interface Katib provides can be used to apply more complex but powerful optimizations, which can significantly impact the product at large. Having such infrastructure in place also encourages more teams to try out solutions that benefit from this kind of optimization.
- With multiple frameworks supported (TensorFlow, PyTorch and MXNet), writing a distributed training or serving application (TFServing or Seldon) becomes a lot easier.
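The idea of a shared tuning view with historical data, mentioned for Katib above, can be illustrated with a small self-contained sketch. This is not Katib's API: it is a plain random search in standard-library Python, and `TuningHistory`, `random_search`, `toy_objective` and the application names are hypothetical. The objective is a stand-in for a real training run.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Trial:
    params: dict
    score: float


@dataclass
class TuningHistory:
    """Shared record of trials, keyed by application name (hypothetical)."""
    trials: dict = field(default_factory=dict)

    def record(self, app, trial):
        self.trials.setdefault(app, []).append(trial)

    def best(self, app):
        return max(self.trials[app], key=lambda t: t.score)


def sample(space, rng):
    """Draw one configuration: uniform for (low, high) tuples, choice for lists."""
    params = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            low, high = spec
            params[name] = rng.uniform(low, high)
        else:
            params[name] = rng.choice(spec)
    return params


def random_search(app, objective, space, history, n_trials=20, seed=0):
    """Run n_trials random configurations and log each one to the shared history."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        params = sample(space, rng)
        history.record(app, Trial(params, objective(params)))
    return history.best(app)


def toy_objective(params):
    # Stand-in for a real training run: highest at lr=0.1 and batch_size=64.
    return -(params["lr"] - 0.1) ** 2 - abs(params["batch_size"] - 64) / 1000


if __name__ == "__main__":
    history = TuningHistory()
    space = {"lr": (0.001, 0.5), "batch_size": [16, 32, 64, 128]}
    best = random_search("ranker", toy_objective, space, history)
    print(best.params, best.score)
```

Because every application writes to the same history object, a second team tuning a different model can inspect earlier trials instead of starting from scratch, which is the organizational benefit a project-scoped library does not give on its own.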

Onboarding Ease

A single cloud-independent platform makes it easier to onboard a new developer or researcher. One can provide task-based deployment templates, which can be scheduled easily on low-cost infrastructure instead of starting dedicated instances for test applications. Even for ML workloads, a researcher or research engineer can describe the use case without worrying about the underlying cloud deployment.

Secure and Better Control over Infrastructure

Moving an organization towards K8s helps standardize some processes. Not only can one make the infrastructure more secure, one can also achieve better control over it.