TLDR:

Hyperkops is a technology that enables the use of Hyperopt (a Python Bayesian optimisation library) in Kubernetes. This provides a scalable platform in which data scientists and engineers can share infrastructure to estimate the optimal set of hyperparameters for their machine learning models.

Introduction

We introduce technology that enables Bayesian optimisation of machine learning model hyperparameters using Hyperopt, a Python-based Bayesian optimisation library.

Bayesian optimisation is a sequential learning approach where iterative fitting of models is combined with Bayesian inference to arrive at an estimate of the best possible hyperparameters. Hyperkops uses the Python library Hyperopt to execute this type of optimisation, and scales the parallel execution of these calculations by deploying them on Kubernetes. It exploits Hyperopt's ability to run calculations asynchronously by storing state in MongoDB, alongside Kubernetes' ability to automatically scale our hardware provision to match current requirements. Thus, by running these calculations within Kubernetes, we can create a system which autoscales to our needs and can be used by multiple tenants at the same time.

The Kubernetes environment also allows us to connect to this architecture from within the Kubernetes cluster as well as from our local development machines. This means this architecture can provide mass parallelism for our scheduled fitting of production models at the same time as providing a tool to our data scientists for exploration from their local machines.

Motivation

At hipages we find ourselves creating lots of models, and lots of different versions of those models with slightly different applications. We’re also finding that choosing the correct hyperparameters for these models can have a significant effect on performance.

We tried a few different ways of searching our hyperparameter space:

Grid search — Iterating through a fixed grid of hyperparameters. This is very computationally expensive and time-consuming, with significant time wasted fitting models in parts of the hyperparameter space that are likely to yield underperforming candidates.

Random search — Randomly sampling hyperparameters from a bounded space. We found this to be effective, but because of the stochastic nature of the hyperparameter selection in each fit, the final selection is likely sub-optimal, and, as with grid search, a lot of computation is spent on low-performing models.

Bayesian search — An iterative method where each fit is targeted at improving the model's performance, with the updated hyperparameters selected using information from previous fits. We found this improved performance significantly, but it also takes longer due to the sequential nature of the approach.
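As an illustrative-only sketch (plain Python, not Hyperopt), the first two strategies might look like this, with a simple quadratic standing in for a real fit-and-score step:

```python
# Grid search evaluates every point of a fixed lattice; random search spends
# the same budget on points sampled uniformly from the bounded space.
import random

def loss(lr, depth):
    # Stand-in for "fit a model and return its validation loss".
    return (lr - 0.1) ** 2 + (depth - 6) ** 2

# Grid search: every combination of a fixed set of candidate values.
grid = [(lr, depth) for lr in (0.001, 0.01, 0.1, 1.0) for depth in (2, 4, 6, 8)]
best_grid = min(grid, key=lambda p: loss(*p))

# Random search: the same budget of evaluations, sampled from the bounds.
rng = random.Random(0)
samples = [(rng.uniform(0.001, 1.0), rng.randint(2, 8)) for _ in range(len(grid))]
best_random = min(samples, key=lambda p: loss(*p))
```

Both strategies spend their entire budget blindly; Bayesian search instead uses the losses already observed to decide where to evaluate next.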

In our case model performance was key so we decided to forge ahead using Bayesian search methods. For this task we selected the Python library Hyperopt as our Bayesian optimisation tool.

To speed up our fits we needed to parallelise the workflow and have multiple trials fitting at once. Hyperopt parallelises its workload by storing state within a MongoDB instance and having the actual fits done by workers. The workers poll the MongoDB instance and, when a trial is available (a trial here being a model fit with the provided hyperparameters), execute the workload and return the scoring metric for that set of hyperparameters. This separation of workloads lends itself well to deployment on Kubernetes.
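The poll-execute-return loop can be sketched in plain Python (stdlib only); here a dict and a queue stand in for the MongoDB collection Hyperopt actually uses, and the state names are illustrative rather than Hyperopt's exact schema:

```python
# Stdlib-only sketch of the worker loop: claim a trial from the shared
# store, run it, write the score back.
import queue

job_store = {}        # job_id -> {"params": ..., "state": ..., "loss": ...}
todo = queue.Queue()  # stand-in for "poll MongoDB for new trials"

def submit(job_id, params):
    job_store[job_id] = {"params": params, "state": "NEW", "loss": None}
    todo.put(job_id)

def worker(objective):
    # Each worker repeatedly claims a trial, executes it, and writes the
    # resulting score back to the shared store.
    while True:
        try:
            job_id = todo.get_nowait()
        except queue.Empty:
            return  # a real worker would keep polling
        job = job_store[job_id]
        job["state"] = "RUNNING"
        job["loss"] = objective(job["params"])
        job["state"] = "DONE"

# Submit three trials and let one "worker" drain them:
for i, x in enumerate([0.5, 3.0, 7.0]):
    submit(i, x)
worker(lambda x: (x - 3.0) ** 2)
```

Because workers only ever talk to the shared store, any number of them can run in parallel, which is exactly the property Kubernetes exploits.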

Deployment on Kubernetes also comes with lots of other benefits:

Autoscaling: We can automatically scale up and down the number of workers we have available to us based on specific metrics (e.g. CPU usage). So when we have multiple hyperparameter optimisations running in parallel the cluster will automatically scale up to meet our needs, and then scale back when we're done.

Production Fitting: Once we've deployed our fitting infrastructure we can connect from both inside and outside the Kubernetes cluster. In our case this means we can automate and schedule the fitting and export of our final model from within Kubernetes alongside our usual ETL workflows.

Use in Discovery Phases: During discovery a data scientist might want to fit multiple models to get a feel for a model's performance. By deploying our system on Kubernetes they can dial up mass parallelisation of the models by connecting to the cluster and submitting jobs from their local machines. This means there is no need to closely manage expensive cloud-compute instances. The reduced costs keep DevOps happy, and can significantly reduce our time-to-insight during discovery.
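For the autoscaling benefit above, a hypothetical HorizontalPodAutoscaler for the worker Deployment might look like the following; the resource names and thresholds are assumptions to be adapted to your own deployment:

```yaml
# Illustrative only: scale the worker Deployment on CPU utilisation.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hyperkops-worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hyperkops-worker   # assumed name of the worker Deployment
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

CPU utilisation is a reasonable proxy here because workers are busy exactly when trials are being fitted, so the replica count tracks the optimisation load.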

Hyperkops

With these benefits in mind we set out to deploy Hyperopt within our own Kubernetes cluster, but found that we needed to add a few extra components to get our hyperparameter searches to finish. The result of this work is our Hyperkops architecture, depicted below:

Hyperkops Architecture

The Hyperkops architecture consists of four main components:

Hyperkops Worker: Each of these Pods runs a single hyperopt-worker, which polls MongoDB for trials to execute and returns the score generated with that trial's hyperparameters. In our context a trial is fitting a single machine learning model with the provided hyperparameters and returning its performance metric.

Hyperkops Monitor: Identifies Hyperopt trials stranded by Pod failure or rotation and marks them as failed.

MongoDB: The MongoDB instance that holds the state of all trials.

Hyperopt Fitting Master: These are the processes which launch the Hyperopt optimisations.

Why is the Hyperkops Monitor required?

In Kubernetes the Pods which execute the hyperopt-workers can be significantly shorter-lived than some optimisation jobs and are expected to be rotated on a regular basis. If a Pod is deleted whilst executing an experiment, the hyperopt-worker is killed before it can emit an error signal, and its job remains in MongoDB indefinitely in a JOB_RUNNING_STATE. Hyperopt waits until all of its running jobs have finished before selecting the best possible set of hyperparameters, so jobs trapped in a JOB_RUNNING_STATE, with no way of concluding, leave our optimisation running in perpetuity.

We therefore introduced an extra component (the Hyperkops Monitor) to monitor our deployment, and update relevant MongoDB entries for trials we know to have been running on failed or deleted Pods. This component thus allows our hyperparameter optimisation to conclude and return the optimal set of hyperparameters it discovered.
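The monitor's core reconciliation step can be sketched in plain Python; the field and state names here are illustrative, not Hyperopt's exact schema:

```python
# Stdlib sketch of the monitor's job: any RUNNING trial whose owning Pod
# is no longer alive is marked as failed, so fmin can eventually finish.
JOB_RUNNING_STATE, JOB_ERROR_STATE = "RUNNING", "ERROR"

def reconcile(trials, live_pods):
    """Mark RUNNING trials owned by dead Pods as failed."""
    for trial in trials:
        if trial["state"] == JOB_RUNNING_STATE and trial["owner"] not in live_pods:
            trial["state"] = JOB_ERROR_STATE

trials = [
    {"id": 1, "state": "RUNNING", "owner": "worker-a"},
    {"id": 2, "state": "RUNNING", "owner": "worker-b"},  # this Pod was rotated
    {"id": 3, "state": "DONE",    "owner": "worker-a"},
]
reconcile(trials, live_pods={"worker-a"})
```

In the real system the trial records live in MongoDB and the set of live Pods comes from the Kubernetes API, but the reconciliation logic is this simple comparison run on a loop.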

Future Work

We’re just at the start of this project, but here’s a flavour of what we have in mind for the future:

Automate the installation of required libraries: When a worker Pod receives a method to execute, it requires the Pod to either already have the required libraries installed, or for the method itself to ensure it installs the libraries on the fly. A next step will be to produce a class that automates these installation processes.

Add Monitoring UI: MongoDB and the Kubernetes API provide us with sufficient information to monitor the system from the hardware level through to metrics around the current optimisation processes. One advancement would be a single UI that allows us to monitor all of these metrics in real time.

Add high-availability to MongoDB: In the example workload we've provided, MongoDB is deployed as a single Pod, with no persistence of data or resilience to failure. This means that if the Pod is rotated, or fails, all work done by the cluster is lost and must be restarted from scratch. To guard against this failure mode we need to bring up MongoDB in high-availability mode.

Conclusion

We’re really excited to open-source this bit of software, and hope people find it as useful as we do. We’d really like to thank the developers of the Hyperopt library for creating something great! Don’t hesitate to nip over the Github Repo if you have any feedback or want to contribute to the library.