To build these artificial neural networks, Two Hat Security has been using Valohai to speed up the training and retraining of its models and to let the team concentrate on the work at hand: saving children instead of configuring servers.

Machine learning infrastructure

The project started as a five-person team using Amazon's G2 instances, with an internal IT person who set up all the machines for them. After a while, the team started looking for alternative solutions. Because most of the team members were interns early in the project, managing knowledge within the team was also an issue they wanted to solve: creating login accounts and rotating them after each six-month internship became a maintenance nightmare.

When the team heard about Valohai, they were in nirvana. Valohai not only managed resources for them, launching whatever cloud hardware they required and even running on their own machines, but also gave the team straightforward access management for the training hardware.

The deep learning model

With a typical training dataset of about 30 GB, and the largest about double that, the models are demanding to train. The models mostly handle object detection and classification of illegal material. To shrink the data, some preprocessing is naturally done; for instance, the original data is converted into NumPy arrays to reduce its size.
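A preprocessing step like the one described might be sketched as follows; the image shapes, the dtype choice, and the file name are illustrative assumptions, not Two Hat's actual pipeline.

```python
# Illustrative sketch: converting raw float image data into compact uint8
# NumPy arrays and saving them compressed. Shapes, dtypes, and file names
# here are assumptions, not Two Hat Security's actual pipeline.
import tempfile
from pathlib import Path

import numpy as np

def to_compact_array(raw: np.ndarray) -> np.ndarray:
    """Clip to the 0-255 pixel range and downcast float64 -> uint8 (8x smaller)."""
    return np.clip(raw, 0, 255).astype(np.uint8)

raw_batch = np.random.rand(4, 224, 224, 3) * 255   # simulated raw image batch
compact = to_compact_array(raw_batch)

out_path = Path(tempfile.mkdtemp()) / "batch.npz"  # hypothetical output location
np.savez_compressed(out_path, images=compact)      # compressed on-disk storage
```

Downcasting from float64 to uint8 alone cuts the in-memory footprint by a factor of eight before any on-disk compression is applied.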

For someone like me, with an engineering background and familiarity with Docker images, it was very easy to just jump into Valohai. We just configured our testing environment in Docker images and then configured the tests themselves in the Valohai YAML file, imported the project and boom! We had 30 hyperparameter sweeps on our first try. David Wang – Data Scientist, Two Hat Security
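The Valohai YAML file mentioned in the quote declares the steps Valohai can run. A minimal sketch of such a file might look like the following; the step name, Docker image, script, and parameters are all illustrative assumptions.

```yaml
# Hypothetical valohai.yaml sketch; names, image, and parameters are illustrative.
- step:
    name: train-model
    image: pytorch/pytorch:latest
    command: python train.py {parameters}
    parameters:
      - name: learning_rate
        type: float
        default: 0.001
      - name: epochs
        type: integer
        default: 10
```

Declaring parameters this way is what makes it possible to launch many parallel hyperparameter sweeps from the same step definition.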

The models are mostly trained with PyTorch, but there is also some TensorFlow involved. The work is not limited to computer vision; decision tree algorithms are also used, for which the team has turned to scikit-learn, XGBoost, and LightGBM.
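The tree-based side of the work could look something like this minimal scikit-learn sketch; the synthetic dataset and the choice of gradient-boosted trees are placeholders for illustration, not Two Hat's actual models or data.

```python
# Minimal, illustrative sketch of a tree-based classifier in scikit-learn.
# The synthetic dataset and model choice are placeholders, not the real setup.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a labeled training set
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Gradient-boosted decision trees, similar in spirit to XGBoost/LightGBM
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
```

XGBoost and LightGBM expose a very similar fit/predict interface, which makes it easy to swap implementations when comparing tree libraries.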

Machine learning tools

Besides the frameworks and Valohai's orchestration mentioned above, the team also uses Amazon's SageMaker for data exploration. SageMaker limits the amount of data that can be stored in it (5 GB), and whenever the team shuts down and relaunches their SageMaker instances, many things have to be reconfigured. The notebook approach is nevertheless used heavily in the early phases of a new model for exploring the data; once the exact model is ready to be trained, the work moves into Valohai.

From Two Hat Security's point of view, the biggest reason for going with Valohai was that it manages resources elastically, both in terms of hardware and team accounts.

Valohai is a super stable environment for using computing resources and thanks to it none of us need to compete about resources internally anymore. Everything is in isolation, so I can even do some rapid testing and Valohai just shuts down the cloud instance when my test ends. David Wang – Data Scientist, Two Hat Security

The team previously used TensorBoard to view the progress of each training run, but now fully relies on Valohai's online interface, which shows the progress of, for instance, tens of parallel hyperparameter sweeps.