A Recipe for Using Open Source Machine Learning Models

A step-by-step approach to finding, evaluating and using open source neural network models

Photo by Luca Bravo on Unsplash

Machine learning continues to produce state of the art (SOTA) results for an increasing variety of tasks, and more companies are looking to ML to solve their problems. Given the incredibly rapid pace of machine learning research, many of these SOTA models come from academic and research institutions, which often open source them. Using one of these open source models to bootstrap your machine learning efforts within your company can be much more effective than building a model from scratch.

However, these models are often released by researchers whose focus isn’t necessarily on enabling easy use and modification of their models (though there are many exceptions). As a result, using these open source models for your own tasks can be quite difficult.

In this post, my goal is to provide a recipe you can follow to evaluate and use open source ML models to solve your own tasks. These are the steps I’ve used over and over again in my own work (as of this writing I have anaconda environments set up for over 15 open source models). As my work is mostly using deep learning for vision and NLP, my focus here is specifically on using neural network-based models.

Whether you’re trying to use machine learning to solve real problems within your company or experiment with some fun SOTA results at home, my hope is that after this post, you’ll have a path to take an open source model, adapt it, and use it to address your own task with your own dataset.

Step 1: Naming your task

The first step is figuring out what your particular task is called in the research literature so you can successfully search for it. This can initially be quite frustrating. For example, finding all the instances of a dog in a picture would be an “object detection” task. But if you want to know exactly which pixels in the picture correspond to dogs that’s called “image segmentation.”

There are a few ways you can try to figure this out. First, if you happen to know any ML researchers or practitioners, definitely start there. Another option is to ask in r/machinelearning or r/learnmachinelearning. If none of these pan out, the next step is to google to the best of your ability. As you land on research papers you’ll often see the name commonly associated with the task in the literature.

Step 2: Finding papers and code

Once you know what to search for, the next step is to find those open source models that best suit your task. There are a few resources that are helpful here:

paperswithcode: A repository of papers and associated code, organized by task. This is a really good starting point, especially if it’s a well-known task.

arxiv-sanity: Many open source models are associated with research papers. Most papers in machine learning are (fortunately!) openly published on arxiv. Searching arxiv for recent papers that solve for your task is another good place to start. Not all published papers here have code associated with them. If you find a paper you like, try searching for “<paper name> github” to see if the code has been released.

Kaggle: If there happens to be a Kaggle competition with a task similar to yours, this can be a great way to get high quality, state of the art models. Pay particularly close attention to winner blogs for past competitions; these often have great explanations and code. The little tricks that were used to win the competition can often be really valuable for your task as well.

Dataset benchmarks: If there’s a benchmark dataset that’s similar to the task you’re working on, the leaderboard for that benchmark is a quick way to find papers with demonstrably SOTA results.

Google: For standard/common tasks like image segmentation, searching for “image segmentation github”, “image segmentation pytorch” or “image segmentation tensorflow” will give you a lot of results.

Step 3: Read the papers

This can be intimidating because academic papers can be pretty inaccessible, even to experienced software engineers. But if you focus on the abstract, introduction, related work, results, and delay a lot of the deep details/math for later readings, you’ll find you can get a lot out of the paper and a deeper understanding of the problem.

Pay particularly close attention to the dataset(s) they use and the assumptions those datasets bake into the model. Often you’ll find assumptions that don’t hold for your data but are fundamental to the model design. For example, classification models for ImageNet expect there to be one and only one salient object in an image. If your images can contain zero, one, or many objects to identify, those models are probably not directly applicable. This isn’t something you want to find out after investing the time needed to bring up the model.

Also, follow some of the references, especially those you see in multiple papers! You’ll regularly find that at least one of the references provides a very clear description of the problem and dramatically increases your understanding. Referenced papers may also end up being more useful and may have nicer code associated with them, so it’s worth doing a little digging here.

Step 4: Make sure the code is usable

Once you’ve found a paper with open source code, make sure it’s usable. Specifically:

Check the license: While a lot of code is released under liberal open source licenses (MIT, BSD, Apache, etc.), some of it isn’t. You may find the model has a non-commercial use only license, or no license at all. Depending on your use case and company, the code may or may not be usable for you.

Check the framework: If you’re working with a particular framework (e.g. TensorFlow, PyTorch), check the framework the model is built in. Most of the time you’re stuck with what you get, but sometimes there’s a reimplementation of the model in your preferred framework. A quick Google to check for this (e.g. “<paper name> pytorch”) can save you a lot of trouble.

Check the language: Similarly, if the model is in Lua and you’re not a Lua developer, this can be really painful. See if there’s a reimplementation in the language of your choice (often Python, since in deep learning Python should be part of your repertoire), and if not you might be better off finding another model.

Check the coding style: Researchers aren’t all software engineers so you can’t have as high a bar as for other open source projects, but if the code is a total mess you may want to look for another model.

Step 5: Get the model running

Results from NVIDIA’s StyleGAN trained on a custom furniture dataset

Once you’ve found a model you think is a good fit, try to get the model running. The goal here is to run the training and inference loop for the model as-is, not to get it running on your specific dataset or to make any significant modifications. All you want to do is make sure that you have the right dependencies and that the model trains and runs as advertised. To that end:

Create a conda environment for the model: You may be trying out multiple models, so create a conda environment (assuming Python) for each model (nvidia-docker is another option here, but personally I find it to be overkill).

I’ll often set up my environments like so: conda create -n <name of the github repo> python=<same version of python used by the repo>

A quick way to figure out which version of Python the repo uses is to look at the print statements: if there are no parentheses, it’s Python 2.7; otherwise a recent Python 3 (e.g. 3.6) should work.

Install the libraries: I highly recommend starting off by installing the exact same version of the framework that the original code used. If the model says it works with pytorch>0.4.0, don’t assume it’ll work with pytorch 1.0. At this stage, you don’t want to be fixing those kinds of bugs, so start with pytorch=0.4.0. You can install a particular version of a framework with a command like conda install pytorch=0.4.0 -c pytorch. A lot of code won’t have a requirements.txt file, so it may take some sleuthing and iterating to figure out all the libraries you need to install.

Get the original dataset and run the scripts: At this point, you should be able to download the original dataset and run the testing and training script. You’ll probably have to fix some paths here and there and use the README and source to figure out the correct parameters. If a pre-trained model is available, start with the testing script and see if you’re getting similar results to the paper.

Once you have the testing script running, try to get the training script up. You’ll probably have to work through various exceptions and make slight modifications to get it to work. Ultimately your goal with the training script is to see the loss decreasing with each epoch.
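As a concrete sanity check, here’s a minimal sketch of what “loss decreasing with each epoch” looks like in PyTorch. The model, data, and loss below are toy stand-ins for whatever the repo actually uses; only the checking pattern carries over:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and synthetic data standing in for the repo's own model,
# dataset, and loss -- everything here is illustrative.
model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(64, 10)
y = x.sum(dim=1, keepdim=True)  # a target the model can actually learn

losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# The sanity check: loss should trend downward across epochs.
print(losses[0], "->", losses[-1])
```

If the loss plateaus immediately or diverges, suspect a setup problem (wrong learning rate, broken data paths, mismatched library versions) before suspecting the model itself.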

If it’s straightforward (i.e. it only requires changing some command-line flags), at this point you might try running the training script on your own dataset. Otherwise, we’ll do this in step 7.

Step 6: Create your own testing notebook

At this point, you’ve confirmed that the model works and you have the right environment set up to use it. Now you can dig in and start really playing with it. I recommend creating a Jupyter notebook, copy-pasting in the testing script, and then modifying it until you can use it with a single item of data. For example, if you’re using an object detection model that finds dogs in an image, you want a notebook where you can pass in a picture and have it output the bounding boxes of the dogs.

The goal here is to get a feel for the inputs and outputs, how they must be formatted and how exactly the model works, without having to deal with the additional complexity of training or munging your own data into the right formats. I recommend doing this in a Jupyter notebook because I find that being able to see the outputs along each step is really helpful in figuring it out.
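A hypothetical sketch of what such a notebook often converges to: one `predict` function that takes a single raw image and returns clean outputs. The `DummyDetector` and the preprocessing here are placeholders — substitute the repo’s real model, checkpoint, and transforms:

```python
import numpy as np
import torch
from torch import nn

# DummyDetector stands in for the repo's real pre-trained model; the
# wrapper pattern is the point: one raw image in, clean detections out.
class DummyDetector(nn.Module):
    def forward(self, x):
        # Pretend output: one detection per image as (x1, y1, x2, y2, score).
        return torch.tensor([[10.0, 20.0, 110.0, 220.0, 0.9]])

model = DummyDetector()
model.eval()  # inference mode: freezes dropout/batch-norm behavior

def predict(image, threshold=0.5):
    """Take one HxWx3 uint8 image, return a list of (box, score) tuples."""
    # Preprocessing differs per repo: normalization, resizing, etc.
    x = torch.as_tensor(image).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():  # no gradients needed at test time
        detections = model(x)
    return [(d[:4].tolist(), d[4].item()) for d in detections if d[4] > threshold]

boxes = predict(np.zeros((224, 224, 3), dtype=np.uint8))
print(boxes)
```

Once this function exists, experimenting with your own images is a one-liner, which makes it much easier to build intuition about the model’s behavior.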

Step 7: Create your own training notebook with your dataset

Now that you have some familiarity with the model and data, it’s time to try to create a training notebook. Similar to step 6, I start by copying and pasting in the training script, separating it into multiple cells, and then modifying it to fit my needs.

If you’re already feeling comfortable with the model, you may want to go directly to modifying the training notebook so it works with your dataset. This may involve writing dataloaders that output the same format as the existing dataloaders in the model (or simply modifying those dataloaders). If you’re not yet comfortable enough to do that, start by just getting the training script to work as-is in the notebook and removing code that you don’t think is useful. Then work on getting it to work with your dataset.

Keep in mind the goal here isn’t to modify the model, even if it’s not quite solving the exact task you want yet. It’s just to get the model working with your dataset.
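As a sketch of what matching the existing dataloader format might look like in PyTorch — the `(image, label)` format and shapes here are assumptions; mirror whatever the repo’s own `__getitem__` actually returns:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):
    """Emit items in the same (image_tensor, label) format the repo's
    original dataloader produces -- check its __getitem__ to match exactly."""
    def __init__(self, samples):
        # samples: list of (raw_data, label); here the raw data is synthetic.
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        raw, label = self.samples[idx]
        # Real code would load and transform an image file here.
        image = torch.as_tensor(raw, dtype=torch.float32)
        return image, label

# Synthetic stand-in data: 8 "images" of shape 3x32x32 with integer labels.
data = [(torch.randn(3, 32, 32), i % 2) for i in range(8)]
loader = DataLoader(MyDataset(data), batch_size=4, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)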

Step 8: Start modifying the model to suit your task!

By this point, you should have a notebook that can train the model (including outputting appropriate metrics/visualizations) and a notebook where you can test new models you create. Now is a good time to start to dig in and make modifications to the model (adding features, additional outputs, variations etc) that make it work for your task and/or dataset. Hopefully, having the starting point of an existing state of the art model saves you a lot of time and produces better results than what you might get starting from scratch.

There’s obviously a lot happening in this step and you’d use all your existing model building strategies. However, below are some pointers that may be helpful specifically when building off of an existing model.

Modify the dataset before modifying the model: It’s often easier to munge your data into the format the model expects rather than modifying the model. It’s easier to isolate problems and you’re likely to introduce fewer bugs. It’s surprising how far you can sometimes push a model just by changing the data.

Reuse the pre-trained model as much as possible: If your model changes aren’t drastic, try to reuse the pre-trained model parameters. You may get faster results and the benefits of transfer learning. Even if you expand the model, you can often load the pre-trained parameters into the rest of the model (e.g. use strict=False when loading the state dict in PyTorch).

Make incremental changes and regularly check the performance: One benefit of using an existing model is you have an idea of the performance you started with. By making incremental changes and checking the performance after each one, you’ll immediately identify when you’ve made a mistake or are going down a bad path.

Ask for help: If you’re totally stuck, try reaching out to the author and asking for some pointers. I’ve found they’re often willing to help, but remember they’re doing you a favor and please act accordingly.
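To illustrate the pre-trained-reuse tip above, here’s a sketch of loading a checkpoint into an expanded model with strict=False in PyTorch. The models and the key remapping are hypothetical stand-ins for your actual architecture:

```python
import io
import torch
from torch import nn

# Hypothetical "original" model whose released checkpoint we want to reuse.
base = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 4))
buf = io.BytesIO()
torch.save(base.state_dict(), buf)  # stands in for the downloaded checkpoint
buf.seek(0)

# Expanded variant: the same backbone plus an extra head for the new task.
class Expanded(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 4))
        self.new_head = nn.Linear(4, 2)  # not present in the checkpoint

    def forward(self, x):
        return self.new_head(self.backbone(x))

model = Expanded()
state = torch.load(buf)
# Checkpoint keys are "0.weight", ...; ours are "backbone.0.weight", so
# remap them, then load with strict=False to tolerate the missing new_head.
state = {f"backbone.{k}": v for k, v in state.items()}
result = model.load_state_dict(state, strict=False)
print(result.missing_keys)  # the new_head parameters, left at random init
```

Always inspect `missing_keys` and `unexpected_keys` after a strict=False load; a long list of unexpected keys usually means your remapping is wrong rather than that the parameters are truly incompatible.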

Automatically texturing a 3D model using neural renderer

Step 9: Attribute and Contribute

Depending on the license and how you’re distributing your model, you may be required to provide attribution to the developer of the original code. Even if it’s not required, it’s nice to do it anyway.

And please contribute back if you can! If you come across bugs and fix them during your own development, submit a pull request. Well-written bug reports are welcome too. Finally, if nothing else, sending a quick thank-you to the author for their hard work is always appreciated.