Comparing Deep Learning Frameworks: A Rosetta Stone Approach

03/14/2018

This post is authored by Ilia Karmanov, Mathew Salvaris, Miguel Fierro, Danielle Dean, all Data Scientists at Microsoft.

With this blog post, we are releasing version 1.0 of our DeepLearningFrameworks repo, open-sourced on GitHub at: https://github.com/ilkarman/DeepLearningFrameworks.

We believe deep-learning frameworks are like languages: sure, many people speak English, but each language serves its own purpose. We have created common code for several different network structures and executed it across many different frameworks. Our idea was to create a Rosetta Stone of deep-learning frameworks: assuming you know one framework well, it should help you leverage any other. Situations may arise where a paper publishes code in another framework or the whole pipeline is in another language. Instead of writing a model from scratch in your favourite framework, it may be easier to just use the "foreign" language.

We want to extend our gratitude to the CNTK, Pytorch, Chainer, Caffe2 and Knet teams, and everyone else from the open-source community who contributed to the repo over the past few months.

In summary, our goals with this release were to create:

- A Rosetta Stone of deep-learning frameworks to allow data scientists to easily transfer their expertise from one framework to another.
- Optimised GPU code using the most up-to-date, highest-level APIs.
- A common setup for comparisons across GPUs (and potentially CUDA versions and precision).
- A common setup for comparisons across languages (Python, Julia, R).
- The possibility to verify the expected performance of your own installation.
- Collaboration between different open-source communities.

Results from Benchmarking Deep Learning Frameworks

In the following sections, we review results for training time for one type of CNN model, feature extraction on a pre-trained ResNet50 model, and training time for one type of RNN model.

Training Time(s): CNN (VGG-style, 32bit) on CIFAR-10 – Image Recognition

The input for this model is the standard CIFAR-10 dataset containing 50k training images and 10k test images, uniformly split across 10 classes. Each 32 by 32 image is supplied as a tensor of shape (3, 32, 32) with pixel intensity re-scaled from 0-255 to 0-1.
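As a rough illustration of the input format described above, here is a minimal NumPy sketch. It assumes the raw data arrives as flat uint8 rows of 3072 values per image (all red pixels, then green, then blue), which is how the Python version of CIFAR-10 is commonly distributed. The function name prepare_cifar_batch and the random stand-in data are ours, not taken from the repo.

```python
import numpy as np

def prepare_cifar_batch(raw_rows):
    """Convert flat uint8 rows of shape (N, 3072) to float32 tensors of shape (N, 3, 32, 32)."""
    raw_rows = np.asarray(raw_rows, dtype=np.uint8)
    # Assumption: each row holds 1024 red, then 1024 green, then 1024 blue values,
    # so a plain reshape yields channels-first (N, C, H, W) ordering.
    images = raw_rows.reshape(-1, 3, 32, 32)
    # Re-scale pixel intensity from 0-255 to 0-1.
    return images.astype(np.float32) / 255.0

# Random stand-in data; the real dataset would be loaded from disk instead.
rng = np.random.default_rng(0)
fake_rows = rng.integers(0, 256, size=(8, 3 * 32 * 32), dtype=np.uint8)
batch = prepare_cifar_batch(fake_rows)
print(batch.shape, batch.dtype)
```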

Average Time(s) for 1000 Images: ResNet-50 – Feature Extraction

A pre-trained ResNet50 model is loaded and chopped just after the final avg_pooling layer (7, 7), which outputs a 2048-dimensional vector. This can be plugged into a softmax layer or another classifier, such as a boosted tree, to perform transfer learning. Allowing for a warm start, this forward-only pass to the avg_pool layer is timed. Note: the batch size is held constant; filling the RAM on a GPU with a larger batch would produce further performance boosts (greater for GPUs with more RAM).
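To make "allowing for a warm start" concrete, here is a framework-agnostic sketch of one way such a timing could be set up. It is not the repo's benchmarking code: time_forward_pass, seconds_per_1000 and the toy forward function are hypothetical names, and the real forward pass would be the truncated ResNet50.

```python
import time

def time_forward_pass(forward_fn, batches, warmup_batches=1):
    """Run forward_fn over batches; return (elapsed_seconds, images_timed)."""
    batches = list(batches)
    # Warm start: the first batch(es) run untimed, so one-off costs such as
    # memory allocation or algorithm selection do not distort the measurement.
    for batch in batches[:warmup_batches]:
        forward_fn(batch)
    n_images = 0
    start = time.perf_counter()
    for batch in batches[warmup_batches:]:
        forward_fn(batch)
        n_images += len(batch)
    # Caveat: GPU calls are usually asynchronous, so a real benchmark must
    # synchronise with the device before stopping the clock.
    elapsed = time.perf_counter() - start
    return elapsed, n_images

def seconds_per_1000(elapsed, n_images):
    """Normalise a measurement to average seconds per 1000 images."""
    return elapsed * 1000.0 / n_images

# Toy stand-in for a model: 4 batches of 10 "images", each a list of 4 numbers.
fake_batches = [[[0.0] * 4 for _ in range(10)] for _ in range(4)]
fake_forward = lambda batch: [sum(image) for image in batch]
elapsed, n_images = time_forward_pass(fake_forward, fake_batches)
print(n_images, seconds_per_1000(elapsed, n_images))
```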

Training Time(s): RNN (GRU) on IMDB – Sentiment Analysis

The input for this model is the standard IMDB movie review dataset containing 25k training reviews and 25k test reviews, uniformly split across 2 classes (positive/negative). Processing follows the Keras approach, where the start character is set to 1, out-of-vocabulary words (a vocab size of 30k is used) are represented as 2, and the word index therefore starts from 3. Each review is zero-padded or truncated to a fixed length of 150 words.
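The encoding scheme above can be sketched in plain Python. This is only an illustration under stated assumptions: word_rank is a hypothetical lookup from a word to its 0-based frequency rank, and we pad and truncate at the end of the review, since the post does not say which side is used.

```python
START_CHAR = 1      # marks the beginning of every review
OOV_CHAR = 2        # stands in for any word outside the vocabulary
INDEX_FROM = 3      # first id available to a real word
VOCAB_SIZE = 30000  # ids at or above this value are treated as out-of-vocab
MAX_LEN = 150       # fixed length of every encoded review
PAD = 0

def encode_review(words, word_rank):
    """Map words to integer ids; word_rank gives a 0-based frequency rank (hypothetical)."""
    ids = [START_CHAR]
    for word in words:
        rank = word_rank.get(word)
        if rank is None or rank + INDEX_FROM >= VOCAB_SIZE:
            ids.append(OOV_CHAR)
        else:
            ids.append(rank + INDEX_FROM)
    return ids

def pad_or_truncate(ids, max_len=MAX_LEN, pad=PAD):
    """Assumption: keep the first max_len ids and pad with zeros at the end."""
    ids = ids[:max_len]
    return ids + [pad] * (max_len - len(ids))

word_rank = {"the": 0, "movie": 1, "great": 2, "obscure": 40000}
encoded = encode_review(["the", "movie", "was", "great", "obscure"], word_rank)
print(encoded)  # [1, 3, 4, 2, 5, 2]
fixed = pad_or_truncate(encoded)
print(len(fixed))  # 150
```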

DL Library      K80/CUDA 8/CuDNN 6   P100/CUDA 8/CuDNN 6   Using CuDNN?
CNTK            32                   15                    Yes
Keras(CNTK)     86                   53                    No
Keras(TF)       35                   26                    Yes
MXNet           29                   24                    Yes
Pytorch         31                   16                    Yes
Tensorflow      30                   22                    Yes
Julia – Knet    29                   *                     Yes

* = These have not yet been implemented as of publishing of this post. We always welcome contributions from the community to add to the results!

Lessons Learnt

- Use auto-tune: Most frameworks use cuDNN's cudnnFindConvolutionForwardAlgorithm() to run an exhaustive search and optimise the algorithm used for the forward-pass of convolutions on your fixed-sized images. Usually this is on by default but some frameworks may require a flag e.g. "torch.backends.cudnn.benchmark=True".
- Use cuDNN as much as possible: For vanilla RNNs (such as basic GRUs/LSTMs) it is usually possible to call a cuDNN wrapper to improve speed e.g. cudnn_rnn.CudnnGRU() instead of rnn.GRUCell(). The downside is that running inference on CPU later on may be more challenging.
- Match shapes: When running on cuDNN, matching the native channel-ordering of NCHW for CNNs and TNC for RNNs shaves off time wasted on reshaping and gets you straight to the matrix-multiplication.
- Native generators: Using the framework's native generators, where augmentation and even pre-processing (e.g. shuffling) are performed asynchronously via threading, gives a speed-boost.
- For inference, make sure to specify flags where possible to save unnecessary gradients being calculated, and make sure that layers such as batch-norm and drop-out are properly applied.
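The "match shapes" lesson can be illustrated with a small NumPy sketch. The idea is to convert data to the cuDNN-native layout once, when it is loaded, rather than leaving the framework to reshape every batch. The helper names and the embedding size of 128 are our own assumptions for illustration; N is the batch size, C channels, H and W height and width, and T the sequence length.

```python
import numpy as np

def nhwc_to_nchw(images):
    """Reorder images from (N, H, W, C) to channels-first (N, C, H, W)."""
    # ascontiguousarray copies the data so it really is laid out in the new order.
    return np.ascontiguousarray(images.transpose(0, 3, 1, 2))

def ntc_to_tnc(sequences):
    """Reorder sequences from batch-major (N, T, C) to time-major (T, N, C)."""
    return np.ascontiguousarray(sequences.transpose(1, 0, 2))

# CIFAR-sized images stored channels-last, and 150-step sequences with a
# hypothetical embedding size of 128.
images_nhwc = np.zeros((64, 32, 32, 3), dtype=np.float32)
sequences_ntc = np.zeros((64, 150, 128), dtype=np.float32)

images_nchw = nhwc_to_nchw(images_nhwc)
sequences_tnc = ntc_to_tnc(sequences_ntc)
print(images_nchw.shape)    # (64, 3, 32, 32)
print(sequences_tnc.shape)  # (150, 64, 128)
```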

When we originally created the repo, there were many little tips and tricks we had to use to ensure we were using the same model between frameworks and it was done in an optimal way. It has been incredible to see how quickly the frameworks have all evolved in just the last few months – many of the original learnings from late 2017 of how to best optimize are obsolete today as the frameworks have been updated.

For example, Keras with a TF backend had channel-ordering hard-coded as channels-last (which is not optimal for cuDNN), so specifying channels-first meant it would reshape after every batch (to the hard-coded value) and slow down training immensely. Now Keras with a TF backend supports native channels-first ordering. Tensorflow could previously be sped up by specifying a flag to use the Winograd algorithm for convolutions, however this is no longer helpful. For fun, check out some of the initial learnings in an earlier version of the repo.

By completing an end-to-end solution in different frameworks, it is possible to compare the frameworks in several ways. As the same model architecture and data are used by each of the frameworks, the accuracy is extremely similar across the frameworks (indeed, this is one way we tested the code to make sure the same model was being used across frameworks!). Also, the notebooks were developed to allow an easy comparison between the frameworks rather than necessarily for speed.

Of course, while it is tempting to compare different frameworks using metrics such as training time and inference time, these numbers aren't meant to suggest anything about the overall performance of a framework, since they omit important comparisons such as: help and support, availability of pre-trained models, custom layers and architectures, data-loaders, debugging, different platforms supported, distributed training, and much more! They are simply meant to show how to create the same networks across different frameworks and the performance on these specific examples.

A "Travellers Companion" for Deep Learning Frameworks

There are many popular deep learning frameworks in use in the community, and this is one effort to help AI developers and data scientists leverage different deep learning frameworks as applicable. A related effort is the Open Neural Network Exchange (ONNX), which is an open source interoperability standard for transferring deep learning models between frameworks. ONNX is useful, for example, when you develop a model in one framework but want to score it in another. Similarly, MMdnn is a set of tools to help users directly convert between different frameworks as well as visualize the model architecture.

The "travellers companions" for deep learning frameworks, such as ONNX and MMdnn, are like an automatic machine translator. In contrast, the repo we are releasing as a full version 1.0 today is like a Rosetta Stone for deep learning frameworks, showing the model building process end to end in the different frameworks. Combined, all of these efforts result in a traveller ready to live in an environment with many languages.

Spin up an Azure Deep Learning Virtual Machine, replicate, and contribute back to the repo – happy coding!

Ilia, Mathew, Miguel & Danielle.