TensorFlow 1.5.0 has been officially released. Among its new features, one of the biggest is support for CUDA 9 and cuDNN 7, which promises double-speed training on Volta GPUs with FP16. But how does it fare on a plain old GeForce 840M? We are going to run a benchmark on the CIFAR-10 dataset to find out.

To run our tests, we either installed tensorflow-gpu from the official pip packages or built it from source using Bazel. If you want to learn more about how we did that, check out our other article here.

We will be performing our benchmark on the famous CIFAR-10 dataset. The CIFAR-10 dataset (Canadian Institute For Advanced Research) is a collection of images commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32×32 color images in 10 different classes. The 10 classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images in each class.

(Source: https://en.wikipedia.org/wiki/CIFAR-10)

We are going to run our benchmark using the cifar10_train.py file found in tutorials/image/cifar10 in the TensorFlow models repo on GitHub. The test machine is a Dell Inspiron 15-3878 with an Intel i7 processor and 8 GB of RAM. The system has an Nvidia GPU (GeForce 840M) with compute capability 5.0. We are going to run the script on 4 configurations of TensorFlow, CUDA Toolkit and cuDNN, as listed below.

TensorFlow GPU 1.4.1 with CUDA 8.0 and cuDNN 6.0
TensorFlow GPU 1.5.0 with CUDA 9.0 and cuDNN 7.0.5
TensorFlow GPU 1.5.0 with CUDA 9.1 and cuDNN 7.0.5
TensorFlow GPU 1.8.0 with CUDA 9.2 and cuDNN 7.1.4

We are going to run each benchmark for a maximum of 10000 steps and measure the total time taken to complete those 10000 steps using the time.time() function.
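The timing approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual change made to cifar10_train.py: the helper time_steps and the stand-in dummy_step are hypothetical names, and a real run would call the training step in place of the stand-in.

```python
import time


def time_steps(run_step, max_steps=10000):
    """Return the wall-clock seconds taken to call run_step max_steps times."""
    start = time.time()
    for step in range(max_steps):
        run_step(step)
    return time.time() - start


def dummy_step(step):
    """Hypothetical stand-in for one training step; does nothing."""
    pass


total = time_steps(dummy_step, max_steps=1000)
print("total duration: %.3f sec" % total)
```

Because time.time() measures wall-clock time, the total covers everything that happens between the two calls, not only the GPU work.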

1. Benchmark TensorFlow GPU 1.4.1 with CUDA Toolkit 8.0 and cuDNN 6.0

For this configuration, we installed the official prebuilt pip package of TensorFlow GPU 1.4.1. Since TensorFlow GPU 1.4.1 requires CUDA Toolkit 8.0 and cuDNN 6.0, we installed those as well. Upon running 10000 steps on the CIFAR-10 dataset, here’s what we found:

We can see this setup processes roughly 900 examples per second and takes 0.14 seconds per batch of images on average. It took a total of around 1558 seconds, which is roughly 26 minutes, to run 10000 steps on the CIFAR-10 dataset.

2. Benchmark TensorFlow GPU 1.5.0 with CUDA Toolkit 9.0 and cuDNN 7.0.5

For this configuration, we installed the official prebuilt pip package of TensorFlow GPU 1.5.0. We also installed CUDA Toolkit 9.0 and cuDNN 7.0.5. Upon running 10000 steps on the CIFAR-10 dataset, here’s what we found:

We can see this setup processes roughly 1240 examples per second and takes 0.103 seconds per batch of images on average. It took a total of around 1106 seconds, which is roughly 18 minutes, to run 10000 steps on the CIFAR-10 dataset.
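The examples-per-second and seconds-per-batch figures are two views of the same measurement. As a rough check, the sketch below converts one into the other. It assumes a batch size of 128, which we believe is the default in the CIFAR-10 tutorial script but which is not stated in this article; the function name examples_per_sec is our own.

```python
BATCH_SIZE = 128  # assumption: tutorial default, not stated in this article


def examples_per_sec(sec_per_batch, batch_size=BATCH_SIZE):
    """Convert average seconds per batch into examples processed per second."""
    return batch_size / sec_per_batch


# Seconds-per-batch values reported for the first two configurations.
reported = [
    ("TF 1.4.1 / CUDA 8.0 / cuDNN 6.0", 0.14),
    ("TF 1.5.0 / CUDA 9.0 / cuDNN 7.0.5", 0.103),
]

for name, sec_per_batch in reported:
    print("%s: %.0f examples/sec" % (name, examples_per_sec(sec_per_batch)))
```

Under that assumption the script gives about 914 and 1243 examples per second, which lines up with the rounded figures of roughly 900 and 1240 quoted for these two runs.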

3. Benchmark TensorFlow GPU 1.5.0 with CUDA Toolkit 9.1 and cuDNN 7.0.5

Since the official pip package of TensorFlow GPU 1.5.0 does not ship with CUDA Toolkit 9.1 support, we had to build TensorFlow 1.5.0 with CUDA Toolkit 9.1 to perform this test. You can build TensorFlow with CUDA 9.1 by following the tutorial here. Here are the results we found upon running 10000 steps on the CIFAR-10 dataset:

We can see this setup performs almost the same as our earlier setup in terms of examples per second and seconds per batch of images. It took a total of around 1046 seconds, which is roughly 17 minutes, to run 10000 steps on the CIFAR-10 dataset.

4. Benchmark TensorFlow GPU 1.8.0 with CUDA Toolkit 9.2 and cuDNN 7.1.4

We built TensorFlow 1.8.0 with CUDA Toolkit 9.2 to perform this test. You can build TensorFlow with CUDA 9.2 by following the tutorial here. Here are the results we found upon running 10000 steps on the CIFAR-10 dataset:

We can see this setup performs almost the same as our earlier setup in terms of examples per second and seconds per batch of images. It took a total of around 975 seconds, which is roughly 16 minutes, to run 10000 steps on the CIFAR-10 dataset.

Conclusion

From the results above we can conclude that, on this hardware, TensorFlow 1.8.0 with CUDA 9.2 trains considerably faster than the older setups. TensorFlow 1.8.0 with CUDA 9.2 and cuDNN 7.1.4 finished the 10000 steps in about 37% less time than TensorFlow 1.4.1 with CUDA 8.0 and cuDNN 6.0 (975 seconds versus 1558 seconds). Even compared to the CUDA 9.0 and 9.1 setups with cuDNN 7.0.5, the CUDA 9.2 setup took around 7-12% less time.
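The percentages in the conclusion can be reproduced from the total durations reported in each section, read as reductions in total run time. The short script below is only a sketch of that arithmetic; the dictionary keys and the function name time_reduction_pct are our own labels.

```python
# Total seconds for 10000 steps, as reported in the sections above.
durations = {
    "TF 1.4.1 / CUDA 8.0 / cuDNN 6.0": 1558,
    "TF 1.5.0 / CUDA 9.0 / cuDNN 7.0.5": 1106,
    "TF 1.5.0 / CUDA 9.1 / cuDNN 7.0.5": 1046,
    "TF 1.8.0 / CUDA 9.2 / cuDNN 7.1.4": 975,
}
NEWEST = "TF 1.8.0 / CUDA 9.2 / cuDNN 7.1.4"


def time_reduction_pct(old_secs, new_secs):
    """Percentage of run time saved when going from old_secs to new_secs."""
    return 100.0 * (old_secs - new_secs) / old_secs


for name, secs in durations.items():
    if name == NEWEST:
        continue
    saved = time_reduction_pct(secs, durations[NEWEST])
    print("%s -> %s: %.1f%% less time" % (name, NEWEST, saved))
```

This gives 37.4%, 11.8% and 6.8% for the three older setups. Note that a 37% reduction in run time is not the same as 37% higher throughput: 1558 / 975 is about 1.6, so the newest setup gets through roughly 60% more steps per second than the oldest one.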