How deep is your learning?

NVIDIA’s new Tensor Cores tested!

Recently, we've had some hands-on time with NVIDIA's new TITAN V graphics card. Equipped with the GV100 GPU, the TITAN V has shown us some impressive results in both gaming and GPGPU compute workloads.

However, one of the most interesting areas that NVIDIA has been touting for GV100 has been deep learning. With a 1.33x increase in single-precision FP32 compute over the Titan Xp, and the addition of specialized Tensor Cores for deep learning, the TITAN V is well positioned for deep learning workflows.

In mathematics, a tensor can be represented as a multi-dimensional array of numerical values with respect to a given basis. While we won't go deep into the math behind it, tensors are a crucial data structure for deep learning applications.
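As a rough illustration of the "multi-dimensional array" idea, the sketch below builds a small batch of images as nested Python lists and reports its shape. The helper name `shape` and the tiny dimensions are made up for this example; real frameworks use dedicated array types rather than nested lists.

```python
def shape(tensor):
    """Return the dimensions of a regularly nested, non-empty list, outermost first."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


scalar = 1.0                        # rank 0: a single value
vector = [1.0, 2.0, 3.0]            # rank 1
matrix = [[1.0, 2.0], [3.0, 4.0]]   # rank 2

# Hypothetical batch of 2 images, each 4x4 pixels with 3 color channels (rank 4).
batch = [[[[0.0 for _ in range(3)] for _ in range(4)] for _ in range(4)]
         for _ in range(2)]

print(shape(batch))  # (2, 4, 4, 3): batch, height, width, channels
```

A training batch for an image model is typically a rank-4 tensor like this one, just with far larger dimensions.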

NVIDIA's Tensor Cores aim to accelerate tensor-based math by operating on small matrices as a unit: each core performs a 4x4 matrix multiply-accumulate on half-precision FP16 inputs, rather than working through the values one at a time. The GV100 GPU in the TITAN V contains 640 of these Tensor Cores to accelerate FP16 neural network training.
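NVIDIA's public description of Volta is that each Tensor Core computes D = A x B + C on 4x4 matrices, with FP16 inputs and the option to accumulate in FP32. The following is a pure-Python emulation of that arithmetic only, not a description of how the hardware is programmed. The function names are illustrative, and the rounding after every addition is an assumption of this sketch; the real hardware's internal rounding order may differ.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    Raises OverflowError for values outside the FP16 range.
    """
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def mma_4x4(a, b, c):
    """Emulate D = A x B + C with FP16 inputs and FP32 accumulation."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # Inputs are rounded to FP16; their product is exact in FP32.
                prod = to_fp16(a[i][k]) * to_fp16(b[k][j])
                # Assumption: round the running sum to FP32 after each add.
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


# FP16 cannot represent 0.1 exactly, so inputs lose a little precision.
print(to_fp16(0.1))
```

Keeping the accumulator in FP32 limits how much rounding error builds up as many small FP16 products are summed.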

It's worth noting that this is not the first hardware built specifically for tensor operations; others, such as Google, have been developing hardware for these specific functions.

Test Setup

For our NVIDIA testing, we used the NVIDIA GPU Cloud 17.12 Docker containers for both TensorFlow and Caffe2 inside of our Ubuntu 16.04.3 host operating system.

AMD testing was done using the hiptensorflow port from the AMD ROCm GitHub repositories.

For all tests, we are using the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) data set.

TensorFlow

Originally based on an internal Google development project, TensorFlow is one of the most popular open source deep learning frameworks available to researchers. With GPU support written in CUDA, TensorFlow is a mature framework with support for many different deep learning models.

There are two key things to look for here as far as performance is concerned: batch size and the level of precision used to train the model. Batch size is the number of items passed into the model and processed at once. A larger batch size generally allows for a faster total training time and shows more of a performance delta between devices, but it is constrained by the memory available on your system.
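To make the batch size trade-off concrete, here is a minimal sketch of the arithmetic. The training-set size is the commonly cited figure for ILSVRC2012 and is an assumption not stated in this article; the function names are invented for the example.

```python
# Assumption: the ILSVRC2012 training set is commonly cited as 1,281,167 images.
ILSVRC2012_TRAIN_IMAGES = 1_281_167


def steps_per_epoch(num_images, batch_size):
    """Number of training steps needed to see every image once (rounded up)."""
    return (num_images + batch_size - 1) // batch_size


def images_per_second(batch_size, seconds_per_step):
    """Throughput, given how long one training step takes."""
    return batch_size / seconds_per_step


for batch_size in (32, 64, 128):
    print(batch_size, steps_per_epoch(ILSVRC2012_TRAIN_IMAGES, batch_size))
```

Doubling the batch size halves the number of steps per pass over the data, so total training time drops as long as each step takes less than twice as long.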

The more important detail for this testing is the precision at which the model is trained. With Volta, the ability to train the network at FP16 (half-precision) is enabled. When training in FP16 mode, the specialized Tensor Cores of the GV100 GPU are used. In FP32 mode, the traditional CUDA cores are used for training the network. This provides us with a good comparison of how effective the Tensor Cores are compared to traditional GPU stream processors.

Please note that scores of 0 in these results mean either that the GPU does not support FP16 training (the Titan Xp and Vega Frontier Edition with the current software), or that the batch size was not supported with our 64GB of system memory.

Across the three different models we tested with TensorFlow, there are some common performance traits to be observed. With traditional FP32 operations, the Titan V sees a 15-25% advantage in training over the last generation GP102-based Titan Xp.

The AMD GPU, however, falls far behind both the Titan V and the Titan Xp. This is likely due to the original TensorFlow application being written in the CUDA programming language, as opposed to OpenCL, which would run natively on AMD GPUs. Instead, AMD GPUs must run the hiptensorflow project, which consists of CUDA code converted to universal C++ code through AMD's HIP converter. From the results, it's clear that this conversion has a significant performance downside, and anyone who is interested in training TensorFlow-powered neural networks should look elsewhere than AMD GPUs at the moment.

When taking FP16 into account, there are major performance benefits to the Tensor Cores in the Titan V, ranging from 40% to 80%, and over twice the performance of the last generation Titan Xp running in FP32 mode.

More importantly, FP16 also allows us to hit higher batch sizes, which means training an entire network will be even faster. In our testing, we saw an improvement of over 120% moving from the largest batch sizes we could hit on the Titan Xp with FP32 to larger, FP16-based batches on the Titan V.
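The reason FP16 permits larger batches comes down to storage: a half-precision value takes 2 bytes instead of 4. The sketch below shows that arithmetic under simplified assumptions. The per-image value count and the memory budget are hypothetical numbers, not measurements from our testing, and real FP16 training usually keeps some state in FP32, so the practical gain is typically less than a clean 2x.

```python
import struct

BYTES_FP16 = struct.calcsize("e")  # size of one half-precision value: 2
BYTES_FP32 = struct.calcsize("f")  # size of one single-precision value: 4


def tensor_bytes(dims, bytes_per_value):
    """Storage needed for a dense tensor with the given dimensions."""
    count = 1
    for d in dims:
        count *= d
    return count * bytes_per_value


def max_batch(values_per_image, budget_bytes, bytes_per_value):
    """Largest batch whose per-image values fit within a memory budget."""
    return budget_bytes // (values_per_image * bytes_per_value)


# Hypothetical figures chosen only to show the ratio.
VALUES_PER_IMAGE = 1_000_000   # activations stored per image (assumed)
BUDGET_BYTES = 8 * 10**9       # memory left for activations (assumed)

print(max_batch(VALUES_PER_IMAGE, BUDGET_BYTES, BYTES_FP32))
print(max_batch(VALUES_PER_IMAGE, BUDGET_BYTES, BYTES_FP16))
```

With the same budget, the FP16 case fits twice as many images per batch in this idealized model.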

(Editor's note: The AMD Vega architecture has support for packed FP16 math, which AMD markets as "Rapid Packed Math" and which essentially doubles performance in FP16 operations. While I'm not certain it's the case, it would seem likely that these deep learning workloads could see performance improvements if the capability is enabled in future software optimizations.)

Caffe2

In order to validate this performance, we also compared the Titan Xp and the Titan V on another deep learning staple, Caffe2.

Caffe2 is another popular open source deep learning framework developed by Facebook. For our testing, we are using the same ResNet-50 model as we used in some of our TensorFlow testing above.

With the ResNet-50 model, we see similar results from Caffe2 as we did in TensorFlow. FP32-based training sees a 16% increase going to the more powerful Titan V, but FP16 training provides an incredible 94% increase in throughput between the Titan Xp (in FP32 mode) and the Titan V at the same batch size.
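For readers converting between the percentages quoted here and "times faster" figures, this small sketch shows the relationship. The throughput values are hypothetical round numbers picked to reproduce the percentages, not our measured results.

```python
def percent_gain(baseline, new):
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) * 100.0 / baseline


def speedup(baseline, new):
    """How many times faster `new` is than `baseline`."""
    return new / baseline


# Hypothetical images/sec values, chosen only to illustrate the math.
titan_xp_fp32 = 100.0
titan_v_fp32 = 116.0
titan_v_fp16 = 194.0

print(percent_gain(titan_xp_fp32, titan_v_fp32), speedup(titan_xp_fp32, titan_v_fp32))
print(percent_gain(titan_xp_fp32, titan_v_fp16), speedup(titan_xp_fp32, titan_v_fp16))
```

A 94% increase is a 1.94x speedup, so "over twice the performance" corresponds to any gain above 100%.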

Much as we expected, NVIDIA's Titan V is the most powerful option for workstation-level deep learning.

Even at $3000, this card is a no-brainer for a researcher or scientist who doesn't need something on the scale of the $150,000 NVIDIA DGX-1 server with its 8 V100 GPUs, but still wants the ability to iterate quickly on models or train on a multitude of large datasets.