NVIDIA announced availability of the Titan V card on Friday, December 8th. We had a couple in hand for testing on Monday, December 11th. Nice! I ran through many of the machine learning and simulation test problems that I have run on Titan cards in the past. The results do not show the near doubling in performance of past generations... but read on.

NVIDIA has set high expectations with its incredible performance leaps in past generation updates to the Titan. My initial tests with the Titan V are not showing the large performance gains of the past. I am optimistic that I'll be able to better exploit the unique features of the card after I spend more time with it. There is no doubt that this is the best desktop "video card" available, and it has already shown surprisingly good results on the desktop in areas where it wasn't really expected to excel. I feel that it is a bargain for $3000 since it is basically an actively cooled, desktop-compatible video card that has most of the same features as the Tesla V100. However, it is no secret that the GeForce 1080Ti and the Titan Xp cards offer fantastic (single precision) compute performance at a significantly lower cost.

I'll start with a description of the test setup followed by testing results and comments about the performance.

Testing the Titan V against Titan Xp

I'm not going to go over all of the spec details and press release info. There is plenty of that scattered all over the web. I'm just going to run some compute jobs on the Titan V and report the results.

Test set-up (Puget Systems Peak Mini)

I did most of my testing using NVIDIA built docker images. The NGC docker registry is now available for local desktop use. You will need to be a registered NVIDIA developer to receive an API key to access the NGC docker registry. I highly recommend it!

Note: I will be doing an updated series of posts in early 2018 on how to install, configure, and use docker and NVIDIA docker on your desktop system. That will be a refresh of my earlier series on this topic. It will use version 2 of NVIDIA docker and will discuss use of the NGC docker registry.

Tested applications,

CUDA nbody and NAMD Molecular Dynamics Simulation.

The first thing I do when I get a new NVIDIA card is set up a CUDA development environment and compile the included samples. Then I run nbody as a benchmark. This is a classical physics, many-body force (gravity) calculation. The second test job I like to run is a molecular dynamics calculation on the million atom "satellite tobacco mosaic virus", a.k.a. "stmv", using NAMD. These are both challenging mathematical compute applications and have excellent GPU acceleration.
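To make the nbody workload a little more concrete, here is a minimal pure-Python sketch of an all-pairs gravity calculation. It is not the code from the CUDA sample; the function names, the choice of G = 1, and the optional softening term are assumptions made for illustration only. The point is that every body interacts with every other body, so the work grows with the square of the body count.

```python
import math

def accelerations(positions, masses, softening=0.0):
    """All-pairs gravitational acceleration with G = 1 (illustrative units).

    A non-zero softening keeps the result finite if two bodies coincide.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a body exerts no force on itself
            d = [positions[j][k] - positions[i][k] for k in range(3)]
            dist_sq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + softening * softening
            scale = masses[j] / (dist_sq * math.sqrt(dist_sq))
            for k in range(3):
                acc[i][k] += scale * d[k]
    return acc

def pair_interactions(num_bodies):
    """Number of body-on-body force evaluations in one all-pairs step."""
    return num_bodies * (num_bodies - 1)

# Two unit masses one unit apart pull on each other equally and oppositely.
print(accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))

# Interaction count per step for the body count used in the benchmark run.
print(pair_interactions(256000))
```

A loop like this is hopelessly slow in Python, but it maps naturally onto a GPU because each body's sum can be computed independently of the others.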

CUDA nbody and NAMD on Titan V and Titan Xp

CUDA 9.1 nbody:
nbody -benchmark -numbodies=256000

NAMD 2.12:
namd-multicore-memopt +p16 +setcpuaffinity +idlepoll stmv.namd

Card | nbody single precision GFLOP/s | NAMD run time (sec) | NAMD day/ns
Titan V (Volta) | 7159 (9107)* | 24.6 | 0.480
Titan Xp | 8409 (7600)* | 25.8 | 0.525

Notes:

* The results in parentheses were obtained when I did not specify "-numbodies". That was 81920 bodies for the Titan V case. That's significantly better for the Titan V, and I'm not sure why.

For NAMD the number that is most meaningful is day/ns, where smaller is better. (That's how much of a day it takes to simulate 1 nanosecond.)

For nbody I did runs using CUDA 9.1 installed directly on the machine I was using and compared them with job runs using the NVIDIA CUDA 9.0 dev docker image from the NGC repository. Results were essentially the same for both.

NAMD was run using the NVIDIA docker image in the NGC repository.

Old Results from my "NVIDIA Titan GPUs (3 generations)" post

Card | nbody single precision GFLOP/s | NAMD run time (sec) | NAMD day/ns
Titan X (Pascal) | 7507 | 41 | 0.570
TITAN X (Maxwell) | 4292 | 55 | 0.889
Titan Black | 2302 | 81 | 1.460

I included this table of results from a post I did a bit over a year ago. That was with CUDA 8.0rc and an older version of NAMD. The table illustrates the remarkable performance gains of past generations of Titan. I'm not completely sure what to think of the nbody result for the Titan V. nbody is not necessarily a "good" benchmark, but it served well to show relative performance in the past. The Titan V results for NAMD are very good but not dramatically so. NAMD can be CPU bound; however, I did check results with fewer CPU cores and got nearly the same results.

The Titan V did well on these tests but it was not the astoundingly better performance that we have seen in the past.
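As a rough way to put numbers on "not astoundingly better", the sketch below takes the NAMD day/ns figures from the two tables above, inverts them to ns/day, and prints the speedup of each card over the previous row. The helper names are hypothetical. Keep in mind that the older cards were measured with CUDA 8.0rc and an older NAMD build, so the cross-table ratios are only indicative.

```python
# NAMD stmv results copied from the tables in this post (day/ns, smaller is better).
# The first three rows come from the older post and used older software versions.
DAY_PER_NS = [
    ("Titan Black", 1.460),
    ("TITAN X (Maxwell)", 0.889),
    ("Titan X (Pascal)", 0.570),
    ("Titan Xp", 0.525),
    ("Titan V (Volta)", 0.480),
]

def ns_per_day(day_per_ns):
    """Invert day/ns into the ns/day throughput figure."""
    return 1.0 / day_per_ns

def speedup(old_day_per_ns, new_day_per_ns):
    """How many times faster the newer card simulates the same system."""
    return old_day_per_ns / new_day_per_ns

previous = None
for card, value in DAY_PER_NS:
    line = f"{card:18s} {ns_per_day(value):5.2f} ns/day"
    if previous is not None:
        line += f"  ({speedup(previous, value):.2f}x vs previous row)"
    print(line)
    previous = value
```

On these numbers the earlier generation steps are each well above 1.5x, while the step from the Titan Xp to the Titan V is under 1.1x.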

nbody Double Precision (fp64)

One of the very strong features of the Titan V is its terrific double precision floating point compute performance. This is something not typically exploited on GPUs because in the past single precision on GPUs has been much faster. GeForce cards have nearly all had "crippled" fp64 performance (with the exception of the original Titan). The Titan V has the full double precision (fp64) performance of the Tesla V100. Volta has the highest ratio of double (fp64) to single (fp32) performance of any architecture NVIDIA has produced. The ratio is 1:2, which means fp64 runs at half the performance of fp32. That is really good! Here's what happens when you run the nbody simulation with fp64 and compare with the Titan Xp.

Double precision (fp64) nbody results with Titan V and Titan Xp

CUDA 9.1 nbody:
nbody -benchmark -fp64

Card | nbody double precision (fp64) GFLOP/s
Titan V (Volta) | 4456
Titan Xp | 348

That's about 12.8 times the performance of the Titan Xp, i.e. an increase of roughly 1180%!
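"Times as fast" and "percent increase" are easy to blur, so here is a small sketch of the arithmetic using the nbody figures reported in this post. The helper names are hypothetical. For the fp64:fp32 ratio it assumes the fair fp32 comparison is the default-body-count run (the numbers in parentheses in the single precision table), since the fp64 run did not set "-numbodies" either.

```python
# nbody GFLOP/s copied from the tables in this post.
FP64 = {"Titan V (Volta)": 4456, "Titan Xp": 348}
# fp32 results from the runs without -numbodies (values in parentheses above).
FP32_DEFAULT = {"Titan V (Volta)": 9107, "Titan Xp": 7600}

def times_as_fast(new, old):
    """Ratio of the two throughputs."""
    return new / old

def percent_increase(new, old):
    """Increase relative to the old value, in percent."""
    return (new - old) / old * 100.0

ratio = times_as_fast(FP64["Titan V (Volta)"], FP64["Titan Xp"])
increase = percent_increase(FP64["Titan V (Volta)"], FP64["Titan Xp"])
print(f"fp64: Titan V is {ratio:.1f}x the Titan Xp ({increase:.0f}% increase)")

# Measured fp64 throughput as a fraction of measured fp32 throughput.
for card in FP64:
    print(f"{card}: fp64/fp32 = {FP64[card] / FP32_DEFAULT[card]:.3f}")
```

The measured fp64/fp32 fraction for the Titan V comes out close to the 1:2 ratio quoted above, while the Titan Xp lands at a few percent.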

Machine Learning Tests with Titan V and Titan Xp

The Titan V was launched at the 2017 NIPS conference (Conference and Workshop on Neural Information Processing Systems). NIPS is an important conference for the machine learning crowd. NVIDIA GPUs are half of the reason that we have seen such an explosion of interest and activity in machine learning and AI. The other half is that we now have mountains of data to work with. The heavy compute end of machine learning is largely driven by NVIDIA GPUs. (They also play a very important role on the deployment, i.e. inference, end of things!)

Since this is probably why you are reading this post, let's get to some results.

These are preliminary results of some "standard" machine learning test job runs. They are not exploiting Tensor-cores and half-precision. Tensor-cores are one of the unique hardware features of the Volta architecture and have the largest potential for dramatic performance gains. However, they utilize half-precision and will "require" code tuning and possibly complete rethinking of algorithm implementation. I was not able to find anything that benefited from Tensor-cores "out-of-the-box". I'll let you know when I do!

All testing in this section was done using NVIDIA-docker with images from the NVIDIA NGC repositories.

Convnet benchmark with Tensorflow

Convnet is a convenient "convolutional neural network" (CNN) benchmark set that can be run on many machine learning frameworks. I count 19 frameworks on the GitHub repo. You can get the source and run scripts (Python) from the convnet-benchmarks GitHub page. I ran forward and backward propagation steps for the GoogleNet V1 CNN using Tensorflow with 100 steps and a batch size of 128.

Note: I'm using NVIDIA docker V2, in case you are wondering about docker run --runtime=nvidia . You can use the old plugin syntax nvidia-docker run in both version 1 and 2. I'll write about setting up NVIDIA docker V2 soon.

Convnet benchmark, GoogleNetV1 with Tensorflow on Titan V and Titan Xp

docker run --runtime=nvidia --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -v /home/kinghorn/projects:/projects nvcr.io/nvidia/tensorflow:17.12 python benchmark_googlenet.py --batch_size="128" --num_batches="100" --data_format="NCHW" | tee GPU-googlenet-128.outV

Card | GoogleNetV1 (forward-backward), batch size=128
Titan V (Volta) | 0.164 sec/batch
Titan Xp | 0.201 sec/batch

Approx. 20% speedup with Titan V
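For readers who prefer throughput to time per batch, the sketch below converts the sec/batch figures above into images per second and shows two common ways of quoting the gain. The helper names are hypothetical; the inputs are the measured values from the table.

```python
BATCH_SIZE = 128
# Forward-backward time per batch, copied from the table above.
SEC_PER_BATCH = {"Titan V (Volta)": 0.164, "Titan Xp": 0.201}

def images_per_second(batch_size, sec_per_batch):
    """Throughput implied by a time-per-batch measurement."""
    return batch_size / sec_per_batch

for card, seconds in SEC_PER_BATCH.items():
    print(f"{card}: {images_per_second(BATCH_SIZE, seconds):.0f} images/sec")

# Throughput gain of the Titan V relative to the Titan Xp.
throughput_gain = SEC_PER_BATCH["Titan Xp"] / SEC_PER_BATCH["Titan V (Volta)"] - 1.0
# Reduction in time per batch relative to the Titan Xp.
time_saved = 1.0 - SEC_PER_BATCH["Titan V (Volta)"] / SEC_PER_BATCH["Titan Xp"]
print(f"throughput gain: {throughput_gain:.1%}, time per batch reduced by: {time_saved:.1%}")
```

The two figures bracket the "approx. 20%" quoted above; which one to report depends on whether you think in terms of throughput or run time.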

Tensorflow LSTM Language Model Training

This is an LSTM language modeling training run using a very large word corpus. It is included in the "nvidia-examples" directory of the Tensorflow docker image in the NGC repository.

Tensorflow LSTM (Train) on 1 Billion Word Benchmark Dataset on Titan V and Titan Xp

python single_lm_train.py --mode=train --logdir=/logs --num_gpus=1 --datadir=./data/1-billion-word-language-modeling-benchmark-r13output/ --hpconfig run_profiler=False,max_time=${MAX_TRAIN_TIME_SEC},num_steps=20,num_shards=8,num_layers=2,learning_rate=0.2,max_grad_norm=1,keep_prob=0.9,emb_size=1024,projected_size=1024,state_size=8192,num_sampled=8192,batch_size=512

Card | Tensorflow LSTM (Train), 1 billion word dataset
Titan V (Volta) | 8227 words per second
Titan Xp | 7541 words per second

Approx. 10% speedup with Titan V

DIGITS v6.0 with Caffe ImageNet Model Training

I have tested with NVIDIA DIGITS in the past, for example in "NVIDIA DIGITS with Caffe - Performance on Pascal multi-GPU". NVIDIA DIGITS has a nicely done browser based interface, and the new version 6.0 now includes Tensorflow in addition to the Caffe and Torch frameworks. I used a training image set from the IMAGENET Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) and ran 1 epoch using GoogleNet with a batch size of 64.

DIGITS 6.0 with Caffe, GoogLeNet Model Training on 1.3 Million Image Dataset on Titan V and Titan Xp

docker run --runtime=nvidia --shm-size=1g --ulimit memlock=-1 --rm -it -p 8888:5000 -v /home/kinghorn/projects/data-mnt:/data -v /home/kinghorn/projects/data-mnt/jobs:/workspace/jobs nvcr.io/nvidia/digits:17.12

Card | Caffe GoogleNet CNN, ImageNet 1.3 million images, 1 epoch
Titan V (Volta) | 31 min.
Titan Xp | 35 min.

Approx. 12% speedup with Titan V

Conclusions and Recommendations

The Titan V is a great card and it has better compute performance than the Titan Xp. However, I didn't see large performance improvements with the limited testing that I did. It's not like the (phenomenal) performance gains we have seen in the past. I used programs that I would consider to be common for use with GPU acceleration, but they did not exploit the new hardware features of the Titan V.

The double precision performance of the Volta architecture is outstanding, and it has been fully enabled on the Titan V. However, developers have long used single precision for GPU acceleration since it has offered the best performance and it was relatively easy to adapt algorithms to it.

The most intriguing new hardware feature of the Titan V is the Tensor-cores. These hardware units could possibly give an order of magnitude performance increase to algorithms that can exploit them. However, this requires the use of half-precision (fp16) and could be challenging for developers to exploit. I did briefly try to get some jobs running with Caffe2 and TensorRT that would utilize the Tensor-cores, but in the short time that I have been working with the card I was not able to get useful results. Support for Tensor-cores is available in NVIDIA's cuDNN and cuBLAS libraries, so I expect to see more programs using this feature soon. I will continue to work on that and will certainly write about it in the future.

Personally I think the idea of Tensor-cores is brilliant. However, I'm not too excited about half-precision (fp16). That's only about 3 to 4 decimal digits of precision ... "what could possibly go wrong". I can see how you could get away with that in some cases, but I'm still waiting to be convinced that it's a "good thing".
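To give a feel for how little precision fp16 carries, here is a minimal sketch that emulates half precision on the CPU using the Python standard library's struct module, whose "e" format is IEEE 754 binary16. The helper names are hypothetical, and this says nothing about how any GPU library actually handles accumulation; it only shows what happens when a running sum is itself kept in fp16.

```python
import math
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fp16_running_sum(values):
    """Sum values while rounding the accumulator to fp16 after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total

# fp16 has an 11-bit significand (10 stored bits plus the implicit leading 1).
SIGNIFICAND_BITS = 11
print(f"decimal digits: {SIGNIFICAND_BITS * math.log10(2):.2f}")

# 0.1 is not representable; this is the nearest fp16 value.
print(to_fp16(0.1))

# Adding 1.0 four thousand ninety-six times: the fp16 accumulator stops growing
# once the spacing between neighbouring fp16 values exceeds 1.
ones = [1.0] * 4096
print(fp16_running_sum(ones), "vs", sum(ones))
```

A common mitigation for this kind of stall is to keep the inputs in fp16 but accumulate in a wider type, which is one reason the accumulation precision matters so much when adapting code to half precision.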

Recommendation: I feel that the Titan V is a bargain at $3000. It has most of the performance and features of the Tesla V100 in a desktop workstation friendly design. For developers working on new CUDA code I would certainly recommend it. For developers on tighter budgets and those mostly interested in using existing programs, the Titan Xp and 1080Ti offer very good performance for a more modest cost, especially the 1080Ti.

Secondary Recommendation: The docker images available in the NVIDIA NGC repository are very good. This is another example of NVIDIA's excellent support of the ecosystem around GPU accelerated computing. Highly recommended! I will be writing about how to set up and utilize NVIDIA docker and this repository soon.

Happy Computing! --dbk