Titan RTX vs. 2080 Ti vs. 1080 Ti vs. Titan Xp vs. Titan V vs. Tesla V100.

In this post, Lambda Labs benchmarks the Titan RTX's Deep Learning performance vs. other common GPUs. We measured the Titan RTX's single-GPU training performance on ResNet50, ResNet152, Inception3, Inception4, VGG16, AlexNet, and SSD. Multi-GPU training speeds are not covered.

TL;DR

Hardware Setup

Lambda Dual - Deep Learning Workstation with 2 Titan RTX GPUs

Titan RTX's FP32 performance is...

~8% faster than the RTX 2080 Ti

~47% faster than the GTX 1080 Ti

~31% faster than the Titan Xp

~4% faster than the Titan V

~14% slower than the Tesla V100 (32 GB)

when comparing # images processed per second while training.
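These aggregate percentages can be reproduced from the raw per-model numbers in the Raw Results section. A minimal sketch (assuming the aggregate is a simple average of per-model throughput ratios, which matches the published figure):

```python
# Sketch (assumed method): average the per-model throughput ratios to get the
# aggregate "X% faster" figure. Numbers are FP32 images/sec from the Raw
# Results tables (Titan RTX vs. GTX 1080 Ti).
titan_rtx = {"ResNet50": 312, "ResNet152": 115, "InceptionV3": 212,
             "InceptionV4": 83, "VGG16": 191, "AlexNet": 3980, "SSD300": 162}
gtx_1080_ti = {"ResNet50": 208, "ResNet152": 81, "InceptionV3": 136,
               "InceptionV4": 58, "VGG16": 134, "AlexNet": 2762, "SSD300": 108}

ratios = [titan_rtx[m] / gtx_1080_ti[m] for m in titan_rtx]
avg_speedup_pct = (sum(ratios) / len(ratios) - 1) * 100
print(f"Titan RTX vs. 1080 Ti (FP32): ~{avg_speedup_pct:.0f}% faster")  # ~47%
```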

Titan RTX's FP16 performance is...

21% faster than the RTX 2080 Ti

110% faster than the GTX 1080 Ti

92% faster than the Titan Xp

2% slower than the Titan V

Stay tuned for comparison to the V100 (32 GB)

when comparing # images processed per second while training.

Pricing

Titan RTX: $2,499.00 (source: NVIDIA's website)

RTX 2080 Ti: ~$1,300.00 (source: Amazon)

Conclusion

RTX 2080 Ti is the best GPU for Machine Learning / Deep Learning if... 11 GB of GPU memory is sufficient for your training needs (for many people, it is). The 2080 Ti offers the best price/performance among the Titan RTX, Tesla V100, Titan V, GTX 1080 Ti, and Titan Xp.

Titan RTX is the best GPU for Machine Learning / Deep Learning if... 11 GB of memory isn't sufficient for your training needs. However, before concluding this, try training at half-precision (16-bit). This effectively doubles your GPU memory, potentially at some cost in training accuracy. If you're already successfully training at FP16 and 11 GB still isn't enough, then choose the Titan RTX -- otherwise, go with the RTX 2080 Ti. At half-precision, the Titan RTX's 24 GB behaves like an effective 48 GB of GPU memory.
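The memory arithmetic behind the "effectively doubles" claim is simple: FP16 tensors take 2 bytes per element instead of FP32's 4. A back-of-the-envelope sketch (the element count below is a hypothetical example, not a real model):

```python
# FP16 stores 2 bytes per element vs. FP32's 4, so the same physical card
# holds roughly twice as many tensor elements at half-precision.
num_elements = 1_000_000_000  # hypothetical example model/activation size
fp32_bytes = num_elements * 4  # 4 bytes per float32 element
fp16_bytes = num_elements * 2  # 2 bytes per float16 element
print(fp32_bytes / fp16_bytes)  # 2.0 -> half-precision doubles effective capacity
```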

Tesla V100 is the best GPU for Machine Learning / Deep Learning if... price isn't important, you need every bit of GPU memory available, or time to market of your product is of utmost importance.

Methods

All models were trained on a synthetic dataset to isolate GPU performance from CPU pre-processing performance and reduce spurious I/O bottlenecks.
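A synthetic dataset in this sense is just randomly generated tensor data shaped like real images, so the GPU never waits on disk reads or CPU preprocessing. A minimal stdlib-only sketch of the idea (shapes are illustrative, not the benchmark's actual code):

```python
import random

def synthetic_batch(batch_size=4, height=8, width=8, channels=3):
    """Random 'images' requiring no disk I/O or CPU preprocessing."""
    return [[[[random.random() for _ in range(channels)]
              for _ in range(width)]
             for _ in range(height)]
            for _ in range(batch_size)]

batch = synthetic_batch()
print(len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))  # 4 8 8 3
```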

For each GPU/model pair, 10 training experiments were conducted and then averaged.

The "Normalized Training Performance" of a GPU is calculated by dividing its images / sec performance on a specific model by the images / sec performance of the 1080 Ti on that same model.
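For example, using the FP32 ResNet50 numbers from the Raw Results tables, the Titan RTX's normalized score works out as:

```python
# Normalized Training Performance = a GPU's images/sec on a model divided by
# the GTX 1080 Ti's images/sec on that same model (FP32 ResNet50 shown).
titan_rtx_ips = 312    # Titan RTX, ResNet50, FP32
gtx_1080_ti_ips = 208  # GTX 1080 Ti, ResNet50, FP32

normalized = titan_rtx_ips / gtx_1080_ti_ips
print(normalized)  # 1.5 -> the Titan RTX does 1.5x the 1080 Ti's throughput here
```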

The Titan RTX, 2080 Ti, Titan V, and V100 benchmarks utilized Tensor Cores.

Batch-sizes

| Model       | Batch Size |
|-------------|------------|
| ResNet-50   | 64         |
| ResNet-152  | 32         |
| InceptionV3 | 64         |
| InceptionV4 | 16         |
| VGG16       | 64         |
| AlexNet     | 512        |
| SSD         | 32         |

Software

Ubuntu 18.04

TensorFlow: v1.11.0

CUDA: 10.0.130

cuDNN: 7.4.1

NVIDIA Driver: 415.25

Raw Results

The tables below display the raw performance of each GPU while training in FP32 mode (single-precision) and FP16 mode (half-precision), respectively. Note that the unit of measurement is images processed per second, and results are rounded to the nearest integer.

FP32 - Number of images processed per second

| Model / GPU | Titan RTX | 1080 Ti | Titan Xp | Titan V | 2080 Ti | V100 |
|-------------|-----------|---------|----------|---------|---------|------|
| ResNet50    | 312       | 208     | 237      | 300     | 294     | 369  |
| ResNet152   | 115       | 81      | 90       | 107     | 110     | 132  |
| InceptionV3 | 212       | 136     | 151      | 208     | 194     | 243  |
| InceptionV4 | 83        | 58      | 63       | 77      | 79      | 91   |
| VGG16       | 191       | 134     | 154      | 195     | 170     | 233  |
| AlexNet     | 3980      | 2762    | 3004     | 3796    | 3627    | 4708 |
| SSD300      | 162       | 108     | 123      | 156     | 149     | 187  |

FP16 - Number of images processed per second

| Model / GPU | Titan RTX | 1080 Ti | Titan Xp | Titan V | 2080 Ti |
|-------------|-----------|---------|----------|---------|---------|
| ResNet50    | 540       | 263     | 289      | 539     | 466     |
| ResNet152   | 188       | 96      | 104      | 181     | 167     |
| InceptionV3 | 342       | 156     | 169      | 352     | 286     |
| InceptionV4 | 121       | 61      | 67       | 116     | 106     |
| VGG16       | 343       | 149     | 166      | 383     | 255     |
| AlexNet     | 6312      | 2891    | 3104     | 6746    | 4988    |
| SSD300      | 248       | 122     | 136      | 245     | 195     |
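One way to read the two tables together is the per-model FP16-over-FP32 speedup on a single card, which shows what the Tensor Cores buy you. A quick sketch for the Titan RTX, using the Raw Results numbers above:

```python
# FP16 / FP32 throughput ratio per model for the Titan RTX, taken from the
# two Raw Results tables above.
fp32 = {"ResNet50": 312, "ResNet152": 115, "InceptionV3": 212,
        "InceptionV4": 83, "VGG16": 191, "AlexNet": 3980, "SSD300": 162}
fp16 = {"ResNet50": 540, "ResNet152": 188, "InceptionV3": 342,
        "InceptionV4": 121, "VGG16": 343, "AlexNet": 6312, "SSD300": 248}

for model in fp32:
    print(f"{model}: {fp16[model] / fp32[model]:.2f}x")
```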

Reproduce the benchmarks yourself

All benchmarking code is available on Lambda Labs' GitHub repo. Share your results by emailing s@lambdalabs.com or tweeting @LambdaAPI. Be sure to include the hardware specifications of the machine you used.

Step One: Clone benchmark repo

git clone https://github.com/lambdal/lambda-tensorflow-benchmark.git --recursive

Step Two: Run benchmark

Pass a gpu_index (default: 0) and num_iterations (default: 10):

cd lambda-tensorflow-benchmark
./benchmark.sh gpu_index num_iterations

Step Three: Report results

Check the repo directory for a folder named <cpu>-<gpu>.logs (generated by benchmark.sh).

Use the same num_iterations in benchmarking and reporting.