State-of-the-art (SOTA) deep learning models have massive memory footprints. Many GPUs don't have enough VRAM to train them. In this post, we determine which GPUs can train state-of-the-art networks without throwing memory errors. We also benchmark each GPU's training performance.

TLDR:

The following GPUs can train all SOTA language and image models as of February 2020:

- RTX 8000: 48 GB VRAM, ~$5,500.
- RTX 6000: 24 GB VRAM, ~$4,000.
- Titan RTX: 24 GB VRAM, ~$2,500.

The following GPUs can train most (but not all) SOTA models:

- RTX 2080 Ti: 11 GB VRAM, ~$1,150.*
- GTX 1080 Ti: 11 GB VRAM, ~$800 refurbished.*
- RTX 2080: 8 GB VRAM, ~$720.*
- RTX 2070: 8 GB VRAM, ~$500.*

The following GPU is not a good fit for training SOTA models:

- RTX 2060: 6 GB VRAM, ~$359.

* Training on these GPUs requires small batch sizes, so expect lower final model accuracy: small batches give a noisier estimate of the loss landscape's gradients, which can hurt convergence.
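A common workaround on memory-limited cards is gradient accumulation: run several small micro-batches, sum their gradients, and apply a single optimizer step, recovering the gradient of the larger effective batch. A minimal NumPy sketch (the least-squares loss and data here are purely illustrative, not from the benchmark):

```python
import numpy as np

def grad_mse(w, x, y):
    # Gradient of 0.5 * mean((x @ w - y)**2) with respect to w.
    return x.T @ (x @ w - y) / len(y)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))   # full batch of 32 examples
y = rng.normal(size=32)
w = rng.normal(size=4)

# Full-batch gradient: needs all 32 examples in memory at once.
full_grad = grad_mse(w, x, y)

# Gradient accumulation: 4 micro-batches of 8, averaged.
acc = np.zeros_like(w)
for xb, yb in zip(np.split(x, 4), np.split(y, 4)):
    acc += grad_mse(w, xb, yb)
acc /= 4

assert np.allclose(full_grad, acc)  # same update, less memory per step
```

Note that accumulation matches the large-batch gradient but not large-batch batch-norm statistics, so it is not a perfect substitute for real VRAM.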

Image models

Maximum batch size before running out of memory

| Model / GPU | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|
| NasNet Large | 4 | 8 | 8 | 8 | 8 | 32 | 32 | 64 |
| DeepLabv3 | 2 | 2 | 2 | 4 | 4 | 8 | 8 | 16 |
| Yolo v3 | 2 | 4 | 4 | 4 | 4 | 8 | 8 | 16 |
| Pix2Pix HD | 0* | 0* | 0* | 0* | 0* | 1 | 1 | 2 |
| StyleGAN | 1 | 1 | 1 | 4 | 4 | 8 | 8 | 16 |
| MaskRCNN | 1 | 2 | 2 | 2 | 2 | 8 | 8 | 16 |

*The GPU does not have enough memory to run the model.
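Maximum batch sizes like those above can be found mechanically rather than by hand. Assuming a probe function `fits(b)` that attempts one training step at batch size `b` and reports whether it survives (for example, by catching the framework's out-of-memory error), a binary search finds the largest batch that fits. Both `fits` and the 1.4 GB-per-example toy model below are hypothetical:

```python
def max_batch_size(fits, lo=1, hi=1024):
    """Largest b in [lo, hi] for which fits(b) is True, or 0 if none.

    `fits` is a user-supplied probe, e.g. one that runs a single
    training step at batch size b and returns False on an OOM error.
    Assumes monotonicity: if b fits, every smaller batch fits too.
    """
    if not fits(lo):
        return 0
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if fits(mid):
            lo = mid      # mid fits; search the upper half
        else:
            hi = mid - 1  # mid OOMs; search the lower half
    return lo

# Toy example: a fake 11 GB card where each example costs ~1.4 GB.
print(max_batch_size(lambda b: b * 1.4 <= 11))  # -> 7
```

In practice each probe should run a few steps, since memory use can peak after the first iteration.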

Performance, measured in images processed per second


Language models

Maximum batch size before running out of memory

| Model / GPU | Units | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|---|
| Transformer Big | Tokens | 0* | 2000 | 2000 | 4000 | 4000 | 8000 | 8000 | 16000 |
| Conv. Seq2Seq | Tokens | 0* | 2000 | 2000 | 3584 | 3584 | 8000 | 8000 | 16000 |
| unsupMT | Tokens | 0* | 500 | 500 | 1000 | 1000 | 4000 | 4000 | 8000 |
| BERT Base | Sequences | 8 | 16 | 16 | 32 | 32 | 64 | 64 | 128 |
| BERT Finetune | Sequences | 1 | 6 | 6 | 6 | 6 | 24 | 24 | 48 |
| MT-DNN | Sequences | 0* | 1 | 1 | 2 | 2 | 4 | 4 | 8 |

*The GPU does not have enough memory to run the model.

Performance

| Model / GPU | Units | 2060 | 2070 | 2080 | 1080 Ti | 2080 Ti | Titan RTX | RTX 6000 | RTX 8000 |
|---|---|---|---|---|---|---|---|---|---|
| Transformer Big | Words/sec | 0* | 4597 | 6317 | 6207 | 7780 | 8498 | 7407 | 7507 |
| Conv. Seq2Seq | Words/sec | 0* | 7721 | 9950 | 5870 | 15671 | 21180 | 20500 | 22450 |
| unsupMT | Words/sec | 0* | 1010 | 1212 | 1824 | 2025 | 3850 | 3725 | 3735 |
| BERT Base | Ex./sec | 34 | 47 | 58 | 60 | 83 | 102 | 98 | 94 |
| BERT Finetune | Ex./sec | 7 | 15 | 18 | 17 | 22 | 30 | 29 | 27 |
| MT-DNN | Ex./sec | 0* | 3 | 4 | 8 | 9 | 18 | 18 | 28 |

*The GPU does not have enough memory to run the model.

Results normalized by Quadro RTX 8000

Figure 2. Training throughput normalized against the Quadro RTX 8000. Left: image models. Right: language models.

Conclusions

Language models benefit more from larger GPU memory than image models do. Note how the right diagram is steeper than the left: language models tend to be memory-bound, while image models tend to be compute-bound.

GPUs with higher VRAM have better performance because using larger batch sizes helps saturate the CUDA cores.

GPUs with higher VRAM enable proportionally larger batch sizes. Back-of-the-envelope calculations yield reasonable results: GPUs with 24 GB of VRAM can fit ~3x larger batches than GPUs with 8 GB of VRAM.
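That back-of-the-envelope reasoning can be made concrete: a fixed share of VRAM goes to weights, optimizer state, and framework overhead, so only the remainder scales the batch. The 2 GB fixed cost and 0.5 GB-per-example figure below are illustrative assumptions, not measurements:

```python
def max_batch(vram_gb, fixed_gb=2.0, per_example_gb=0.5):
    """Rough batch-size estimate: activations get whatever VRAM is
    left after a fixed cost (weights, optimizer state, CUDA context)."""
    return int((vram_gb - fixed_gb) / per_example_gb)

for gb in (8, 24, 48):
    print(gb, "GB ->", max_batch(gb))
# 8 GB -> 12, 24 GB -> 44, 48 GB -> 92
```

The fixed cost is why a 24 GB card fits closer to 3-4x (not 3x exactly) the batch of an 8 GB card.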

Language models are disproportionately memory intensive for long sequences because attention is quadratic in the sequence length.
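The quadratic cost is easy to see from the attention matrix itself: each layer materializes one seq_len x seq_len score matrix per head. A rough estimate of the score matrices alone, with fp16 storage and head/layer counts chosen to resemble BERT Base (all assumptions for illustration):

```python
def attn_score_bytes(seq_len, heads=12, layers=12, bytes_per_el=2):
    """Memory for attention score matrices only: one seq_len x seq_len
    matrix per head per layer, ignoring activations and weights."""
    return layers * heads * seq_len * seq_len * bytes_per_el

for n in (128, 512, 2048):
    print(n, "tokens:", attn_score_bytes(n) / 2**20, "MiB")
# 128 tokens: 4.5 MiB; 512 tokens: 72.0 MiB; 2048 tokens: 1152.0 MiB
```

A 4x longer sequence costs 16x the attention memory, which is why long-sequence language models run out of VRAM so quickly.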

GPU Recommendations

- RTX 2060 (6 GB): if you want to explore deep learning in your spare time.
- RTX 2070 or 2080 (8 GB): if you are serious about deep learning, but your GPU budget is $600-800. Eight GB of VRAM can fit the majority of models.
- RTX 2080 Ti (11 GB): if you are serious about deep learning and your GPU budget is ~$1,200. The RTX 2080 Ti is ~40% faster than the RTX 2080.
- Titan RTX and Quadro RTX 6000 (24 GB): if you are working on SOTA models extensively, but don't have the budget for the future-proofing available with the RTX 8000.
- Quadro RTX 8000 (48 GB): you are investing in the future and might even be lucky enough to research SOTA deep learning in 2020.

Lambda offers GPU laptops and workstations with GPU configurations ranging from a single RTX 2070 up to 4 Quadro RTX 8000s. Additionally, we offer servers supporting up to 10 Quadro RTX 8000s or 16 Tesla V100 GPUs.

Image Models

| Model | Task | Dataset | Image Size | Repo |
|---|---|---|---|---|
| NasNet Large | Image Classification | ImageNet | 331x331 | GitHub |
| DeepLabv3 | Image Segmentation | PASCAL VOC | 513x513 | GitHub |
| Yolo v3 | Object Detection | MSCOCO | 608x608 | GitHub |
| Pix2Pix HD | Image Stylization | CityScape | 2048x1024 | GitHub |
| StyleGAN | Image Generation | FFHQ | 1024x1024 | GitHub |
| MaskRCNN | Instance Segmentation | MSCOCO | 800x1333 | GitHub |

Language Models