NVIDIA’s complete solution stack, from GPUs to libraries to containers on NVIDIA GPU Cloud (NGC), allows data scientists to get up and running with deep learning quickly. The NVIDIA A100 Tensor Core GPU provides unprecedented acceleration at every scale, across every framework and type of neural network, and set records in the available-systems category of MLPerf, the AI industry's leading benchmark, a testament to our GPU-accelerated platform approach.

NVIDIA® TensorRT™ running on NVIDIA GPUs enables the most efficient deep learning inference performance across multiple application areas and models. This versatility gives data scientists wide latitude to create the optimal low-latency solution. Visit NVIDIA GPU Cloud (NGC) to download any of these containers.

NVIDIA A100 Tensor Core GPUs provide unprecedented acceleration at every scale and across every framework and type of neural network. Third-generation Tensor Cores bring maximum versatility by accelerating a full range of precisions, from FP32 to FP16 to INT8 and all the way down to INT4, extending NVIDIA’s AI inference leadership.

NVIDIA V100 Tensor Core GPUs leverage mixed precision to combine high throughput with low latency across every type of neural network. The NVIDIA T4 is an inference GPU designed for optimal power consumption and latency in ultra-efficient scale-out servers. Read the inference whitepaper to learn more about NVIDIA’s inference platform.

Measuring inference performance involves balancing many variables. PLASTER is an acronym that describes the key elements of deep learning performance: each letter identifies a factor (Programmability, Latency, Accuracy, Size of Model, Throughput, Energy Efficiency, Rate of Learning) that must be weighed to arrive at the right set of tradeoffs and produce a successful deep learning implementation. Refer to NVIDIA’s PLASTER whitepaper for more details.
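Several of the measurable PLASTER factors are related by simple arithmetic: throughput is inputs processed per unit time, and energy efficiency is throughput per watt. A minimal sketch of deriving these three metrics from raw timings (the function name and inputs are illustrative, not from NVIDIA's whitepaper):

```python
def inference_metrics(latencies_s, batch_size, avg_power_w):
    """Derive throughput, tail latency, and energy efficiency from
    a list of per-batch inference times (in seconds) and a measured
    average board power (in watts)."""
    total_time = sum(latencies_s)
    throughput = batch_size * len(latencies_s) / total_time       # inputs/sec
    p99_latency = sorted(latencies_s)[int(0.99 * len(latencies_s))]
    efficiency = throughput / avg_power_w                          # inputs/sec/watt
    return throughput, p99_latency, efficiency

# Example: 100 batches of 8 inputs, each taking 2 ms, at 70 W average power
tput, p99, eff = inference_metrics([0.002] * 100, batch_size=8, avg_power_w=70.0)
```

Note the tradeoff PLASTER highlights: raising the batch size usually raises both throughput and efficiency, but at the cost of higher per-request latency.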

NVIDIA landed top performance spots on all five MLPerf Inference 0.5 benchmarks with the best per-accelerator performance among commercially available products.

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Server)

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (Offline)

NVIDIA Performance on MLPerf Inference v0.5 Benchmarks (ResNet-50 V1.5 Offline Scenario)

MLPerf Inference Performance

NVIDIA Turing 70W

Network | Type | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version
MobileNet v1 | Server | - | 16,884 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
MobileNet v1 | Offline | - | 17,726 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
ResNet-50 v1.5 | Server | - | 5,193 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
ResNet-50 v1.5 | Offline | - | 5,622 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | ImageNet [224x224] | NVIDIA T4
SSD MobileNet v1 | Server | - | 7,078 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4
SSD MobileNet v1 | Offline | - | 7,609 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [300x300] | NVIDIA T4
SSD ResNet-34 | Server | - | 126 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4
SSD ResNet-34 | Offline | - | 137 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | COCO [1200x1200] | NVIDIA T4
GNMT | Server | - | 198 queries/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4
GNMT | Offline | - | 354 inputs/sec | - | - | 1x T4 | Supermicro 4029GP-TRT-OTO-28 | - | - | WMT16 | NVIDIA T4

NVIDIA Turing 280W

Network | Type | Batch Size | Throughput | Efficiency | Latency | GPU | Server | Container | Precision | Dataset | GPU Version
MobileNet v1 | Server | - | 49,775 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
MobileNet v1 | Offline | - | 55,597 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
ResNet-50 v1.5 | Server | - | 15,008 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
ResNet-50 v1.5 | Offline | - | 16,563 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | ImageNet [224x224] | TitanRTX
SSD MobileNet v1 | Server | - | 20,503 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX
SSD MobileNet v1 | Offline | - | 22,945 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [300x300] | TitanRTX
SSD ResNet-34 | Server | - | 388 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX
SSD ResNet-34 | Offline | - | 415 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | COCO [1200x1200] | TitanRTX
GNMT | Server | - | 645 queries/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX
GNMT | Offline | - | 1,061 inputs/sec | - | - | 1x TitanRTX | SCAN 3XS DBP T496X2 Fluid | - | - | WMT16 | TitanRTX

Inference Natural Language Processing

BERT Inference Throughput

NVIDIA A100 BERT Inference Benchmarks

Network | Network Type | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
BERT-Large with Sparsity | Attention | 94 | 6,188 sequences/sec | - | - | 1x A100 | DGX A100 | - | INT8 | SQuAD v1.1 | - | A100 SXM4-40GB

Inference Image Classification on CNNs with TensorRT

ResNet-50 v1.5 Throughput

ResNet-50 v1.5 Latency

ResNet-50 v1.5 Power Efficiency

Inference Performance

A100 Inference Performance

Network | Batch Size | 1/7 MIG Throughput | 7 MIG Throughput | Full Chip Throughput | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
BERT-Large | 1 | 240 sequences/sec | 1,680 sequences/sec | 625 sequences/sec | 1x A100 | NVIDIA DGX A100 | - | INT8 | SQuAD v1.1 | TensorRT 7.1 | A100-SXM4-40GB
BERT-Large | 256 | 574 sequences/sec | 4,018 sequences/sec | 4,125 sequences/sec | 1x A100 | NVIDIA DGX A100 | - | INT8 | SQuAD v1.1 | TensorRT 7.1 | A100-SXM4-40GB
Jasper | 1 | 115 inferences/sec | 804 inferences/sec | 227 inferences/sec | 1x A100 | NVIDIA DGX A100 | - | FP16 | LibriSpeech | TensorRT 7.1 | A100-SXM4-40GB
Jasper | 64 | 176 inferences/sec | 1,230 inferences/sec | 1,225 inferences/sec | 1x A100 | NVIDIA DGX A100 | - | FP16 | LibriSpeech | TensorRT 7.1 | A100-SXM4-40GB
ResNet-50v1.5 | 128 | - | - | 23,973 images/sec | 1x A100 | NVIDIA DGX A100 | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | A100-SXM4-40GB
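The MIG columns compare one 1/7 GPU slice, all seven slices running concurrently, and the undivided chip. The interesting check is how close the seven-slice aggregate comes to ideal linear scaling over a single slice; a small sketch of that calculation (the helper name is illustrative, and the numbers come from the BERT-Large batch-1 row in the table above):

```python
def mig_scaling_efficiency(per_slice, all_slices, n_slices=7):
    """Fraction of ideal linear scaling achieved when n_slices MIG
    instances run the same workload concurrently: 1.0 means the
    aggregate equals n_slices times the single-slice throughput."""
    return all_slices / (per_slice * n_slices)

# BERT-Large, batch 1: 240 seq/sec on one 1/7 slice, 1,680 seq/sec on all seven
eff = mig_scaling_efficiency(240, 1680)   # 1.0, i.e. perfectly linear scaling
```

This isolation is why, at batch 1, seven concurrent MIG slices (1,680 sequences/sec) can outperform the full undivided chip (625 sequences/sec): small batches leave much of the full chip idle, while each slice stays busy.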

V100 Inference Performance

Network | Batch Size | Throughput | Efficiency | Latency (ms) | GPU | Server | Container | Precision | Dataset | Framework | GPU Version
MobileNet V1 | 1 | 4,732 images/sec | 32.27 images/sec/watt | 0.21 | 1x V100 | DGX-1 | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | V100-SXM2-32GB
MobileNet V1 | 2 | 7,395 images/sec | 40.04 images/sec/watt | 0.27 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
MobileNet V1 | 8 | 15,144 images/sec | 67.62 images/sec/watt | 0.53 | 1x V100 | DGX-2H | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
MobileNet V1 | 128 | 32,774 images/sec | 123.34 images/sec/watt | 3.91 | 1x V100 | DGX-2H | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
MobileNet V1 | 239 | 34,405 images/sec | 127.2 images/sec/watt | 7 | 1x V100 | DGX-2H | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50 | 1 | 1,237 images/sec | 8.1 images/sec/watt | 0.81 | 1x V100 | DGX-1 | 20.06-py3 | INT8 | Synthetic | TensorRT 7.1.2 | V100-SXM2-16GB
ResNet-50 | 2 | 1,873 images/sec | 11.52 images/sec/watt | 1.07 | 1x V100 | DGX-1 | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM2-16GB
ResNet-50 | 8 | 4,496 images/sec | 16.5 images/sec/watt | 1.78 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50 | 52 | 8,504 images/sec | 21.47 images/sec/watt | 6.11 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50 | 128 | 7,531 images/sec | 26 images/sec/watt | 17 | 1x V100 | DGX-1 | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM2-16GB
ResNet-50 | 128 | 8,582 images/sec | 21.21 images/sec/watt | 6.11 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50v1.5 | 1 | 1,069 images/sec | 7.4 images/sec/watt | 0.94 | 1x V100 | DGX-2 | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB
ResNet-50v1.5 | 2 | 1,876 images/sec | 11 images/sec/watt | 1.1 | 1x V100 | DGX-2 | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB
ResNet-50v1.5 | 8 | 4,349 images/sec | 15.75 images/sec/watt | 1.84 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50v1.5 | 52 | 8,051 images/sec | 20.09 images/sec/watt | 6.46 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
ResNet-50v1.5 | 128 | 7,097 images/sec | 24 images/sec/watt | 18 | 1x V100 | DGX-1 | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM2-16GB
ResNet-50v1.5 | 128 | 8,145 images/sec | 19.9 images/sec/watt | 15.72 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Synthetic | TensorRT 7.1.2 | V100-SXM3-32GB-H
NCF | 16,384 | 33,022,334 samples/sec | - | 0 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | MovieLens 20M | PyTorch 1.6.0a0+9907a3e | V100-SXM3-32GB-H
BERT-BASE | 1 | 766 sequences/sec | - | 1.31 | 1x V100 | DGX-1 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM2-16GB
BERT-BASE | 2 | 1,295 sequences/sec | - | 1.54 | 1x V100 | DGX-1 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM2-16GB
BERT-BASE | 8 | 2,358 sequences/sec | - | 3.39 | 1x V100 | DGX-1 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM2-16GB
BERT-BASE | 26 | 3,002 sequences/sec | 13.36 sequences/sec/watt | 8.66 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB
BERT-BASE | 128 | 3,033 sequences/sec | 11.82 sequences/sec/watt | 42.2 | 1x V100 | DGX-2 | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB
BERT-LARGE | 1 | 334 sequences/sec | 2.57 sequences/sec/watt | 3 | 1x V100 | DGX-2H | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB-H
BERT-LARGE | 2 | 549 sequences/sec | 3.73 sequences/sec/watt | 3.64 | 1x V100 | DGX-2H | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB-H
BERT-LARGE | 8 | 837 sequences/sec | 4.1 sequences/sec/watt | 9.56 | 1x V100 | DGX-2H | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB-H
BERT-LARGE | 128 | 1,043 sequences/sec | 3.1 sequences/sec/watt | 123 | 1x V100 | DGX-2H | 20.03-py3 | Mixed | Sample Text | TensorRT 7.0.0 | V100-SXM3-32GB-H
Mask R-CNN | 1 | 16 images/sec | 0.12 images/sec/watt | - | 1x V100 | SuperMicro Server | 20.06-py3 | Mixed | COCO 2014 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
Mask R-CNN | 2 | 20 images/sec | 0.15 images/sec/watt | - | 1x V100 | SuperMicro Server | 20.06-py3 | Mixed | COCO 2014 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
Mask R-CNN | 8 | 26 images/sec | 0.35 images/sec/watt | 0.51 | 1x V100 | SuperMicro Server | 20.06-py3 | Mixed | COCO 2014 | TensorFlow 1.15.2 | V100S-PCIE-32GB
ResNeXt101 | 1 | 104 images/sec | 0.69 images/sec/watt | 9.44 | 1x V100 | SuperMicro Server | 20.06-py3 | FP32 | Imagenet2012 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
ResNeXt101 | 2 | 203 images/sec | 0.88 images/sec/watt | 9.65 | 1x V100 | SuperMicro Server | 20.06-py3 | FP32 | Imagenet2012 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
ResNeXt101 | 8 | 490 images/sec | 2.84 images/sec/watt | 16.19 | 1x V100 | SuperMicro Server | 20.06-py3 | Mixed | Imagenet2012 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
ResNeXt101 | 128 | 847 images/sec | 3.42 images/sec/watt | 111.27 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | Imagenet2012 | PyTorch 1.6.0a0+9907a3e | V100-SXM3-32GB-H
Tacotron2 | 1 | 1,457 total output mels/sec | - | 1.37 | 1x V100 | SuperMicro Server | 20.06-py3 | FP32 | LJSpeech 1.1 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
Tacotron2 | 4 | 5,914 total output mels/sec | - | 1.35 | 1x V100 | SuperMicro Server | 20.06-py3 | FP32 | LJSpeech 1.1 | PyTorch 1.6.0a0+9907a3e | V100S-PCIE-32GB
WaveGlow | 1 | 631,871 output samples/sec | - | 0.36 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | LJSpeech 1.1 | PyTorch 1.6.0a0+9907a3e | V100-SXM3-32GB-H
WaveGlow | 4 | 646,173 output samples/sec | - | 1.42 | 1x V100 | DGX-2H | 20.06-py3 | Mixed | LJSpeech 1.1 | PyTorch 1.6.0a0+9907a3e | V100-SXM3-32GB-H
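Since the Efficiency column is throughput per watt, the average board power implied by a row is simply throughput divided by efficiency. A small sketch of that back-calculation (the helper name is illustrative; the numbers are from the MobileNet V1 batch-1 row above):

```python
def implied_power_w(throughput, efficiency):
    """Average power draw (watts) implied by a throughput measurement
    (inputs/sec) and an efficiency measurement (inputs/sec/watt)."""
    return throughput / efficiency

# MobileNet V1, batch 1 on V100: 4,732 images/sec at 32.27 images/sec/watt
power = implied_power_w(4732, 32.27)   # roughly 147 W
```

This is a useful sanity check when comparing rows measured on different systems, since the table reports efficiency but not power directly.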

T4 Inference Performance