Because training deep learning models requires intensive computation, AI researchers are always on the lookout for new and better hardware and software platforms for their increasingly sophisticated models.

Which hardware platforms — TPU, GPU or CPU — are best suited for training deep learning models has been a matter of discussion in the AI community for years. A new Harvard University study proposes a benchmark suite to analyze the pros and cons of each.

“ParaDnn” is the product of a group of researchers at Harvard University's School of Engineering and Applied Sciences. The parameterized benchmark suite enables systematic comparisons between different computing platforms for deep learning models.

Unlike recent benchmarking efforts that have been limited to relatively small collections of DNN models, ParaDnn was designed to support broad and comprehensive benchmarking. It can generate thousands of multi-layer models, including fully connected models (FC), convolutional neural networks (CNN) and recurrent neural networks (RNN). Below are the ranges of the hyperparameters and dataset variables for each type of model.

Hyperparameters and dataset variables for FC, CNN and RNN
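To make the idea of a parameterized suite concrete, here is a minimal sketch of how a sweep over hyperparameters can expand into many model configurations. It is not ParaDnn's code: the function names (`fc_configs`, `fc_param_count`) and the sweep values are hypothetical and chosen only for illustration, not taken from the paper.

```python
from itertools import product

# Hypothetical sweep values for illustration only; these are NOT the
# ranges used by ParaDnn.
LAYERS = [4, 8, 16]
NODES = [32, 256, 2048]
BATCH_SIZES = [64, 1024]


def fc_configs(layers=LAYERS, nodes=NODES, batch_sizes=BATCH_SIZES):
    """Yield one configuration dict per combination of hyperparameters."""
    for n_layers, n_nodes, batch in product(layers, nodes, batch_sizes):
        yield {"layers": n_layers, "nodes": n_nodes, "batch_size": batch}


def fc_param_count(cfg, input_dim, output_dim):
    """Count weights and biases of a fully connected model whose hidden
    layers all share the same width (an assumption of this sketch)."""
    n = cfg["nodes"]
    first = input_dim * n + n                    # input -> first hidden layer
    hidden = (cfg["layers"] - 1) * (n * n + n)   # hidden -> hidden layers
    last = n * output_dim + output_dim           # last hidden -> output
    return first + hidden + last


configs = list(fc_configs())
```

Three values per axis already give 18 configurations here, so widening each range quickly produces the thousands of models the suite is described as covering.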

For the choice of hardware platforms, researchers benchmarked Google’s Cloud TPU v2/v3, NVIDIA’s V100 GPU and Intel Skylake CPU. The platform specifications are summarized below:

Benchmarked hardware platforms

The researchers started with a deep dive into TPU v2 and v3, revealing bottlenecks in computation capability, memory bandwidth, multi-chip overhead and device-host balance. They then conducted a comprehensive comparison of TPU, GPU and CPU to find out which hardware platform performs best for specific tasks, specialized software stacks and specific datatypes.

Researchers discovered that:

TPUs are highly optimized for large batches and CNNs, and have the highest training throughput;

GPUs show better flexibility and programmability for irregular computations such as small batches and non-MatMul computations;

CPUs have the best programmability, so they achieve the highest FLOPS utilization for RNNs, and they support the largest models thanks to their large memory capacity.

Observations summary table
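FLOPS utilization, one of the metrics in the findings above, is the share of a platform's peak floating-point rate that a workload actually achieves. The sketch below shows the arithmetic for a fully connected model. The function names are hypothetical, and the cost model (two operations per multiply-add, backward pass costing roughly twice the forward pass) is a common rule of thumb assumed here, not the paper's measurement method.

```python
def fc_training_flops(batch_size, layer_dims):
    """Approximate floating-point operations for one training step of a
    fully connected model.

    Assumptions of this sketch: each layer is a dense matrix multiply
    costing 2 * batch * d_in * d_out operations, and the backward pass
    costs about twice the forward pass (so the total is 3x forward).
    """
    forward = sum(
        2 * batch_size * d_in * d_out
        for d_in, d_out in zip(layer_dims, layer_dims[1:])
    )
    return 3 * forward


def flops_utilization(flops_per_step, step_time_s, peak_flops):
    """Achieved FLOPS divided by the platform's peak FLOPS."""
    achieved = flops_per_step / step_time_s
    return achieved / peak_flops
```

For example, `flops_utilization(fc_training_flops(1024, [784, 2048, 2048, 10]), step_time_s, peak_flops)` would give the utilization for a measured step time and a vendor-stated peak; both of those inputs have to come from an actual run and the hardware's specification.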

The researchers conclude that their parameterized benchmark is suitable for a wide range of deep learning models, and that the hardware and software comparisons offer valuable information for the design of specialized hardware and software for deep neural networks.

The paper Benchmarking TPU, GPU, and CPU Platforms for Deep Learning is on arXiv.