In 2012, Geoffrey Hinton’s research team used only two NVIDIA GPUs to train AlexNet, the revolutionary network architecture that handily won the ImageNet Large Scale Visual Recognition Challenge. It probably never occurred to these groundbreaking researchers that just seven years later, a new team of researchers would use more than 10,000 times as many GPUs to train their AI model.

A research team from NVIDIA, Oak Ridge National Laboratory (ORNL), and Uber has introduced new techniques that enabled them to train a fully convolutional neural network on the world’s fastest supercomputer, Summit, with up to 27,600 NVIDIA GPUs. They achieved a near-linear scaling efficiency of 0.93 in distributed training and produced a model capable of atomically accurate reconstruction of materials, a longstanding scientific problem in materials imaging.

What is Summit? In June 2018, the US Department of Energy’s Oak Ridge National Laboratory in Tennessee unveiled Summit, the world’s fastest supercomputer, whose computing power reaches 200 petaflops. Summit employs over 9,000 general-purpose IBM processors and 28,000 NVIDIA graphics processors along with a state-of-the-art interconnect, which makes it well suited to deep learning workloads.

Why does this research matter? The growing demand for AI compute is driving the development of distributed machine learning across multiple processors, particularly data parallelism, a technique that distributes the training data across different nodes so that it can be processed in parallel.
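To make the idea of data parallelism concrete, here is a minimal, single-process sketch in Python. It is an illustration under simplifying assumptions, not code from the paper: the toy model (fitting y = w * x), the four simulated workers, and the helper names `local_gradient`, `allreduce_mean`, and `data_parallel_step` are all hypothetical. Each worker computes a gradient on its own shard of the data, and an averaging step stands in for the allreduce that real frameworks perform over the network.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an allreduce: every worker would receive this same mean."""
    return sum(values) / len(values)

def data_parallel_step(w, shards, lr=0.01):
    """One training step: per-shard gradients, then a shared averaged update."""
    # On real hardware each shard's gradient is computed on a different device.
    grads = [local_gradient(w, shard) for shard in shards]
    return w - lr * allreduce_mean(grads)

# Toy dataset generated from y = 3x, split evenly across 4 simulated workers.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
print(f"learned w = {w:.6f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the workers stay in step with what a single large machine would compute. The averaging step is the communication that becomes the bottleneck as the number of workers grows.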

Most existing training techniques are tailored for scenarios that require only tens to hundreds of AI processors. These methods, however, fail to scale efficiently on supercomputers with tens of thousands of AI accelerators. That has motivated scientists to explore data parallelism problems on supercomputers.

Major contributions: Researchers proposed new gradient reduction strategies that produce optimal overlap between computation and communication. Essentially, these new strategies comprise two techniques: a lightweight worker coordination technique (BitAllReduce) and a gradient tensor grouping strategy (Grouping). Researchers say these two orchestration strategies “improve on different aspects of distributed deep learning as currently implemented in Horovod.”
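The article does not spell out how the two strategies work internally, so the following Python sketch is only one plausible reading of their names, not the authors’ implementation and not Horovod’s API. It assumes that workers can agree on which gradient tensors are ready by combining per-worker bit vectors with a bitwise AND, and that grouping means packing several tensors into one reduction. The function names `ready_everywhere` and `group_tensors` and the size limit are hypothetical.

```python
def ready_everywhere(bitvectors):
    """Bitwise AND across workers (assumed coordination scheme).

    Bit i is set in the result only if tensor i is ready on every worker.
    """
    result = ~0  # all bits set; AND-ing narrows it down
    for bits in bitvectors:
        result &= bits
    return result

def group_tensors(sizes, max_group_bytes):
    """Greedily pack consecutive tensors into groups (assumed grouping scheme).

    A tensor larger than the limit ends up in a group of its own.
    """
    groups, current, current_bytes = [], [], 0
    for idx, size in enumerate(sizes):
        if current and current_bytes + size > max_group_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(idx)
        current_bytes += size
    if current:
        groups.append(current)
    return groups

# Three simulated workers report which of four tensors are ready (bit i = tensor i).
agreed = ready_everywhere([0b1011, 0b0111, 0b1111])
print(bin(agreed))  # tensors 0 and 1 are ready on all workers

# Five gradient tensors with illustrative sizes, packed into groups of at most 8 bytes.
print(group_tensors([4, 4, 16, 2, 2], max_group_bytes=8))
```

Under these assumptions, a single small reduction replaces per-tensor negotiation between workers, and fewer, larger reductions leave more room to overlap communication with computation.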

In their experiments, researchers achieved a scaling efficiency of 0.93 at 4,600 nodes (27,600 GPUs) during distributed deep learning. In comparison, using 60 NVIDIA K80 GPUs to train ResNet-152 yields only a 50x speed-up, equivalent to a scaling efficiency of 0.83, according to TensorFlow’s benchmark tests.
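The arithmetic behind these figures can be checked in a few lines of Python. The sketch assumes scaling efficiency is defined as the measured speed-up divided by the number of GPUs, which matches the K80 numbers quoted above; the paper may measure efficiency against a different baseline, so the implied speed-up on Summit is an illustration rather than a reported result. The function name `scaling_efficiency` is hypothetical.

```python
def scaling_efficiency(speedup, num_devices):
    """Fraction of ideal linear speed-up actually achieved (assumed definition)."""
    return speedup / num_devices

# TensorFlow benchmark quoted above: 50x speed-up on 60 K80 GPUs.
k80_efficiency = scaling_efficiency(50, 60)
print(f"K80 efficiency: {k80_efficiency:.2f}")

# Speed-up implied by an efficiency of 0.93 on 27,600 GPUs,
# if efficiency were measured relative to a single GPU.
implied_speedup = 0.93 * 27_600
print(f"Implied speed-up on Summit: {implied_speedup:,.0f}x")
```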

Researchers measured the power consumption of Summit’s main hardware components during distributed training, an indicator of efficiency on a supercomputer. They also profiled the compute performance of distributed training with the proposed strategies, breaking it down into I/O, the computation performed during the DNN’s forward and backward propagation, and the communication operations embedded in the computation graph.

Researchers also applied a deep neural network to solve the phase problem in the atomic imaging of materials, using a dataset 500 terabytes in size.

Implications: Nouamane Laanait, the paper’s first author, told Synced that “the novel distributed deep learning techniques we introduced both increase the training efficiency of existing deep learning (DL) applications but also open up new frontiers in developing DL models in new domains, for instance the physical sciences, whose massive data sizes necessitate the use of high-performance computing. Scaling up models and data sizes has in the past advanced the field of machine learning, our work is very much in that spirit, where we efficiently scaled-up distributed training to the computing scale of the world’s fastest supercomputer.”

Related news: In August, NVIDIA announced another scaling breakthrough, training the BERT-Large language model in just 53 minutes using 1,472 V100 GPUs. The NVIDIA research group also scaled the training of a GPT-2 model to 8.3 billion parameters, using 8-way model parallelism and 64-way data parallelism on 512 GPUs.

The paper Exascale Deep Learning for Scientific Inverse Problems is on arXiv.