Deep neural network (DNN) models such as BigGAN, BERT, and GPT-2 have demonstrated that larger models deliver better task performance. These huge models are, however, becoming increasingly difficult to train. Google this week introduced GPipe, an open-source library that dramatically improves training efficiency for large-scale neural network models.

In 2014, GoogLeNet finished first in the ImageNet visual recognition challenge. The winning model consisted of four million parameters and achieved 74.8 percent top-1 accuracy. Three years later, Squeeze-and-Excitation Networks scored 82.7 percent to win the challenge with a model containing 145.8 million parameters, some 36x more than GoogLeNet.



Because chip memory growth can hardly keep pace with the demands of training ever-larger neural networks, there is a need for scalable and efficient infrastructure that overcomes memory limitations and makes large-scale DNN training practical.



GPipe is a distributed machine learning library for training any DNN that consists of multiple sequential layers. It simplifies deployment across multiple accelerators, allowing developers to train larger models and to scale performance without tuning hyperparameters.

Strong correlation between ImageNet accuracy and model size for recently developed representative image classification models

From Mini- to Micro-Batches

Two main methods are currently used to accelerate the training of moderate-size DNN models: data parallelism, which distributes the input data across many machines; and moving the model to an accelerator such as a GPU or TPU, whose computing power speeds up training. Accelerators, however, have limited memory and limited communication bandwidth with the host machine. To train a bigger DNN under these constraints, Google researchers used model parallelism, partitioning the model and placing the partitions on different accelerators.



To maximize accelerator efficiency, GPipe partitions a model across several accelerators and automatically splits a “mini-batch” of training samples into smaller “micro-batches.” Pipelining execution across the micro-batches lets the accelerators operate in parallel.
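The pipelining idea can be illustrated with a toy schedule. The sketch below is not GPipe's implementation: it assumes a forward pass only, that every partition takes one time step per micro-batch, and that communication is free. The function names are hypothetical and exist only for this illustration. It compares a naive schedule, where one accelerator works at a time, with a pipelined one, where partition k starts micro-batch m at step m + k.

```python
def naive_schedule(num_stages, num_micro_batches):
    """Each stage handles the whole mini-batch before the next stage starts."""
    total_steps = num_stages * num_micro_batches
    grid = [[None] * num_stages for _ in range(total_steps)]
    for k in range(num_stages):
        for m in range(num_micro_batches):
            grid[k * num_micro_batches + m][k] = m
    return grid


def pipeline_schedule(num_stages, num_micro_batches):
    """Stage k works on micro-batch m at step m + k (forward pass only)."""
    total_steps = num_micro_batches + num_stages - 1
    grid = [[None] * num_stages for _ in range(total_steps)]
    for m in range(num_micro_batches):
        for k in range(num_stages):
            grid[m + k][k] = m
    return grid


def utilization(grid):
    """Fraction of (step, stage) slots in which a stage is busy."""
    busy = sum(cell is not None for row in grid for cell in row)
    return busy / (len(grid) * len(grid[0]))


def render(grid):
    """One text row per stage; '.' marks an idle step."""
    num_stages = len(grid[0])
    return [
        "".join("." if row[k] is None else str(row[k]) for row in grid)
        for k in range(num_stages)
    ]


if __name__ == "__main__":
    for line in render(pipeline_schedule(4, 4)):
        print(line)
    print(utilization(naive_schedule(4, 4)))                # 0.25
    print(round(utilization(pipeline_schedule(4, 4)), 3))   # 0.571
```

Under these assumptions, four partitions and four micro-batches keep each accelerator busy 4 of 7 steps instead of 4 of 16. With M micro-batches and K partitions the idle fraction is (K - 1) / (M + K - 1), so it shrinks as the mini-batch is split more finely.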



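Because gradients are accumulated across all micro-batches of a mini-batch before the weights are updated, the number of partitions does not affect model quality. The following figure contrasts naive model parallelism with GPipe's pipelined schedule.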

Top: The naive model parallelism strategy leads to severe underutilization due to the sequential nature of the network: only one accelerator is active at a time. Bottom: GPipe divides the input mini-batch into smaller micro-batches, enabling different accelerators to work on separate micro-batches at the same time.
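The claim that accumulating gradients across micro-batches leaves the update unchanged can be checked on a toy problem. The sketch below is an illustration only, not GPipe code: it assumes a one-parameter linear model with a squared-error loss, made-up data, and hypothetical helper names. It holds because the loss is a sum of independent per-sample terms; layers that compute statistics over the batch, such as batch normalization, are outside the scope of this sketch.

```python
import math


def grad_sum(w, samples):
    """Sum over samples of d/dw (w * x - y) ** 2."""
    return sum(2.0 * (w * x - y) * x for x, y in samples)


def split_into_micro_batches(samples, num_micro_batches):
    """Split samples into contiguous chunks of near-equal size."""
    size = math.ceil(len(samples) / num_micro_batches)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def accumulated_gradient(w, samples, num_micro_batches):
    """Accumulate per-micro-batch gradients, then average once per mini-batch."""
    total = 0.0
    for micro_batch in split_into_micro_batches(samples, num_micro_batches):
        total += grad_sum(w, micro_batch)
    return total / len(samples)


# Made-up (x, y) pairs, roughly following y = 2x.
samples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8),
           (5.0, 10.1), (6.0, 12.2), (7.0, 13.8), (8.0, 16.1)]
w = 0.5
full_batch = grad_sum(w, samples) / len(samples)

for n in (1, 2, 4, 8):
    # Equal up to floating-point rounding, whatever the split.
    assert math.isclose(accumulated_gradient(w, samples, n), full_batch, rel_tol=1e-9)
```

The weights are updated once per mini-batch from the averaged sum, so how the mini-batch was split has no effect on the step taken in this setting.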

Experiment results

To show GPipe’s capability in maximizing memory allocation for model parameters, the research team ran experiments on Cloud TPUv2s, each with 8 accelerator cores and 64 GB of memory. Researchers tested the model with and without GPipe. Without GPipe, a single accelerator can train at most 82 million model parameters under the memory limits. With GPipe, intermediate activation memory fell from 6.26 GB to 3.46 GB, making room for 318 million parameters on a single accelerator.
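For a sense of scale, the reported figures can be turned into ratios. The short sketch below uses only the numbers quoted above; the helper names are hypothetical.

```python
def relative_reduction(before, after):
    """Fraction by which a quantity shrank."""
    return 1.0 - after / before


def growth_factor(before, after):
    """How many times larger the new value is."""
    return after / before


# Figures quoted in the article.
activation_reduction = relative_reduction(6.26, 3.46)  # activation memory, GB
parameter_growth = growth_factor(82, 318)              # millions of parameters

print(f"{activation_reduction:.1%}")  # 44.7%
print(f"{parameter_growth:.2f}x")     # 3.88x
```

In other words, activation memory dropped by a little under half, and the largest model that fits on one accelerator grew almost fourfold.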

Speedup of AmoebaNet-D using GPipe. This model could not fit into one accelerator. The baseline naive-2 is the performance of the naive partition approach when the model is split into two partitions. Pipeline-k refers to the performance of GPipe that splits the model into k partitions with k accelerators.

To test efficiency, the research team also measured GPipe’s effect on AmoebaNet-D model throughput. Because at least two accelerators are required to accommodate the model, the baseline is a naive case with two partitions and no pipeline parallelization. The team observed a nearly linear speedup in training: compared to the naive two-partition approach, distributing the model across four times as many accelerators with GPipe achieved a 3.5x speedup.
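"Nearly linear" can be made concrete as a scaling efficiency: the measured speedup divided by the factor by which the hardware grew. The sketch below uses only the two figures quoted above; the function name is hypothetical.

```python
def scaling_efficiency(speedup, accelerator_ratio):
    """Measured speedup as a fraction of ideal linear scaling."""
    return speedup / accelerator_ratio


# 3.5x speedup on four times as many accelerators as the naive baseline.
efficiency = scaling_efficiency(3.5, 4)
print(f"{efficiency:.1%}")  # 87.5%
```

A value of 100% would mean perfectly linear scaling; the reported result reaches 87.5% of that ideal.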



The research team also trained an AmoebaNet-B with 557 million model parameters and an input image size of 480 x 480 on the ImageNet ILSVRC-2012 dataset. The model was divided into four partitions, and parallel training was applied to both model and data. The network reached a state-of-the-art 84.3% top-1 and 97% top-5 single-crop validation accuracy without any external data.

The paper GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism is on arXiv, and the GPipe code is available on GitHub.