Gaussian Error Linear Unit Activates Neural Networks Beyond ReLU

Synced · Jan 3

The Gaussian Error Linear Unit (GELU) activation function was introduced in 2016 by UC Berkeley’s Dan Hendrycks and Kevin Gimpel of the Toyota Technological Institute at Chicago. An activation function is the “switch” that triggers neuron output, and its importance has grown as networks have deepened. In recent weeks a number of discussions in the machine learning community have brought GELU back into the spotlight.

Early artificial neurons used binary threshold units. These hard binary decisions were later smoothed with sigmoid activations, which gave a neuron a “firing rate” interpretation and allowed it to train with backpropagation. As networks grew deeper, however, training with sigmoids proved less effective than training with the non-smooth ReLU (Rectified Linear Unit), which became the most popular activation function. ReLU makes a hard gating decision based on an input’s sign.

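Hendrycks and Gimpel proposed GELU as a nonlinear activation function that relates to stochastic regularizers: it is the expectation of a modified form of adaptive dropout, which suggests a more probabilistic view of a neuron’s output. Concretely, GELU(x) = x·Φ(x), where Φ is the cumulative distribution function of the standard Gaussian, so an input is weighted by its value rather than gated by its sign.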
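The link between GELU and a stochastic regularizer can be checked numerically. The sketch below is illustrative and uses only the Python standard library: it computes the exact form x·Φ(x), the tanh approximation given in the paper, and a Monte Carlo average of x·m with m drawn from Bernoulli(Φ(x)). The function names and the sampling setup are assumptions made for this example, not code from the authors.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU given in the paper."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def stochastic_gate_mean(x: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo mean of x * m, where m ~ Bernoulli(Phi(x)).

    Hypothetical helper for illustration: the input is kept with
    probability Phi(x) and zeroed otherwise, then averaged.
    """
    rng = random.Random(seed)
    keep_prob = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    kept = sum(1 for _ in range(trials) if rng.random() < keep_prob)
    return x * kept / trials


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu(x):+.4f}  "
          f"tanh={gelu_tanh(x):+.4f}  sampled={stochastic_gate_mean(x):+.4f}")
```

The sampled column should land close to the exact column, which is the sense in which GELU is the expected value of a randomly gated input.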

In computer vision, natural language processing, and automatic speech recognition tasks, models using GELU match or exceed the performance of models using either ReLU or its more advanced variant ELU (Exponential Linear Unit). GELU is the activation function used in BERT, RoBERTa, ALBERT and other top NLP models.

CDF of N(µ, σ²) for GELU, ReLU, and ELU.
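The three activations named in this caption can also be compared numerically. The following is a minimal sketch using the standard definitions with µ = 0, σ = 1 for GELU and α = 1 for ELU; the sample points are chosen for illustration only.

```python
import math


def gelu(x: float) -> float:
    """GELU with mu = 0, sigma = 1: x times the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x: float) -> float:
    """ReLU: pass positive inputs, zero out the rest."""
    return max(0.0, x)


def elu(x: float, alpha: float = 1.0) -> float:
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


print(f"{'x':>5} {'GELU':>8} {'ReLU':>8} {'ELU':>8}")
for x in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"{x:>5.1f} {gelu(x):>8.4f} {relu(x):>8.4f} {elu(x):>8.4f}")
```

Unlike ReLU, GELU dips slightly below zero for moderately negative inputs and is not monotonic there; for large positive inputs all three functions approach the identity.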

The researchers compared GELU, ReLU, and ELU on the MNIST classification task (grayscale images in 10 classes, with 60k training examples and 10k test examples). They used fully connected neural networks with GELUs (µ = 0, σ = 1), ReLUs, and ELUs (α = 1). Each 8-layer, 128-neuron-wide network was trained for 50 epochs with a batch size of 128. In these runs the GELU tended to reach the lowest median training loss, both with and without dropout.
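To make the network shape concrete, here is a rough forward-pass sketch of such a classifier in plain Python. Several details are assumptions made for illustration rather than taken from the paper: “8-layer” is read as eight hidden layers followed by a 10-way output layer, the weights use a simple scaled Gaussian initialization, and the input is random noise standing in for a flattened 28×28 image. No training is performed.

```python
import math
import random


def gelu(x):
    """GELU with mu = 0, sigma = 1."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def init_layer(rng, n_in, n_out):
    """Return (weights, biases) for one dense layer (assumed initialization)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(layer, inputs):
    """Affine map: one dot product plus bias per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def build_mlp(rng, n_in=784, width=128, depth=8, n_classes=10):
    """Stack `depth` hidden layers of `width` units plus an output layer."""
    sizes = [n_in] + [width] * depth + [n_classes]
    return [init_layer(rng, a, b) for a, b in zip(sizes, sizes[1:])]


def forward(layers, pixels):
    """Apply GELU after every hidden layer; return raw class logits."""
    h = pixels
    for layer in layers[:-1]:
        h = [gelu(v) for v in dense(layer, h)]
    return dense(layers[-1], h)


rng = random.Random(0)
mlp = build_mlp(rng)
fake_image = [rng.random() for _ in range(784)]  # stand-in for one MNIST image
logits = forward(mlp, fake_image)
print(len(mlp), len(logits))  # prints "9 10": eight hidden layers plus output, ten logits
```

Swapping `gelu` for a ReLU or ELU function in `forward` gives the other two variants of the same architecture.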

MNIST classification results.

MNIST robustness results.

The researchers also ran a phone recognition task on the TIMIT dataset, which contains recordings of 680 speakers in a noiseless environment. The system is a five-layer, 2048-neuron-wide classifier with 39 output phone labels and a dropout rate of 0.5. The median test error, taken at the point of lowest validation error, was 29.3 percent for the GELU, 29.5 percent for the ReLU, and 29.6 percent for the ELU.

TIMIT phone-based speech recognition classification.

In the CIFAR-10/100 classification tests, which use colour images in 10 or 100 classes with 50k training and 10k test examples, the researchers used 5,000 validation samples to tune the initial learning rate over {10^−3, 10^−4, 10^−5}, and then trained again on the entire training set with the learning rate selected on the validation set. They optimized with Adam for 200 epochs, and the learning rate decayed to zero on the 100th epoch. On CIFAR-10, the GELU scored a median error rate of 7.89 percent, ReLU scored 8.16 percent, and the ELU 8.41 percent.

CIFAR-10 results.

Across these experiments GELU matched or outperformed ReLU and ELU in every case, and it can be considered a viable alternative to previous nonlinearities.

The paper Gaussian Error Linear Units (GELUs) is on arXiv.