Implementing the gradient accumulation mechanism

Now, after deep-diving into what exactly Keras optimizers are and how they are implemented, we are ready to discuss the different implementation alternatives for a gradient accumulation mechanism.

Per-optimizer specific implementation

It is possible to rewrite any optimizer to support gradient accumulation: gradients are accumulated over a number of steps, and only then does the optimizer use them to update the model parameters. However, gradient accumulation is a generic approach and should be optimizer-independent, and this per-optimizer approach has several flaws:

Every optimizer implements a different optimization algorithm and therefore requires its own gradient-accumulation version. This is quite version-specific: code modifications are required every time an original implementation changes or a new optimizer is added. It causes code duplication and is less elegant.

A preferable approach is to design the gradient accumulation model so that it can wrap any Keras optimizer regardless of its optimization algorithm.

A generic wrapper for Keras optimizers

By having a generic gradient accumulation mechanism, changes in the original optimizers will not require code updates.

In order to design and implement a generic gradient accumulation mechanism, there are some things that need to be taken into consideration.

Why do we need a complex solution at all?

Running the optimization algorithm on every mini-batch will not result in the same updates to the model parameters. In other words, we cannot simply evaluate the optimization algorithm at every step, on every mini-batch. Otherwise, there would be no need for gradient accumulation, and we could have just used a smaller batch size.

If we were to use the global batch, all the gradients would be calculated using the same values of the model parameters, the weights and biases. When splitting the global batch into several mini-batches, evaluating the optimization algorithm at every step causes the model parameters to be updated after every mini-batch. This means that the gradients of the different mini-batches will not be calculated using the same values of the weights and biases.
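To make this concrete, here is a tiny numeric sketch. The squared-error loss and all the values are a toy illustration of my own, not taken from the original code:

```python
# Toy loss per sample: (w - x)**2, so the per-sample gradient is 2 * (w - x).

def grad(w, x):
    return 2.0 * (w - x)

samples = [1.0, 3.0]   # a "global batch" split into two mini-batches of one sample
w0, lr = 0.0, 0.1

# Global batch: both gradients are computed at the same weights.
same_weights = [grad(w0, x) for x in samples]   # [-2.0, -6.0]

# Naive per-mini-batch optimization: the weights move between mini-batches,
# so the second gradient is computed at different weights.
w = w0
stepped = []
for x in samples:
    g = grad(w, x)
    stepped.append(g)
    w -= lr * g                                 # stepped ends up [-2.0, -5.6]
```

The second gradient differs (-5.6 instead of -6.0) precisely because the weights were already updated after the first mini-batch.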

In addition, optimizers use various parameters as part of their formula, and those parameters are updated as part of the evaluation of the optimization algorithm. Updating those parameters every step — after every mini-batch — will result in changes in the state of the optimization algorithm between different mini-batches.

So what do we do?

Our wrapper is a Python class that inherits from Keras's base Optimizer class. It receives the original optimizer as an argument upon creation (in __init__()), as well as the number of steps we want to accumulate gradients over.

We define all the methods exposed by optimizers (get_gradients(), set_weights(), get_weights(), etc.) and transparently call the original optimizer's respective methods. The main logic lives, as expected, in get_updates().
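Stripped of Keras specifics, the delegation pattern looks roughly like the following sketch. The class name is illustrative, and the real wrapper also inherits from Keras's base Optimizer class and forwards more methods:

```python
class AccumulationWrapper:
    """Sketch of the wrapper's delegation pattern (illustrative, not the
    real runai class, which inherits from Keras's base Optimizer)."""

    def __init__(self, optimizer, steps):
        self.optimizer = optimizer  # the wrapped (original) optimizer
        self.steps = steps          # number of steps to accumulate gradients over

    # Methods other than get_updates() are forwarded transparently:
    def get_gradients(self, loss, params):
        return self.optimizer.get_gradients(loss, params)

    def get_weights(self):
        return self.optimizer.get_weights()

    def set_weights(self, weights):
        self.optimizer.set_weights(weights)

    def get_updates(self, loss, params):
        # The main accumulation logic lives here (covered below).
        raise NotImplementedError
```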

Let’s start examining get_updates() (can be seen on GitHub as well), and deep-dive into the algorithm and implementation:

Gradient-accumulation wrapper’s get_updates()

Calculating the gradients

The first line (line 2) should look familiar: we calculate the gradients in the same way other optimizers do. Note that grads holds the gradients of the current mini-batch, re-evaluated at every step.

Declaring helper tensors

In line 5, we declare a step counter, called iterations, with an initial value of 0 (much like other optimizers do). We use the step counter to tell whether we are at the first or the last step of the accumulation. To do so, we declare two tensors: first and last (lines 6-7).

first will be True at the beginning of every accumulation cycle, i.e. whenever iterations % self.steps equals 0. For example, if we accumulate over five steps, this will be the case at the first step (index 0), the sixth step (index 5), the eleventh step (index 10), and so on. In those steps we want to reset the accumulated gradients to 0 and start accumulating again.

last will be True at every step in which we want to update the variables. Technically, this is when iterations % self.steps equals self.steps - 1. Continuing the example from before, this will be the case at the fifth step (index 4), the tenth step (index 9), and so on.
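The arithmetic behind first and last can be sketched in pure Python (the real code builds the equivalent boolean tensors from the iterations counter in the graph):

```python
def accumulation_flags(iterations, steps):
    """Pure-Python analogue of the first/last tensors."""
    first = iterations % steps == 0           # start of an accumulation cycle
    last = iterations % steps == steps - 1    # time to apply the updates
    return first, last

# Accumulating over five steps:
# step 0 -> first, step 4 -> last, step 5 -> first again, step 9 -> last, ...
```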

The accumulated gradients

In line 10, we declare variables to hold the values of the accumulated gradients between steps. We declare such a variable for every model parameter (every trainable weight or bias) with the shape and type of that parameter, and initialize them with zeros.

Using those variables, in line 13 we declare tensors, agrads, that hold the values of the accumulated gradients at every step. We use first to tell whether we should start accumulating the gradients from scratch or build on the gradients accumulated in the previous steps. If first is True, meaning a new accumulation cycle starts now, we use the gradients of the current mini-batch alone. If first is False, meaning we should use the gradients accumulated over the past steps, we add the gradients of the current mini-batch to vagrads. This control flow (checking the value of first) is built into the graph using K.switch().
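The accumulation logic can be simulated in pure Python. The function below is an analogue of the K.switch() expression, not the graph code itself:

```python
def accumulated(vagrad, grad, first):
    # Pure-Python analogue of: agrad = K.switch(first, grad, vagrad + grad)
    # On the first step of a cycle, discard the old accumulation;
    # otherwise, add the current mini-batch gradient to it.
    return grad if first else vagrad + grad

# Simulate one parameter over five mini-batches:
steps = 5
vagrad = 0.0
for iterations, grad in enumerate([1.0, 2.0, 3.0, 4.0, 5.0]):
    first = iterations % steps == 0
    vagrad = accumulated(vagrad, grad, first)
# vagrad is now 15.0 -- the sum of the five mini-batch gradients
```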

Ok, that’s great, but what about the optimization algorithm itself?

As a generic wrapper, we don’t implement any optimization algorithm. The original optimizer is responsible for that. As we covered, every optimizer implements its mathematical formula in get_updates() . There, the optimizer manages and uses all the needed parameters for the formula (e.g. step counter, learning rate, momentums, etc…). The optimizer stores the values of the parameters in dedicated variables, and every time a parameter needs to be updated, it assigns the new value to its dedicated variable.

The method get_updates() is called once, generating “Assign” tensors that will be evaluated in every step. Some of them are the updates of the model parameters, and the other ones are updates to the optimizer parameters.

As long as we are accumulating gradients, we don't want any of these updates to happen. We don't want the model parameters to be updated, so that all the mini-batches start from the exact same point, i.e. with the same values of the weights and biases. We don't want the optimizer parameters to be updated, so that the optimizer advances at the same pace as if it were running on the global batch. For example, we want the step counter to increase only after all the mini-batches have passed, so that the learning rate is adjusted at the correct rate.

We want all the updates to take place only after all the mini-batches have passed. Technically, this means we want the updates to occur at the last step of the accumulation — when last is True .

So, if we could simply call the original optimizer's get_updates() while (1) making it use the accumulated gradients, and (2) causing all the variable updates to take place only at the last step of the accumulation, we would achieve exactly what we want.

Fortunately, replacing (hooking) methods is really easy in Python, and by replacing a few methods with a different implementation we can easily achieve exactly that.
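Here is a minimal sketch of such a method-replacement hook as a context manager. It is loosely modeled on the idea behind runai.utils.Hook, but the class below is my own simplified illustration:

```python
class Hook:
    """Minimal method-replacement hook (a simplified sketch)."""

    def __init__(self, owner, name, replacement):
        self.owner, self.name, self.replacement = owner, name, replacement

    def __enter__(self):
        # Save the original method and install the replacement.
        self.original = getattr(self.owner, self.name)
        setattr(self.owner, self.name, self.replacement)
        return self

    def __exit__(self, *exc_info):
        # Always restore the original method, even on exceptions.
        setattr(self.owner, self.name, self.original)


class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
with Hook(g, "greet", lambda: "hooked"):
    hooked_result = g.greet()   # the replacement runs inside the block
restored_result = g.greet()     # the original method is back afterwards
```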

Optimizers call their get_gradients() from get_updates() to calculate the gradients of the parameters with respect to the loss. Therefore, we replace the optimizer's get_gradients() with a function that does nothing but return the accumulated gradients (agrads, the tensors we generated in line 13). This causes the original optimizer to use the accumulated gradients in its algorithm and solves (1). Let's take a look at a simplified implementation of such a replacement method:

A simplified version of the replacement for get_gradients()
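A sketch of what such a replacement can look like (the function and factory names here are illustrative; the actual implementation is in the runai repository on GitHub):

```python
def make_get_gradients(agrads):
    """Build a replacement for the wrapped optimizer's get_gradients()."""
    def get_gradients(loss, params):
        # Ignore loss and params: hand the optimizer the accumulated
        # gradient tensors instead of freshly computed per-mini-batch ones.
        return agrads
    return get_gradients
```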

Regarding (2), variables in Keras are assigned using three methods: K.update(), K.update_add(), and K.update_sub(). Optimizers use these methods for all updates, for the model parameters as well as for the optimizer parameters. We replace all three of them (the full implementation can be seen on GitHub). We want all tensors created by those methods to assign values only in the last mini-batch and to do nothing otherwise. Therefore, our replacement methods wrap every value being assigned with a conditional switch and pass the result to the respective original method. If this is the last mini-batch (last is True), we assign the actual value to the variable; otherwise, we assign a value that does not affect it. For K.update_add() and K.update_sub() we assign zero, so the variable is not actually increased or decreased. For K.update() we assign the current value of the variable, so it keeps its value. Let's take a look at a simplified implementation of such replacement methods:

A simplified version of the replacement for the assignment methods
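The logic of the three replacements can be expressed as pure-Python analogues. The switch() helper below stands in for K.switch(); the real replacements build graph tensors and call the original backend methods:

```python
def switch(cond, then_value, else_value):
    # Pure-Python stand-in for K.switch().
    return then_value if cond else else_value

def update(var, value, last):
    # K.update() analogue: assign the new value only at the last step;
    # otherwise re-assign the variable's current value (a no-op).
    return switch(last, value, var)

def update_add(var, increment, last):
    # K.update_add() analogue: add zero unless this is the last step.
    return var + switch(last, increment, 0)

def update_sub(var, decrement, last):
    # K.update_sub() analogue: subtract zero unless this is the last step.
    return var - switch(last, decrement, 0)
```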

Back to our get_updates() , in lines 15–18 we actually replace all those methods. We use helper classes — subclasses of runai.utils.Hook — to do so.

In line 19 we call the original optimizer’s get_updates() . With all those methods replaced, we (1) make it refer to the accumulated gradients, and (2) cause all updates (“Assign” ops) to take place only when last is True .

Updating our parameters

There are two more things we have to do at the end of every step.

First, we have to update our variables to hold the current value of the accumulated gradients. This is done in line 33.