ADAM (Adaptive Moment Estimation) is, as the name suggests, an adaptive optimization algorithm targeted mainly at training deep neural networks.

It is special in a few ways:

1. It combines the benefits of previous state-of-the-art techniques and eliminates most of their drawbacks.

2. It is robust to vanishing and exploding gradients, for reasons we'll discuss later.

3. It works great for non-stationary objective functions (functions whose statistical properties may change over time).

4. It is efficient at optimizing very high-dimensional parameter spaces, thanks to its use of lower-order moments.

5. Unlike other algorithms that in some cases perform comparably, ADAM has very low memory requirements.

The Algorithm itself is really easy to understand:

ADAM Algorithm Pseudo Code
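In case the pseudocode image doesn't load, here is a minimal NumPy sketch of the same update loop; the toy gradient function at the bottom is just an illustrative placeholder, not something from the paper:

    import numpy as np

    def adam(grad, theta0, alpha=0.001, beta1=0.9, beta2=0.999,
             eps=1e-8, n_steps=1000):
        """Sketch of ADAM; the defaults are the paper's suggested values."""
        theta = theta0.astype(float)
        m = np.zeros_like(theta)   # first moment vector, initialised to 0
        v = np.zeros_like(theta)   # second raw moment vector, initialised to 0
        for t in range(1, n_steps + 1):
            g = grad(theta)                      # (stochastic) gradient at step t
            m = beta1 * m + (1 - beta1) * g      # biased first moment estimate
            v = beta2 * v + (1 - beta2) * g**2   # biased second raw moment estimate
            m_hat = m / (1 - beta1**t)           # bias-corrected first moment
            v_hat = v / (1 - beta2**t)           # bias-corrected second moment
            theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
        return theta

    # Toy example: minimise f(theta) = ||theta||², whose gradient is 2·theta
    print(adam(lambda th: 2 * th, np.array([3.0, -2.0])))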

Now let's break down the important terms.

1. β₁ and β₂ – These are the hyperparameters that decide the weight given to the previous moment estimate, and thus (1 − β₁) to the current gradient. Similarly, β₂ decides the same for vₜ, i.e. the second raw moment, or as I like to call it, "velocity", to relate it to intuitions from physics (we'll discuss this intuition later).

2. mₜ – This is the first moment vector, or exponentially weighted mean of the gradients. Basically, what you are doing here is keeping past gradients in consideration while updating with new ones, so that your updates generalize better in the case of sparse gradients. Where does this exponential term even come from, you may ask? As it turns out, if you solve summations involving previous contiguous terms, like the Fibonacci sequence, exponential terms arise rather naturally and beautifully in the solution. For more details you can refer to this Linear Algebra lecture by Prof. Gilbert Strang.
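To see where the exponential weights sit, unroll the recursion mₜ = β₁·mₜ₋₁ + (1 − β₁)·gₜ for a few steps; in LaTeX notation:

    m_t = (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i

Each past gradient g_i is kept around, but its weight decays exponentially with its age t − i.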

3. vₜ – This is called the second raw moment. It is basically like a variance: mₜ acts as the mean, while vₜ is an exponentially weighted sum of squared gradients, essentially a squared L2 norm. We will build a better intuition about it in the next section.

4. m̂ and v̂ – These are the bias-corrected first and second moments. As m and v are initialized to zeros, if the β values are large (i.e. tending to 1), the estimates will be heavily biased towards zero in the first few steps. The (1 − βᵗ) terms rescale the vectors to remove this bias. But why this specific term? Did god just slip him a note and say, "Mate! Use this, use this!"? In the words of every single secret agent's testimony, I can neither confirm nor deny it. But the authors did give a mathematical proof of it, which you can check in the paper. It involves geometric series.
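A one-line version of that reasoning, under the simplifying assumption that the gradient distribution is roughly stationary (so E[gᵢ] ≈ E[g]): taking expectations of the unrolled sum gives a geometric series,

    \mathbb{E}[m_t] = \mathbb{E}[g] (1 - \beta_1) \sum_{i=1}^{t} \beta_1^{t-i} = \mathbb{E}[g] (1 - \beta_1^t)

so dividing mₜ by (1 − β₁ᵗ) makes the estimate unbiased; the same argument with β₂ and g² gives the correction for vₜ.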

Intuition for using β

β relates to the number of steps (Andrew Ng phrases it as "days", from his temperature-averaging example) over which you average the gradients. As I already mentioned, this averaging helps stabilise learning by protecting it from sparse gradients' wrath. You said it's related to a number of days? Give some numbers! Sure. The number of steps you are averaging over is roughly 1/(1 − β). If β = 0.9, that is about 10 steps. A quick sanity check rather than a proof: β^(1/(1−β)) ≈ 1/e ≈ 0.37, so contributions older than about 1/(1 − β) steps have decayed to roughly a third of their original weight. Having this general intuition helps a lot.

Now, talking about the practical values of β₁ and β₂ you should choose, here's a screengrab from Andrew Ng's lecture, where he recommends values based on his experience. They match the paper's defaults: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.

Coursera – Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

The reasoning behind using Moments

Let's imagine a 3D volume in which the horizontal plane represents two parameters of our cost function and the vertical axis is the cost. With networks like VGG and ResNet, you will generally get a surface something like this:

Source: VGG-56 Cost Surface

For building intuition, we'll use a simpler convex surface, which nowadays can be seen in some state-of-the-art networks like DenseNet-121, thanks to skip connections.

Source : DenseNet-121 Cost Surface

Let's imagine a ball rolling down this surface. Had the surface been spherical, the ball would always point and roll straight towards the global minimum, but we are not so lucky, are we? On a skewed surface, the ball initially follows the gradient, but after attaining some velocity, its momentum keeps it moving in the previous direction. Momentum damps oscillations in directions of high curvature by combining gradients with opposite signs, and thus keeps the ball going along gentle but consistent gradients. So the momentum vector averages out the sideways fluctuations and provides the general direction of movement, as in the sketch below.
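Here is a minimal sketch of classical (heavy-ball) momentum; the skewed quadratic objective is an assumed toy example, steep in one direction and gentle in the other:

    import numpy as np

    def sgd_momentum(grad, theta0, lr=0.01, mu=0.9, n_steps=500):
        """The 'rolling ball': a running velocity accumulates the gradients."""
        theta = theta0.astype(float)
        velocity = np.zeros_like(theta)
        for _ in range(n_steps):
            velocity = mu * velocity + grad(theta)  # opposite-sign oscillations cancel here
            theta -= lr * velocity                  # consistent directions keep adding up
        return theta

    # Skewed bowl: curvature 20 along x, 0.2 along y; momentum damps the x wiggles
    print(sgd_momentum(lambda th: np.array([20 * th[0], 0.2 * th[1]]),
                       np.array([1.0, 5.0])))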

ADAM’s Update Rule

Source: SGD with momentum update rule

A great thing about this update rule is that it bounds the step sizes so that we don't overshoot. This can be understood as establishing a trust region around the current parameter value, beyond which the current gradient estimate does not provide sufficient information.
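Written out, the update from the paper is:

    \theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

and the paper shows that in common scenarios the magnitude of each step is bounded by roughly α, which is exactly the trust-region behaviour described above.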

Another intuition mentioned in the paper is to think of the ratio m̂/√v̂ that multiplies the learning rate as a signal-to-noise ratio (SNR). As I mentioned earlier, v is like a variance, and what does variance represent? It tells us how much spread there is in our data, i.e. how much noise there is relative to the mean value. So if m/√v is small, the noise (v) is large, meaning we are uncertain about our current direction of movement. The update therefore naturally takes smaller steps in directions of high uncertainty.

You mentioned it’s robust towards diagonal rescaling. How come?

Glad you asked! If the gradients are rescaled by some factor C, then m is rescaled by C, and the velocity vector v is rescaled by C², since it accumulates squared gradients. After taking the square root of v, the C can be taken out and cancels with the C in the numerator.
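In symbols, ignoring ε, rescaling every gradient by C gives

    \frac{C\,\hat{m}_t}{\sqrt{C^2\,\hat{v}_t}} = \frac{C\,\hat{m}_t}{C\,\sqrt{\hat{v}_t}} = \frac{\hat{m}_t}{\sqrt{\hat{v}_t}}

so the effective step is unchanged.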

What is the epsilon doing there?

The epsilon is there to provide numerical stability in case the variance becomes too low, i.e. close to zero: epsilon keeps the denominator from dropping below epsilon, so the step size stays bounded.
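A tiny demonstration with made-up numbers (the values are purely illustrative):

    import numpy as np

    m_hat = np.array([0.0, 1e-4])   # bias-corrected first moments
    v_hat = np.array([0.0, 1e-20])  # second moments: exactly zero and nearly zero
    alpha, eps = 0.001, 1e-8

    step_bad = alpha * m_hat / np.sqrt(v_hat)         # [nan, 1000.] — 0/0 and a blow-up
    step_ok = alpha * m_hat / (np.sqrt(v_hat) + eps)  # [0., ~9.9] — finite and bounded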

CONVERGENCE

As the convergence analysis is very technical, I'll just skim over the important findings of that section.

In problems like logistic regression, having a dynamic learning rate helps. When the learning rate is annealed by a factor of t^(−1/2), the regret is bounded by O(√T), i.e. the average regret R(T)/T converges to zero at a rate of O(1/√T).

RELATED WORK

ADAM is built upon the ideas of previous algorithms RMSprop and AdaGrad.

There are a few important differences between RMSProp with momentum and Adam:

1. RMSProp with momentum generates its parameter updates by applying momentum to the rescaled gradient, whereas Adam's updates are directly estimated using running averages of the first and second moments of the gradient. RMSProp thus loses the benefit of taking small, non-confident steps (see the sketch after this list).

2. RMSProp has no bias-correction term, so its moment estimates are biased towards zero early in training.
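For contrast, here is one common formulation of RMSProp with momentum (this exact form is my assumption for illustration; implementations vary): momentum is applied to the already-rescaled gradient, and v gets no bias correction.

    import numpy as np

    def rmsprop_momentum_step(theta, g, mom, v,
                              lr=0.001, mu=0.9, beta2=0.999, eps=1e-8):
        v = beta2 * v + (1 - beta2) * g**2            # running average of g², uncorrected
        mom = mu * mom + lr * g / (np.sqrt(v) + eps)  # momentum on the rescaled gradient
        return theta - mom, mom, v

Compare with the Adam sketch near the top of the post: Adam averages the raw gradients first and rescales the average, which is what preserves its small, non-confident steps in noisy directions.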

Talking about AdaGrad: if you set β₁ to 0, set β₂ infinitesimally close to 1 (so that (1 − β₂) is infinitesimal), and replace α with an annealed version α·t^(−1/2), ADAM becomes the same as AdaGrad, which can efficiently deal with sparse features and gradients.
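The correspondence, as derived in the paper: with β₁ = 0, the limit β₂ → 1 turns v̂ₜ into a plain average of the squared gradients, and the t^(−1/2) annealing cancels the 1/t inside the square root:

    \lim_{\beta_2 \to 1} \hat{v}_t = \frac{1}{t} \sum_{i=1}^{t} g_i^2

    \theta_t = \theta_{t-1} - \alpha\, t^{-1/2} \frac{g_t}{\sqrt{\frac{1}{t}\sum_{i=1}^{t} g_i^2}} = \theta_{t-1} - \alpha \frac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}

which is exactly AdaGrad's update rule.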

EXPERIMENTS

1. Logistic Regression

2. Multi-Layer Neural Networks

The convergence analysis does not apply to non-convex problems, so the authors rely on empirical analysis.

The neural net is 2 layers deep, with 1000 hidden units each and the ReLU activation function.

"The sum-of-functions (SFO) method (Sohl-Dickstein et al., 2014) is a recently proposed quasi-Newton method that works with minibatches of data and has shown good performance on optimization of multi-layer neural networks. We used their implementation and compared with Adam to train such models. The figure shows that Adam makes faster progress in terms of both the number of iterations and wall-clock time. Due to the cost of updating curvature information, SFO is 5-10x slower per iteration compared to Adam, and has a memory requirement that is linear in the number of minibatches." – Original Paper

3. Convolutional Neural Networks

Although these results are great, one problem that is not mentioned is that of local optima. The loss surface of a CNN is highly non-convex, and ADAM often gets stuck in one of the valleys that is a local rather than the global optimum. Papers published in recent years, such as NADAM, somewhat address this problem.

EXTENSION: ADAMAX

In ADAM the step size is inversely proportional to the L2 norm of the gradients. In ADAMAX this L2 norm is generalized to an Lp norm. Such norms become numerically unstable when p is large, but as p tends to infinity, a very simple algorithm emerges from the mist.

Source: ADAM Paper

In ADAM the denominator was v^(1/2), but in ADAMAX, as we are using the Lp norm, it becomes v^(1/p).
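In the limit, the paper shows this exponentially weighted Lp norm collapses to a simple recursive maximum:

    u_t = \lim_{p \to \infty} (v_t)^{1/p} = \max(\beta_2\, u_{t-1},\, |g_t|)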

Source: ADAM Paper

Here’s the Algorithm

Source: ADAM Paper
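And here is a minimal NumPy sketch of that algorithm (the toy objective at the bottom is an illustrative placeholder):

    import numpy as np

    def adamax(grad, theta0, alpha=0.002, beta1=0.9, beta2=0.999, n_steps=1000):
        """Sketch of AdaMax; the defaults are the paper's suggested values."""
        theta = theta0.astype(float)
        m = np.zeros_like(theta)   # first moment vector
        u = np.zeros_like(theta)   # exponentially weighted infinity norm
        for t in range(1, n_steps + 1):
            g = grad(theta)
            m = beta1 * m + (1 - beta1) * g
            u = np.maximum(beta2 * u, np.abs(g))       # no bias correction needed here
            theta -= (alpha / (1 - beta1**t)) * m / u  # m's correction folded into the step
        return theta

    # Toy example: minimise f(theta) = ||theta||², gradient 2·theta
    print(adamax(lambda th: 2 * th, np.array([3.0, -2.0])))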

This algorithm is implemented in many of the current machine learning frameworks (along with NADAM, which we will discuss in a later post).

One of the benefits of this algorithm is that it eliminates the need for separate initialization-bias corrections: no (1 − β₂) factor appears in the update of u, so u needs no correction, and the bias correction for m can be folded directly into the learning rate, as α/(1 − β₁ᵗ).

It also has a simple trust region: each step is bounded between −α and +α.

TEMPORAL AVERAGING OF PARAMETERS

The authors also make a small note about temporal averaging of the parameters themselves. Just as we used an exponentially weighted mean of gradients to calculate the moments, previously published work (Moulines & Bach, 2011; Polyak-Ruppert averaging: Polyak & Juditsky, 1992; Ruppert, 1988) suggests that averaging parameter values over time improves generalization. Instead of the originally suggested uniform average, we will use our exponentially weighted mean for this purpose, giving higher weight to more recent parameter values.
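Concretely, the paper's averaging scheme uses the same bias-correction trick as before:

    \bar{\theta}_t = \beta_2\, \bar{\theta}_{t-1} + (1 - \beta_2)\, \theta_t, \qquad \hat{\theta}_t = \frac{\bar{\theta}_t}{1 - \beta_2^t}

The averaged parameters θ̂ₜ are what you would evaluate or deploy; the optimizer itself keeps updating the raw θₜ.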

CONCLUSION

ADAM is to date one of the best algorithms for optimization and, with slight tweaks, is widely used in nearly all machine learning frameworks. It will continue to be developed over the years until a completely new algorithm outperforms it.

Now, talking about the paper itself: it's a pretty easy read, but knowledge of multivariate calculus and statistics is required to completely understand the proofs.

References

1. Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. 2014. arXiv:1412.6980v9

2. Moulines, Eric and Bach, Francis R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, 2011.

3. Polyak, B. T. and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 1992; Ruppert, David. Efficient estimations from a slowly convergent Robbins-Monro process. Technical report, Cornell University, 1988.

Please feel free to point out any mistakes and misconceptions. Here's a cute Baby Yoda as a thank you! Have a nice day!