In this section, I give a brief background on the concept of optimization and the basics of physics. Please note that I am not an expert in the intricate details of physics, so the insights used for deriving the new equation may not be fully accurate from a physics standpoint. However, for us Machine Learning folks, this still provides a good direction to move forward, since the experimental results show abundant promise.

Mathematical optimization

The concept of Mathematical Optimization forms a distinct and quite vast field of study. Covering all the algorithms used in this field is out of scope for this article; the Wikipedia page on mathematical optimization summarizes many of them.

In Machine Learning, we concentrate mainly on the technique called Gradient Descent. Various modifications of its parameter update have been suggested over the years. The current standard variant of the original gradient descent, the one we all use, is the Adam optimizer (Kingma and Ba, 2015).

To describe the process briefly: there is a function J(w), commonly known as the objective function, loss function, or cost function. The goal is to update the parameters w so that the function attains a minimum value. The original gradient descent algorithm proposes the following update rule:

w = w - (alpha * gradient)

where alpha is the manually set learning rate and gradient is the derivative of the objective function J(w) with respect to the parameter w.
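To make the rule concrete, here is a minimal sketch of this loop in Python. The toy objective J(w) = w^2 and its derivative are illustrative choices of mine, not something taken from any particular source:

def J(w):
    # Toy objective / loss function: J(w) = w^2, minimized at w = 0.
    return w ** 2

def grad_J(w):
    # Derivative of the objective with respect to w.
    return 2 * w

alpha = 0.1   # manually set learning rate
w = 5.0       # initial parameter value

for step in range(100):
    w = w - alpha * grad_J(w)   # the update rule above

print(w)   # w ends up very close to 0, the minimizer of J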

The update equations for the Adam optimizer are:

m_0 <- 0 (Initialize 1st moment vector)

v_0 <- 0 (Initialize 2nd moment vector)

t <- 0 (Initialize timestep)

t <- t + 1

lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)

m_t <- beta1 * m_{t-1} + (1 - beta1) * g

v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g

variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
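These equations translate almost line-for-line into code. Below is a minimal Python/NumPy sketch of the Adam loop under my own assumptions: grad_fn is a hypothetical function returning the gradient g at the current parameters, and the defaults for beta1, beta2, and epsilon are the ones recommended by Kingma and Ba (2015):

import numpy as np

def adam(grad_fn, w, learning_rate=0.001, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_steps=1000):
    m = np.zeros_like(w)   # 1st moment vector
    v = np.zeros_like(w)   # 2nd moment vector
    for t in range(1, num_steps + 1):   # timestep starts at 1
        g = grad_fn(w)
        lr_t = learning_rate * np.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        w = w - lr_t * m / (np.sqrt(v) + epsilon)
    return w

# Example: minimize J(w) = ||w||^2, whose gradient is 2w.
print(adam(lambda w: 2 * w, np.array([5.0, -3.0]),
           learning_rate=0.1, num_steps=500))   # both components approach 0

Note how lr_t folds the bias correction of both moment estimates into the step size, exactly as in the pseudocode above.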

The learning rate is the most important hyperparameter: if it is not tuned properly, the algorithm can diverge instead of converging. Removing such hyperparameters from the update equations was therefore one of the motivations for developing the new equation.
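A quick worked example of that failure mode, again on the toy objective J(w) = w^2 of my own choosing: the update w = w - alpha * 2w simplifies to w = (1 - 2 * alpha) * w, so any alpha > 1 gives |1 - 2 * alpha| > 1 and |w| grows at every step:

w = 1.0
alpha = 1.1   # larger than 1, so 1 - 2 * alpha = -1.2 and |w| grows
for step in range(10):
    w = w - alpha * 2 * w   # the gradient of w^2 is 2w
print(w)   # about 6.19: |w| grew instead of shrinking toward 0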