This is in continuation of my previous post: https://expoundai.wordpress.com/2019/04/27/maxima-vs-minima-and-global-vs-local/

What is optimization?

Maximizing or minimizing some function relative to some set, often representing a range of choices available in a certain situation. The function allows comparison of the different choices for determining which might be “best.”

Common applications: Minimal cost, maximal profit, minimal error, optimal design, optimal management, variational principles.

In mathematics, computer science and operations research, mathematical optimization (alternatively spelled optimisation) or mathematical programming is the selection of a best element (with regard to some criterion) from some set of available alternatives.

Optimization is a vast ocean in itself and is extremely interesting. In the context of deep learning, the optimization objective is to minimize the cost function with respect to the model parameters, i.e. the weight matrices.

Gradient: In vector calculus, the gradient is the multi-variable generalization of the derivative. The gradient of a scalar function f is denoted by ∇f, where ∇ (the nabla symbol) is known as the del operator. It packages all the partial-derivative information into a vector.

∇f is a vector-valued function and points in the direction of steepest ascent. Imagine you are standing at some point in the input space of f; the gradient vector tells you which direction you should travel to increase the value of f most rapidly.
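A minimal sketch of this idea, using a central-difference approximation (the function f and the helper numerical_gradient below are my own illustration, not from the post): for f(x, y) = x² + 3y², the gradient is (2x, 6y), and at the point (1, 2) the numerical estimate should match (2, 12).

```python
import numpy as np

def f(p):
    """A simple scalar function of two variables: f(x, y) = x**2 + 3 * y**2."""
    x, y = p
    return x**2 + 3 * y**2

def numerical_gradient(f, p, h=1e-6):
    """Approximate the gradient of f at p with central differences."""
    grad = np.zeros_like(p, dtype=float)
    for i in range(len(p)):
        step = np.zeros_like(p, dtype=float)
        step[i] = h                       # nudge only the i-th coordinate
        grad[i] = (f(p + step) - f(p - step)) / (2 * h)
    return grad

p = np.array([1.0, 2.0])
g = numerical_gradient(f, p)   # analytically (2x, 6y) = (2.0, 12.0)
print(g)
```

Moving from p in the direction of g increases f fastest; moving against g decreases it fastest, which is exactly what gradient descent exploits.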

If you are interested in learning more, Khan Academy is a wonderful place to learn about gradients and mathematics in general:

https://www.khanacademy.org/math/multivariable-calculus/multivariable-derivatives/gradient-and-directional-derivatives/v/gradient

Gradient Descent: This is an iterative optimization technique where you move towards a minimum step by step. The steps involved are as follows:

1. Guess a starting point (randomly or through some initialization technique).
2. Choose a step size, which in deep learning is called the learning rate and is one of the most important hyper-parameters.
3. Calculate the gradient of the function at the current point.
4. Move in the opposite direction by subtracting the scaled gradient from the current point. This is because we want to descend, and the gradient gives you the steepest ascent direction.
5. Repeat steps 3 and 4 for n iterations or until a stopping criterion is reached.
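The steps above can be sketched in a few lines of Python (a toy one-dimensional example of my own, not from the post, using f(x) = (x − 3)², whose minimum is at x = 3):

```python
def f_prime(x):
    """Derivative of f(x) = (x - 3)**2, whose minimum is at x = 3."""
    return 2 * (x - 3)

def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100, tol=1e-8):
    """Repeatedly step against the gradient until n_steps or tol is reached."""
    x = x0                             # step 1: initial guess
    for _ in range(n_steps):           # step 5: repeat
        g = grad(x)                    # step 3: gradient at current point
        if abs(g) < tol:               # stop criterion: gradient nearly zero
            break
        x = x - learning_rate * g      # step 4: move opposite the gradient
    return x

x_min = gradient_descent(f_prime, x0=10.0)
print(x_min)   # close to 3.0
```

Even starting far away at x = 10, the iterates shrink their distance to the minimum by a constant factor each step and settle near x = 3.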

In equation form, for deep learning applications it can be written as:

θ ← θ − α ∇J(θ)

Here α is the learning rate and governs how much you dampen the gradient value while taking the step, θ is a network parameter, and J is the cost function.
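As a quick numeric check of the update rule (toy numbers of my own, not from the post): with cost J(θ) = θ², the gradient is 2θ, so one update from θ = 4 with learning rate 0.1 gives 4 − 0.1 × 8 = 3.2.

```python
theta = 4.0
learning_rate = 0.1
grad = 2 * theta                        # dJ/dtheta for J(theta) = theta**2
theta = theta - learning_rate * grad    # one gradient descent update
print(theta)                            # 4 - 0.1 * 8 = 3.2
```

A smaller learning rate would take a more cautious step (e.g. 0.01 gives 3.92); a larger one moves faster but can overshoot the minimum.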

One more point to mention here: unless the function is convex, the algorithm can get stuck at a local minimum instead of converging to the global minimum.
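A tiny sketch of this failure mode (the function below is my own toy example, not from the post): f(x) = x⁴ − 2x² + 0.5x has a global minimum near x ≈ −1.06 and a shallower local minimum near x ≈ 0.93. Gradient descent started on the right slope never sees the better valley on the left.

```python
def f(x):
    """Non-convex: global minimum near x = -1.06, local minimum near x = 0.93."""
    return x**4 - 2 * x**2 + 0.5 * x

def grad_f(x):
    return 4 * x**3 - 4 * x + 0.5

def descend(x, learning_rate=0.01, n_steps=500):
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

left = descend(-2.0)    # slides into the global minimum (negative x)
right = descend(2.0)    # gets stuck in the shallower local minimum (positive x)
print(left, right, f(left) < f(right))
```

Both runs follow the same rule; only the initial guess differs, which is why initialization matters so much for non-convex objectives like deep network losses.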

In mathematics, a real-valued function defined on an n-dimensional interval is called convex (or convex downward, or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph. This means that a strictly convex function has at most one minimum, so any minimum gradient descent finds is the global one.

In summary, gradient descent is an optimization method that finds a minimum of an objective function by incrementally updating its parameters in the negative direction of the gradient, which is the direction of steepest descent. I hope you now understand the gradient descent optimization technique. In the next post, I'll work through examples and discuss how the choice of learning rate and initial guess affects the algorithm's convergence. If you liked this post and want more posts in the future, please do like, share and subscribe.