How to customize optimizers to speed up and improve the process of finding a (local) minimum of the loss function using TensorFlow.

By Benoit Descamps, BigData Republic.

Introduction

Neural networks play a very important role when modeling unstructured data, such as in language or image processing. The idea of such networks is to simulate the structure of the brain using nodes and edges with numerical weights processed by activation functions. The output of such a network is usually a prediction, such as a classification, which is obtained by optimizing a loss function on a given target.



In a previous post, we discussed the importance of customizing this loss function for the case of gradient boosted trees. In this post, we discuss how to customize optimizers to speed up and improve the process of finding a (local) minimum of the loss function.

Optimizers

While the architecture of the neural network plays an important role when extracting information from data, most architectures are optimized through update rules based on the gradient of the loss function.

The update rules are determined by the Optimizer. The performance and update speed may vary heavily from optimizer to optimizer. The gradient tells us the update direction, but it is still unclear how large a step to take. Short steps keep us on track, but it might take a very long time until we reach a (local) minimum. Large steps speed up the process, but they might push us off in the wrong direction.
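A toy example (not from the original post) makes this trade-off concrete: plain gradient descent on f(x) = x^2, whose minimum is at x = 0, with three different step sizes.

```python
# Toy illustration: gradient descent on f(x) = x^2, with gradient f'(x) = 2x.
def gradient_descent(learning_rate, steps=50, x=1.0):
    for _ in range(steps):
        grad = 2.0 * x                 # gradient of f(x) = x^2 at the current x
        x = x - learning_rate * grad   # take a step against the gradient
    return x

small = gradient_descent(0.01)   # short steps: safe, but still far from 0
medium = gradient_descent(0.1)   # a well-chosen step size converges quickly
large = gradient_descent(1.1)    # too large: the iterates diverge

print(small, medium, large)
```

With the small step size the iterate is still noticeably away from the minimum after 50 steps, while the overly large step size overshoots further on every update.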

Adam [2] and RMSProp [3] are two very popular optimizers still being used in most neural networks. Both update the variables using an exponentially decaying average of the gradient and of its square.
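As a rough sketch (simplified notation, not the optimizers' actual implementations), these decaying averages look like:

```python
# Sketch of the exponentially decaying averages maintained by Adam-style
# optimizers: m tracks the gradient, v tracks the squared gradient.
def update_moments(m, v, grad, beta1=0.9, beta2=0.999):
    """One update of the running averages (beta1/beta2 are decay rates)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    return m, v

m, v = 0.0, 0.0
for g in [1.0, 1.0, -1.0]:   # a short, made-up gradient sequence
    m, v = update_moments(m, v, g)
print(m, v)
```

Note how the sign flip in the last gradient pulls m back toward zero, while v, built from squares, keeps growing; this difference is exactly what adaptive optimizers exploit.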

Research has been done into finding new optimizers, either by generating fixed numerical updates or algebraic rules.

Using a controller Recurrent Neural Network, a team [1] found two interesting new types of optimizers, PowerSign and AddSign, which are both performant and require fewer resources than the currently popular optimizers, such as Adam.

Implementing Optimizers in TensorFlow

TensorFlow is a popular Python framework for implementing neural networks. While the documentation is very rich, it is often a challenge to find your way through it.

In this blog post, I shall explain how one could implement PowerSign and AddSign.

An optimizer consists of two important steps:

compute_gradients(), which computes the gradients in the computational graph

apply_gradients(), which uses those gradients to update the variables

Before running the TensorFlow session, one should instantiate an Optimizer, as seen below:

# Gradient Descent
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)

tf.train.GradientDescentOptimizer is an object of the class GradientDescentOptimizer and as the name says, it implements the gradient descent algorithm.

The method minimize() is called with a "cost" as parameter and consists of the two methods compute_gradients() and then apply_gradients().
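Conceptually, minimize() simply chains the two. A plain-Python sketch of this two-step structure, with hypothetical helper names (not TensorFlow's internals):

```python
# Conceptual sketch (plain Python, hypothetical names) of how minimize()
# chains compute_gradients() and apply_gradients() for a 1-D parameter.
def compute_gradients(loss_grad_fn, var):
    """Step 1: evaluate the gradient of the loss at the current variable."""
    return loss_grad_fn(var)

def apply_gradients(var, grad, learning_rate=0.1):
    """Step 2: use the gradient to update the variable (plain gradient descent)."""
    return var - learning_rate * grad

def minimize(loss_grad_fn, var, steps=100):
    for _ in range(steps):
        grad = compute_gradients(loss_grad_fn, var)
        var = apply_gradients(var, grad)
    return var

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3); minimum at x = 3.
x = minimize(lambda v: 2.0 * (v - 3.0), 0.0)
print(x)  # close to 3.0
```

Customizing an optimizer means changing what happens in the second step while the gradient computation stays untouched.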

For this post, and the implementation of AddSign and PowerSign, we must take a closer look at this last step, apply_gradients().

This method relies on the (new) Optimizer (class), which we will create, to implement the following methods: _create_slots(), _prepare(), _apply_dense(), and _apply_sparse().

_create_slots() and _prepare() create and initialise additional variables, such as momentum.

_apply_dense() and _apply_sparse() implement the actual Ops, which update the variables. Ops are generally written in C++. Without having to change the C++ code yourself, you can still return a Python wrapper of some Ops through these methods.

This is done as follows:

def _create_slots(self, var_list):
    # Create slots for allocation and later management of additional
    # variables associated with the variables to train.
    # For example: the first and second moments.
    '''
    for v in var_list:
        self._zeros_slot(v, "m", self._name)
        self._zeros_slot(v, "v", self._name)
    '''

def _apply_dense(self, grad, var):
    # Define your favourite variable update.
    # For example:
    '''
    # Here we apply gradient descent by subtracting the variable
    # with the gradient times the learning_rate (defined in __init__)
    var_update = state_ops.assign_sub(var, self.learning_rate * grad)
    '''
    # The trick is now to pass the Ops to control_flow_ops and
    # eventually group any particular computation of the slots you
    # wish to keep track of:
    # For example:
    '''
    m_t = ...m... # do something with m and grad
    v_t = ...v... # do something with v and grad
    '''
    return control_flow_ops.group(*[var_update, m_t, v_t])

Let us now put everything together and show the implementation of PowerSign and AddSign.

First, you need the following modules for adding Ops,

# This class defines the API to add Ops to train a model.
from tensorflow.python.ops import control_flow_ops
from tensorflow.python.ops import math_ops
from tensorflow.python.ops import state_ops
from tensorflow.python.framework import ops
from tensorflow.python.training import optimizer
import tensorflow as tf

Let us now implement AddSign and PowerSign. Both optimizers are actually very similar and make use of the sign of the momentum m-hat and gradient g-hat for the update.

PowerSign

For PowerSign, the update of the variables w_(n+1) at the (n+1)-th epoch reads:

w_(n+1) = w_n - lr * alpha^( f_n * sign(g_n) * sign(m_n) ) * g_n,

where g_n is the gradient, m_n is the moving average of the gradient, and alpha is a hyperparameter (the paper suggests alpha = e).

The decay rate f_n in the following code is set to 1. I will not discuss it here, and refer to the paper [1] for more details.
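Before turning to the TensorFlow code, a plain-Python toy (with f_n = 1 and the paper's suggested alpha values, which differ from the defaults in the classes below) shows the effect of sign agreement on the step size:

```python
import math

# Toy illustration of PowerSign and AddSign step scaling (f_n = 1 assumed).
def sign(x):
    return (x > 0) - (x < 0)

def powersign_step(grad, m, lr=0.1, alpha=math.e):
    # PowerSign scales the step by alpha ** (sign(grad) * sign(m)).
    return lr * grad * alpha ** (sign(grad) * sign(m))

def addsign_step(grad, m, lr=0.1, alpha=1.0):
    # AddSign scales the step by (1 + alpha * sign(grad) * sign(m)).
    return lr * grad * (1.0 + alpha * sign(grad) * sign(m))

# Signs agree: both optimizers take a larger step.
print(powersign_step(0.5, 0.3))   # 0.1 * 0.5 * e
print(addsign_step(0.5, 0.3))     # 0.1 * 0.5 * 2
# Signs disagree: both optimizers damp the step.
print(powersign_step(0.5, -0.3))  # 0.1 * 0.5 / e
print(addsign_step(0.5, -0.3))    # 0.1 * 0.5 * 0
```

In both cases the step is enlarged when the current gradient agrees in sign with the running average, and shrunk (for AddSign with alpha = 1, zeroed) when they disagree.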

class PowerSign(optimizer.Optimizer):
    """Implementation of PowerSign.
    See [Bello et. al., 2017](https://arxiv.org/abs/1709.07417)
    @@__init__
    """
    def __init__(self, learning_rate=0.001, alpha=0.01, beta=0.5,
                 use_locking=False, name="PowerSign"):
        super(PowerSign, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._alpha = alpha
        self._beta = beta

        # Tensor versions of the constructor arguments, created in _prepare().
        self._lr_t = None
        self._alpha_t = None
        self._beta_t = None

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._alpha_t = ops.convert_to_tensor(self._alpha, name="alpha_t")
        self._beta_t = ops.convert_to_tensor(self._beta, name="beta_t")

    def _create_slots(self, var_list):
        # Create a slot for the moving average of the gradient.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        alpha_t = math_ops.cast(self._alpha_t, var.dtype.base_dtype)
        beta_t = math_ops.cast(self._beta_t, var.dtype.base_dtype)

        eps = 1e-7  # cap for moving average
        m = self.get_slot(var, "m")
        m_t = m.assign(tf.maximum(beta_t * m + eps, tf.abs(grad)))

        # Update 'ref' by subtracting 'value'.
        var_update = state_ops.assign_sub(
            var,
            lr_t * grad * tf.exp(tf.log(alpha_t) * tf.sign(grad) * tf.sign(m_t)))
        # Create an op that groups multiple operations;
        # when this op finishes, all ops in its input have finished.
        return control_flow_ops.group(*[var_update, m_t])

    def _apply_sparse(self, grad, var):
        raise NotImplementedError("Sparse gradient updates are not supported.")

AddSign

AddSign is very similar to PowerSign, replacing the exponential scaling of the gradient by the linear scaling (1 + alpha * sign(g_n) * sign(m_n)), as seen below:

class AddSign(optimizer.Optimizer):
    """Implementation of AddSign.
    See [Bello et. al., 2017](https://arxiv.org/abs/1709.07417)
    @@__init__
    """
    def __init__(self, learning_rate=0.001, alpha=0.01, beta=0.5,
                 use_locking=False, name="AddSign"):
        super(AddSign, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._alpha = alpha
        self._beta = beta

        # Tensor versions of the constructor arguments, created in _prepare().
        self._lr_t = None
        self._alpha_t = None
        self._beta_t = None

    def _prepare(self):
        self._lr_t = ops.convert_to_tensor(self._lr, name="learning_rate")
        self._alpha_t = ops.convert_to_tensor(self._alpha, name="alpha_t")
        self._beta_t = ops.convert_to_tensor(self._beta, name="beta_t")

    def _create_slots(self, var_list):
        # Create a slot for the moving average of the gradient.
        for v in var_list:
            self._zeros_slot(v, "m", self._name)

    def _apply_dense(self, grad, var):
        lr_t = math_ops.cast(self._lr_t, var.dtype.base_dtype)
        beta_t = math_ops.cast(self._beta_t, var.dtype.base_dtype)
        alpha_t = math_ops.cast(self._alpha_t, var.dtype.base_dtype)

        eps = 1e-7  # cap for moving average
        m = self.get_slot(var, "m")
        m_t = m.assign(tf.maximum(beta_t * m + eps, tf.abs(grad)))

        var_update = state_ops.assign_sub(
            var,
            lr_t * grad * (1.0 + alpha_t * tf.sign(grad) * tf.sign(m_t)))
        # Create an op that groups multiple operations;
        # when this op finishes, all ops in its input have finished.
        return control_flow_ops.group(*[var_update, m_t])

    def _apply_sparse(self, grad, var):
        raise NotImplementedError("Sparse gradient updates are not supported.")

Performance testing the Optimizers

The Rosenbrock function is a famous performance test for optimization algorithms. The function is non-convex and defined as

f(x, y) = (1 - x)^2 + 100 (y - x^2)^2.

The resulting shape is plotted in figure (1) below. As we can see, it has a global minimum at x = 1 and y = 1.
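For reference, a plain-Python version of the function (matching the TensorFlow expression used in the script below) confirms the minimum:

```python
# The Rosenbrock function: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2.
def rosenbrock(x, y):
    return (1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2

print(rosenbrock(1.0, 1.0))  # 0.0 at the global minimum (1, 1)
print(rosenbrock(0.0, 0.0))  # 1.0 away from it
```

The narrow, curved valley around the minimum is what makes this function hard for plain gradient descent and a good stress test for optimizers.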

The following script records the Euclidean distance between the true minimum and the approximation produced by a given optimizer over the epochs.

import numpy as np

def RosenbrockOpt(optimizer, MAX_EPOCHS=4000, MAX_STEP=100):
    '''
    Returns the distance of each step*MAX_STEP w.r.t the minimum (1, 1).
    '''
    x1_data = tf.Variable(
        initial_value=tf.random_uniform([1], minval=-3, maxval=3, seed=0),
        name='x1')
    x2_data = tf.Variable(
        initial_value=tf.random_uniform([1], minval=-3, maxval=3, seed=1),
        name='x2')

    # The Rosenbrock function: (1 - x1)^2 + 100 * (x2 - x1^2)^2.
    y = tf.add(tf.pow(tf.subtract(1.0, x1_data), 2.0),
               tf.multiply(100.0,
                           tf.pow(tf.subtract(x2_data,
                                              tf.pow(x1_data, 2.0)), 2.0)),
               'y')

    global_step_tensor = tf.Variable(0, trainable=False, name='global_step')

    train = optimizer.minimize(y, global_step=global_step_tensor)

    sess = tf.Session()
    init = tf.global_variables_initializer()
    sess.run(init)

    minx = 1.0
    miny = 1.0

    distance = []
    xx_ = sess.run(x1_data)
    yy_ = sess.run(x2_data)
    print(0, xx_, yy_, np.sqrt((minx - xx_) ** 2 + (miny - yy_) ** 2))

    for step in range(MAX_EPOCHS):
        _, xx_, yy_, zz_ = sess.run([train, x1_data, x2_data, y])
        if step % MAX_STEP == 0:
            print(step + 1, xx_, yy_, zz_)
            distance += [np.sqrt((minx - xx_) ** 2 + (miny - yy_) ** 2)]

    sess.close()
    return distance

A performance comparison of each optimizer is plotted below for a run of 4000 epochs.

While the performance varies heavily with the choice of hyperparameters, the extremely fast convergence of PowerSign is worth noticing.

Below, the coordinates of the approximations have been plotted for several epochs.

Epoch   RMSProp (x, y, z)       AddSign (x, y, z)       PowerSign (x, y, z)
0       (-2.39, -1.57, 4.26)    (-2.39, -1.57, 4.26)    (-2.39, -1.57, 4.26)
501     (0.66, 0.43, 0.13)      (0.41, 0.17, 0.34)      (0.97, 0.95, 0.00)
1001    (0.83, 0.67, 0.05)      (0.55, 0.30, 0.21)      (0.98, 0.96, 0.00)
2001    (0.93, 0.85, 0.03)      (0.69, 0.48, 0.09)      (0.98, 0.96, 0.00)
3001    (0.96, 0.92, 0.02)      (0.78, 0.60, 0.05)      (0.98, 0.97, 0.00)

Final Discussion:

TensorFlow allows us to create our own optimizers. Recent research has delivered two promising new optimizers, PowerSign and AddSign.

The fast early convergence of PowerSign makes it an interesting optimizer to combine with others such as Adam.

References:

[1] Bello, I., Zoph, B., Vasudevan, V., & Le, Q. V. (2017). Neural Optimizer Search with Reinforcement Learning. https://arxiv.org/abs/1709.07417

[2] Kingma, D. P., & Ba, J. L. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations, 1–13.

[3] Tieleman, T., & Hinton, G. (2012). Lecture 6.5 - RMSProp. COURSERA: Neural Networks for Machine Learning.

I found a lot of useful information through this Stack Overflow post, which I have attempted to bundle into this post.

Original. Reposted with permission.
