If you’ve stumbled upon this blog post, you’ve probably used policy gradient methods in Reinforcement Learning (RL). Or you might have maximised the likelihood in probabilistic models. In both cases, we need to estimate the gradient of the loss, which is an expectation over random variables.

The problem is that you cannot just differentiate the sampled objective, because the distribution we sample from itself depends on the parameters. Usually, you will apply the score function trick (aka the log-likelihood trick) here. We can view this trick as providing a differentiable function whose gradient is an estimate of the gradient of the original objective. We can then apply any deep learning toolbox to do automatic differentiation. However, sometimes we need higher-order gradients, e.g., in meta-learning, or in multi-agent RL when we need to differentiate through other agents’ learning steps. This makes life much harder.
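As a minimal sketch of the score function trick, assume a single Bernoulli variable x with success probability theta and a hypothetical cost f(x); both are chosen only for illustration. Enumerating the two outcomes shows that the expectation of f(x) times the gradient of log p(x; theta) matches the analytic gradient, and a sampled average approximates it:

```python
import random


def f(x):
    # Hypothetical cost, chosen only for illustration.
    return 3.0 if x == 1 else 1.0


def prob(x, theta):
    # p(x; theta) for x ~ Bernoulli(theta)
    return theta if x == 1 else 1.0 - theta


def score(x, theta):
    # d/dtheta log p(x; theta)
    return 1.0 / theta if x == 1 else -1.0 / (1.0 - theta)


def exact_gradient(theta):
    # J(theta) = theta * f(1) + (1 - theta) * f(0), so dJ/dtheta = f(1) - f(0)
    return f(1) - f(0)


def expected_score_estimate(theta):
    # Expectation of the score function estimator, enumerating x in {0, 1}
    return sum(prob(x, theta) * f(x) * score(x, theta) for x in (0, 1))


def sampled_score_estimate(theta, n_samples, seed=0):
    # Monte Carlo version: average f(x) * score(x) over samples of x
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = 1 if rng.random() < theta else 0
        total += f(x) * score(x, theta)
    return total / n_samples
```

The enumerated expectation is exact, while the sampled version is only correct on average, which is why variance reduction matters later on.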

The Infinitely Differentiable Monte Carlo Estimator (DiCE) [1] to the rescue! You can differentiate the DiCE objective as many times as you like and still get correct higher-order gradient estimators under the Stochastic Computation Graph (SCG) formalism [2]. This lets automatic differentiation software do the job instead of us manipulating the graph manually. We illustrate the benefits of our approach by applying “Learning with Opponent Learning Awareness” (LOLA) [3] to the iterated prisoner’s dilemma.

DiCE

As we mentioned above, the surrogate loss (SL) approach picks an objective whose gradient is an estimate of the true gradient of the original objective, and optimises this function instead.

Sadly, building a surrogate loss from the first-order gradient estimate and differentiating it again gives a wrong second-order gradient estimate. Simply put, applying the SL approach twice is not the same as estimating the second-order gradient of the true objective.

The wrong estimation happens because, in the SL approach, we treat part of the objective as a sampled cost. This causes the corresponding terms to lose a functional dependency on the sampling distribution.
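To make the lost dependency concrete, here is a worked sketch for a single stochastic node $x \sim p(x;\theta)$ with cost $f(x;\theta)$ and scalar $\theta$. The hat marks a quantity treated as a fixed sample; this notation is ours, for illustration only:

```latex
\begin{align*}
\nabla_\theta \mathbb{E}_x[f]
  &= \mathbb{E}_x\big[f\,\nabla_\theta \log p(x;\theta) + \nabla_\theta f\big]
   = \mathbb{E}_x[g], \\
g_{\mathrm{SL}}
  &= \hat{f}\,\nabla_\theta \log p(x;\theta) + \nabla_\theta f, \\
\nabla_\theta g
  &= \nabla_\theta f\,\nabla_\theta \log p(x;\theta)
   + f\,\nabla_\theta^2 \log p(x;\theta) + \nabla_\theta^2 f, \\
\nabla_\theta g_{\mathrm{SL}}
  &= \hat{f}\,\nabla_\theta^2 \log p(x;\theta) + \nabla_\theta^2 f .
\end{align*}
```

Both $g$ and $g_{\mathrm{SL}}$ evaluate to the same first-order estimate, but $\nabla_\theta g_{\mathrm{SL}}$ lacks the term $\nabla_\theta f\,\nabla_\theta \log p(x;\theta)$. In this example, a second application of the surrogate loss is therefore biased whenever the cost itself depends on $\theta$.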

We illustrate our reasoning graphically in the figure below using the Stochastic Computation Graph (SCG) formalism (Schulman et al. 2015).

We introduce the magic operator $\square$ (MagicBox), which lets us compute gradient estimators of any order we like simply by differentiating repeatedly.

DiCE is easy to implement:

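$$\square(\mathcal{W}) = \exp\big(\tau - \perp(\tau)\big), \qquad \tau = \sum_{w \in \mathcal{W}} \log p(w; \theta) \qquad (1)$$

where $\square$ is the magic operator, $\mathcal{W}$ is a set of stochastic nodes, and $\perp$ is an operator which sets the gradient of its operand to zero (detach in PyTorch and stop_gradient() in TensorFlow):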

    import tensorflow as tf  # TensorFlow 1.x API

    def magic_box(x):
        # Evaluates to 1 in the forward pass; its gradient is magic_box(x) * grad(x).
        return tf.exp(x - tf.stop_gradient(x))
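The following sketch checks the claim on a toy problem without TensorFlow. Everything here is an assumption for illustration: Jet is a hypothetical helper that carries a value together with its first and second derivative with respect to a scalar theta, stop_gradient mimics tf.stop_gradient, and the cost is made up. The expectation over a Bernoulli(theta) variable is taken by enumeration, and the DiCE objective is compared with a first-order surrogate loss that is naively differentiated twice:

```python
import math


class Jet:
    """Value, first and second derivative with respect to a scalar theta."""

    def __init__(self, v, d1=0.0, d2=0.0):
        self.v, self.d1, self.d2 = v, d1, d2

    def __add__(self, other):
        o = lift(other)
        return Jet(self.v + o.v, self.d1 + o.d1, self.d2 + o.d2)

    __radd__ = __add__

    def __neg__(self):
        return Jet(-self.v, -self.d1, -self.d2)

    def __sub__(self, other):
        return self + (-lift(other))

    def __rsub__(self, other):
        return lift(other) + (-self)

    def __mul__(self, other):
        o = lift(other)
        return Jet(self.v * o.v,
                   self.d1 * o.v + self.v * o.d1,
                   self.d2 * o.v + 2.0 * self.d1 * o.d1 + self.v * o.d2)

    __rmul__ = __mul__


def lift(x):
    # Wrap plain numbers as constants (zero derivatives).
    return x if isinstance(x, Jet) else Jet(float(x))


def exp(x):
    e = math.exp(x.v)
    return Jet(e, e * x.d1, e * (x.d2 + x.d1 ** 2))


def log(x):
    return Jet(math.log(x.v), x.d1 / x.v, x.d2 / x.v - (x.d1 / x.v) ** 2)


def stop_gradient(x):
    # Keep the value, drop all derivatives.
    return Jet(x.v)


def magic_box(x):
    return exp(x - stop_gradient(x))


def cost(x, theta):
    # Hypothetical cost that itself depends on theta.
    return (x - theta) * (x - theta)


def expected_objectives(theta_value):
    theta = Jet(theta_value, 1.0, 0.0)  # seed d(theta)/d(theta) = 1
    dice, surrogate = Jet(0.0), Jet(0.0)
    for x, p in ((1, theta), (0, 1.0 - theta)):
        c = cost(x, theta)
        weight = p.v  # probability of sampling x, a plain number
        dice = dice + weight * (magic_box(log(p)) * c)
        surrogate = surrogate + weight * (log(p) * stop_gradient(c) + c)
    return dice, surrogate
```

For this toy problem the true objective is J(theta) = theta * (1 - theta), so J' = 1 - 2 * theta and J'' = -2. The returned dice jet reproduces all three quantities, while the surrogate jet agrees only on the first derivative.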

Alternatively, we can rewrite DiCE in the following way:

(2)

The figure below shows an example of DiCE applied to an RL problem:

Variance Reduction

Variance reduction is an integral part of Monte Carlo estimation.

Though DiCE is not limited to the RL case, we are most interested in policy gradients that use the score function trick.

DiCE inherently reduces variance by taking causality into account. Each cost node $c$ is multiplied by the sum of the gradients of the log probabilities of only those stochastic nodes that influence $c$.
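A small sketch of why causality helps, under assumptions made purely for illustration: two independent Bernoulli(p) actions with p = sigmoid(theta), a first reward r1 = a1 and a second reward r2 = a1 * a2 that depends on both actions. Enumerating all outcomes gives the exact mean and variance of a first-order estimator that multiplies every reward by every score, and of one that only pairs each reward with the scores of the actions that influence it:

```python
from itertools import product


def estimator_stats(p=0.5):
    """Exact mean and variance of two first-order estimators, by enumeration."""
    samples = []
    for a1, a2 in product((0, 1), repeat=2):
        weight = (p if a1 else 1 - p) * (p if a2 else 1 - p)
        s1, s2 = a1 - p, a2 - p            # d/dtheta log pi(a_t) for a logit theta
        r1, r2 = float(a1), float(a1 * a2)
        full = (r1 + r2) * (s1 + s2)        # every reward sees every score
        causal = r1 * s1 + r2 * (s1 + s2)   # r1 only sees the score of a1
        samples.append((weight, full, causal))

    def mean_var(index):
        mean = sum(s[0] * s[index] for s in samples)
        second_moment = sum(s[0] * s[index] ** 2 for s in samples)
        return mean, second_moment - mean ** 2

    return mean_var(1), mean_var(2)
```

Both estimators have the same mean, because the dropped term r1 * s2 has zero expectation, but in this toy setup the causal one has a smaller variance.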

Now we propose another variance reduction mechanism by adding the following term to the DiCE objective:

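$$\mathcal{B}^{(1)}_{\square} = \sum_{w \in \mathcal{S}} \big(1 - \square(\{w\})\big)\, b_w \qquad (3)$$

where $\square$ is the magic operator, $\mathcal{S}$ is the set of stochastic nodes, and $b_w$ is any function of nodes not influenced by $w$. The baseline keeps the gradient estimation unbiased and, because $1 - \square(\{w\})$ evaluates to zero, does not influence the evaluation of the original objective.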
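At first order, a baseline of this kind turns the estimator r * score into (r - b) * score. The sketch below illustrates the effect by enumeration on a toy problem; the single Bernoulli(p) action with p = sigmoid(theta), the reward r = 10 + a and the particular baseline value are assumptions chosen to make the effect visible:

```python
def baseline_stats(p=0.5, baseline=0.0):
    """Exact mean and variance of (r - b) * score for a ~ Bernoulli(p).

    With p = sigmoid(theta), the score d/dtheta log pi(a) equals a - p.
    The reward is r = 10 + a; the offset of 10 is purely illustrative.
    """
    outcomes = [(p, 1), (1 - p, 0)]  # (probability, action)
    values = [(w, ((10.0 + a) - baseline) * (a - p)) for w, a in outcomes]
    mean = sum(w * v for w, v in values)
    var = sum(w * v ** 2 for w, v in values) - mean ** 2
    return mean, var
```

Calling baseline_stats(0.5) and baseline_stats(0.5, baseline=10.5) gives the same mean in both cases, while the variance drops sharply once the expected reward is subtracted.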

The flaw of this first-order baseline becomes apparent when we calculate second-order gradients. In short, some of the terms do not have control variates, which keeps the variance high.

To fix the problem, we can subtract the following term from the objective to reduce the second-order gradient variance:

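$$\mathcal{B}^{(2)}_{\square} = \sum_{w \in \mathcal{S}'} \big(1 - \square(\mathcal{W}_w \setminus \{w\})\big)\big(1 - \square(\{w\})\big)\, b_w \qquad (5)$$

where $\square$ is the magic operator, $\mathcal{W}_w \setminus \{w\}$ is the set of stochastic nodes, other than $w$ itself, that depend on $\theta$ and influence $w$, and $\mathcal{S}'$ is the set of stochastic nodes that depend on $\theta$ and at least one other stochastic node.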
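A short worked derivation shows why such a term only acts at second order. It is a sketch under assumptions: scalar $\theta$, a baseline $b_w$ treated as a constant, and the shorthand (ours) $A$ for the magic operator applied to the stochastic nodes preceding $w$ and $B$ for the magic operator applied to $w$ alone. The arrow $\mapsto$ denotes evaluation in the forward pass:

```latex
\begin{align*}
T &= (1 - A)(1 - B)\, b_w, \qquad A \mapsto 1, \quad B \mapsto 1, \\
\nabla_\theta T
  &= -\big[\nabla_\theta A\,(1 - B) + (1 - A)\,\nabla_\theta B\big]\, b_w
  \;\mapsto\; 0, \\
\nabla_\theta^2 T
  &= \big[2\,\nabla_\theta A\,\nabla_\theta B
      - \nabla_\theta^2 A\,(1 - B) - (1 - A)\,\nabla_\theta^2 B\big]\, b_w
  \;\mapsto\; 2\, b_w\, \nabla_\theta A\, \nabla_\theta B .
\end{align*}
```

Since the gradient of a magic operator evaluates to the sum of the gradients of the log probabilities of its nodes, the surviving term is a product of the score of $w$ and the scores of the nodes before it. The term therefore leaves the objective and the first-order estimate untouched and acts as a control variate for exactly those cross terms in the second-order estimate.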

Code example

To show DiCE in action, we apply it to the iterated prisoner’s dilemma (IPD). In IPD, two agents iteratively play matrix games where they can either (C)ooperate or (D)efect. The first agent’s payoffs are the following: -2 (DD), 0 (DC), -3 (CD), -1 (CC).
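The payoffs above can be written down as a small lookup table. This is a sketch, not the environment used in the experiments: it assumes that the first letter of each pair is the agent's own action, that the game is symmetric, and it uses a hypothetical discount factor for the return helper:

```python
# Payoffs of the first agent, keyed by (own action, opponent action),
# taken from the text: DD = -2, DC = 0, CD = -3, CC = -1.
PAYOFF = {
    ("D", "D"): -2,
    ("D", "C"): 0,
    ("C", "D"): -3,
    ("C", "C"): -1,
}


def step(action_1, action_2):
    """Return (reward_1, reward_2) for one round.

    Assumes a symmetric game, so the second agent's payoff is looked up
    with the roles swapped.
    """
    return PAYOFF[(action_1, action_2)], PAYOFF[(action_2, action_1)]


def discounted_return(rewards, gamma=0.96):
    # gamma is an illustrative discount factor, not a value from this post
    return sum(r * gamma ** t for t, r in enumerate(rewards))
```

Mutual cooperation gives each agent a higher payoff than mutual defection, yet defecting is the better reply to either action in a single round, which is what makes the repeated game interesting.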

Let’s build policies for both agents first:

    def build_policy(scope, env, theta, max_steps, reuse=None):
        pi = {}
        with tf.variable_scope(scope, reuse=reuse):
            # placeholders and variables initialisation omitted for brevity
            # acs is short for actions
            logits = tf.reduce_sum(
                tf.multiply(pi['obs_ph'],
                            tf.reshape(pi['theta'], shape=(1, 1, -1))),
                axis=-1, keepdims=True)
            logits = tf.concat([logits, tf.zeros_like(logits)], -1)
            pi['value'] = tf.reduce_sum(
                tf.multiply(pi['obs_ph'],
                            tf.reshape(pi['theta_val'], shape=(1, 1, -1))),
                axis=-1, keepdims=True)
            pi['value_target'] = tf.reduce_sum(
                tf.multiply(pi['obs_ph'],
                            tf.reshape(pi['theta_val_target'], shape=(1, 1, -1))),
                axis=-1, keepdims=True)
            pi['log_pi'] = tf.nn.log_softmax(logits)
            pi['acs_onehot'] = tf.one_hot(
                pi['acs_ph'], env.NUM_ACTIONS, dtype=tf.float32)
            pi['log_pi_acs'] = tf.reduce_sum(
                tf.multiply(pi['log_pi'], pi['acs_onehot']), axis=-1)
            pi['pi'] = tf.nn.softmax(logits)
            pi['pi_acs'] = tf.reduce_sum(
                tf.multiply(pi['pi'], pi['acs_onehot']), axis=-1)
            ac_logp0_cumsum = [tf.reshape(pi['log_pi_acs'][0], [1, -1])]
            for i in range(1, max_steps):
                ac_logp0_cumsum.append(
                    tf.add(ac_logp0_cumsum[-1], pi['log_pi_acs'][i]))
            pi['log_pi_acs_cumsum'] = tf.concat(ac_logp0_cumsum, 0)
            pi['predict'] = tf.squeeze(tf.multinomial(
                tf.reshape(pi['log_pi'], shape=(-1, env.NUM_ACTIONS)), 1))
            pi['loss_value'] = tf.reduce_mean(
                tf.reduce_sum(
                    tf.pow(tf.squeeze(pi['value']) - pi['target'], 2),
                    axis=0
                )  # sum over all steps
            )  # average over all batches
        return pi


    policies = [build_policy("pi_%d" % (i + 1), env, theta, max_steps)
                for i, theta in enumerate(thetas)]

Now, let’s build the DiCE objective:

    def get_dice_objective(scope, policies):
        dependencies = magic_box(
            sum(pi['log_pi_acs_cumsum'] for pi in policies))
        # first-order baseline
        baseline = 1 - magic_box(sum(pi['log_pi_acs'] for pi in policies))
        # second-order baseline
        baseline_h = tf.multiply(1 - dependencies[:-1], baseline[1:])
        losses = [
            tf.reduce_mean(
                tf.reduce_sum(tf.multiply(pi['rews_ph'], dependencies), axis=0)
            ) + tf.reduce_mean(
                tf.reduce_sum(tf.multiply(tf.squeeze(
                    tf.stop_gradient(
                        tf.multiply(pi['value'], pi['discount_vec']))
                ), baseline), axis=0)
            ) - tf.reduce_mean(
                tf.reduce_sum(tf.multiply(tf.squeeze(
                    tf.stop_gradient(
                        tf.multiply(pi['value'], pi['discount_vec']))[1:]
                ), baseline_h), axis=0)
            )
            for pi in policies
        ]
        return losses


    v_1_player, v_2_player = get_dice_objective("delta", policies)

Computing the gradient or Hessian with respect to the parameters is then just a matter of calling tf.gradients() or tf.hessians() on the objectives:

    grad_p = [tf.gradients(v, theta)[0]
              for v, theta in zip([v_1_player, v_2_player], thetas_all)]
    h_ps = [tf.hessians(v, theta)
            for v, theta in zip([v_1_player, v_2_player], thetas_all)]

You can find the complete working example here.

Empirical Results

Let’s now see the empirical verification of DiCE. From the figure below we can see that the second-order baseline helps us to match the analytically derived Hessian, whereas the first-order one fails to do that.









The following figure shows that, although the quality of the gradient estimation increases with the sample size, the first-order baseline alone does not reach the performance that the second-order baseline achieves. The results including the second-order baseline are in orange; the ones for the first-order baseline only are in blue.

Finally, we show how DiCE helps us get better performance on IPD using LOLA [3]. When we compare LOLA-DiCE agents with the original formulation, we find that LOLA-DiCE agents discover strategies of high social welfare, replicating the results of the original LOLA paper in a way that is both more direct and more efficient.

As we can see in the figure below, the second-order baseline dramatically improves LOLA performance on the IPD problem:

Conclusion

In this post, we have described DiCE, a general method for computing gradient estimators of any order for stochastic computation graphs. DiCE is easy to implement, yet it lets us use the full power of auto-differentiation software without manually constructing the graph for each order of the gradient. We believe DiCE will be a stepping stone for further exploration of higher-order learning methods in meta-learning, reinforcement learning, and other applications of stochastic computation graphs.

Whether you want to build upon DiCE or are just interested to find out more, you can find our implementation here. For PyTorch lovers there is also an implementation by Alexis David Jacq.

References

Blogpost: Vitaly Kurin, Jakob Foerster, Shimon Whiteson.