I describe here our recent ICLR paper [1] [code] [talk], which introduces a novel method for model-based reinforcement learning. The main author of this work is Stefan Depeweg, a PhD student at the Technical University of Munich whom I am co-supervising.

The key contribution is in our models: Bayesian neural networks with random inputs, whose input layer contains not only the usual input features but also random variables that are propagated forward through the network and transformed into an arbitrary noise signal at the output layer.

The random inputs enable our models to automatically capture complex noise patterns, improving the quality of our model-based simulations and producing better policies in practice.

Problem description

We address the problem of policy search in stochastic dynamical systems, for example, when operating industrial systems such as gas turbines:

An abstraction of these systems is shown below. The current state of the system is denoted by s_t, and associated with each state s_t there is a cost c(s_t) given by the function c( · ). At each time step we apply an action a_t that will affect the state of the system at the next time step, s_t+1.

The transition from s_t to s_t+1 is determined not only by the action a_t, but also by some noise signal which we cannot control. This noise signal is represented by the dice in the figure. In the turbine example, noise originates because the state that we observe consists only of sensor measurements, which are an incomplete description of the true state of the system.

To control the system, we can use a policy function a_t = 𝜋(s_t;𝜃) to map the current state s_t into an action a_t. For example, 𝜋( · ;𝜃) could be a neural network with weights 𝜃.
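For illustration, here is a minimal PyTorch sketch of such a policy network. The architecture, the state/action dimensions, and the Tanh squashing (which assumes bounded actions) are placeholders, not the choices used in the paper:

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Deterministic policy a_t = pi(s_t; theta) as a small MLP."""
    def __init__(self, state_dim, action_dim, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # assumes actions in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

policy = Policy(state_dim=3, action_dim=1)
a = policy(torch.randn(10, 3))  # actions for a batch of 10 states
```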

Our goal is to find a policy (a value of 𝜃) that produces, on average, low values of the cost function over state trajectories. For example, we aim to minimize cost(𝜃), defined as the expectation of

∑_{t=1}^{T} c(s_t).
Note that the above expression is random because it depends on the choice of the initial state s_1 and on the random noise in the state transitions.

Batch reinforcement learning

We consider the batch reinforcement learning scenario in which we will not interact with the system during learning. This scenario is common in real-world industrial settings such as turbine control, where exploration is restricted to avoid possible damage to the system.

Therefore, to find an optimal policy we can only rely on a batch of data D = {(s_t, a_t, s_t+1)} containing state transitions collected from an already running system; we will not be able to gather any additional data.

First, we learn from D a model for p(s_t+1|s_t, a_t), that is, the predictive distribution for the next state s_t+1 as a function of the current state s_t and the action a_t applied. We then connect this model with the policy to obtain p(s_t+1|s_t, a_t = 𝜋(s_t;𝜃)), which describes the evolution of the system when controlled with the policy 𝜋( · ;𝜃).

The previous distribution can be used to perform roll-outs or simulations of state trajectories. We start with a randomly sampled state s_1 and then iteratively sample from p(s_t+1|s_t, a_t = 𝜋(s_t;𝜃)) to obtain a trajectory of states s_1,…,s_T.

The cost function can then be evaluated on the sampled s_1,…,s_T to approximate cost(𝜃). The gradients of this approximation can be used to perform stochastic optimization of 𝜃, moving in directions that on average produce low values of cost(𝜃).
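Putting the pieces together, the sketch below shows one such stochastic optimization step on 𝜃. It assumes a hypothetical `model(s, a)` function that returns a differentiable (reparameterized) sample of s_t+1, a per-state cost function `cost_fn`, and the `Policy` module from the earlier sketch; none of these names come from the paper's code:

```python
import torch

def estimate_cost(policy, model, cost_fn, s1, T=50):
    """Monte Carlo estimate of cost(theta): average cumulative cost
    over simulated trajectories of length T started at states s1."""
    s, total = s1, 0.0
    for _ in range(T):
        a = policy(s)    # a_t = pi(s_t; theta)
        s = model(s, a)  # sample s_{t+1} ~ p(s_{t+1} | s_t, a_t)
        total = total + cost_fn(s).mean()
    return total

# One stochastic gradient step on theta. Gradients flow through the
# sampled trajectories as long as the model samples are reparameterized.
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
optimizer.zero_grad()
cost = estimate_cost(policy, model, cost_fn, s1=torch.randn(100, 3))
cost.backward()
optimizer.step()
```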

The effect of noise in optimal control

The optimal policy can be significantly affected by the noise present in the state transitions. This is illustrated by the drunken spider story, which was originally proposed by Bert Kappen [2] and which we use here as a motivating example.

A spider has two possible paths to go home: either crossing the bridge or walking around the lake. In the absence of noise, the bridge option is preferred since it is shorter. However, after heavily drinking alcohol, the spider's movements may randomly deviate left or right. Since the bridge is narrow, and spiders do not like swimming, the preferred trajectory is now to walk around the lake.

The previous example shows how noise can significantly affect optimal control. For example, the optimal policy may change depending on whether the level of noise is high or low. Therefore, we expect to obtain significant improvements in model-based reinforcement learning by capturing with high accuracy any noise patterns present in the state transition data.

Bayesian neural networks with random inputs

In practice, most modeling approaches for state transition data simply assume additive Gaussian noise in s_t+1, namely,

s_t+1 = f_W(s_t, a_t) + 𝜖_t,  with 𝜖_t ∼ N(0, Σ),
where f_W is, for example, a neural network with weights W. In this case it is very easy to learn W by maximum likelihood. However, the assumption of additive Gaussian noise is unlikely to hold in real-world settings.

A more flexible model for the noise in the transition dynamics can be obtained by using random inputs in f_W. In particular, we can assume that

s_t+1 = f_W(s_t, a_t, z_t) + 𝜖_t,  with z_t ∼ N(0, 1),
Under this model, the input noise variables z_t can be transformed in complicated ways through f_W to produce arbitrary random patterns in s_t+1 as a function of s_t and a_t.
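To make the contrast concrete, below is a minimal PyTorch sketch of both model classes for a single fixed weight sample W; in the Bayesian treatment described next, W would itself be drawn from an (approximate) posterior. All architecture choices and names are illustrative:

```python
import torch
import torch.nn as nn

class AdditiveGaussianModel(nn.Module):
    """s_{t+1} = f_W(s_t, a_t) + eps, with eps ~ N(0, diag(sigma^2))."""
    def __init__(self, state_dim, action_dim, hidden=50):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))
        self.log_sigma = nn.Parameter(torch.zeros(state_dim))

    def forward(self, s, a):
        mean = self.f(torch.cat([s, a], dim=-1))
        return mean + self.log_sigma.exp() * torch.randn_like(mean)

class RandomInputModel(nn.Module):
    """s_{t+1} = f_W(s_t, a_t, z_t), with z_t ~ N(0, 1).
    The network can transform z_t nonlinearly, so the output noise
    can be bimodal, heteroskedastic, etc."""
    def __init__(self, state_dim, action_dim, hidden=50):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim))

    def forward(self, s, a):
        z = torch.randn(s.shape[:-1] + (1,))  # random input
        return self.f(torch.cat([s, a, z], dim=-1))
```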

However, now learning W can no longer be done by maximum likelihood because the z_t are unknown. A solution is instead to follow a Bayesian approach and work with the posterior distribution over W and the z_t. This distribution captures our uncertainty about the possible values that these variables may have taken after we see the data in D.

Computing the exact posterior is intractable, but we can learn a Gaussian approximation. The parameters of this approximation can be adjusted by minimizing a divergence with respect to the true posterior. Variational Bayes (VB) is a popular method for this that works by minimizing the Kullback-Leibler divergence.

α-divergence minimization

Instead of using VB, we learn our factorized Gaussian approximation q by minimizing the α-divergence [3,4]. By changing the value of α in this divergence, we can smoothly interpolate between solutions that fit a mode in the true posterior p or that aim to cover multiple modes in p, as illustrated in the following figure:

Interestingly, VB is a particular case of α-divergence minimization, obtained in the limit α → 0. Another well-known method for approximate Bayesian inference, expectation propagation, is obtained with α = 1. In our experiments we use α = 0.5, since this often produces better probabilistic predictions in practice [4].
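As a hedged illustration, the sketch below implements a common Monte Carlo form of the black-box α-divergence data term from [4]; the full objective also adds a KL term between q and the prior, and the exact estimator used in the paper may differ in details:

```python
import math
import torch

def alpha_energy_data_term(log_liks, alpha=0.5):
    """Monte Carlo estimate of the BB-alpha data term.
    log_liks: tensor of shape (K, N), where log_liks[k, n] is
    log p(y_n | x_n, W_k) for K weight samples W_k ~ q and N data points.
    Computes -(1/alpha) * sum_n log( (1/K) sum_k exp(alpha * log_liks[k, n]) ).
    As alpha -> 0, this recovers the negative expected log-likelihood
    term of variational Bayes."""
    K = log_liks.shape[0]
    inner = torch.logsumexp(alpha * log_liks, dim=0) - math.log(K)
    return -inner.sum() / alpha

# Example with dummy log-likelihoods: 20 weight samples, 100 data points.
energy = alpha_energy_data_term(torch.randn(20, 100), alpha=0.5)
```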

Results on a toy example

The figure below illustrates the results of our Bayesian neural networks with random inputs in two toy examples. The training data for each example is shown in the leftmost column. The top row shows a problem with a bi-modal predictive distribution. The bottom row shows a problem with heteroskedastic noise (the noise magnitude is input dependent).

The middle column shows the predictions obtained with a model that assumes only additive Gaussian noise. This model is unable to capture the bi-modality or the heteroskedasticity in the data. The rightmost column shows the predictions of our Bayesian neural networks with random inputs, which can automatically identify the type of random patterns present in the data.

Results on the wetchicken problem

We consider now a reinforcement learning benchmark in which a canoeist is paddling on a two-dimensional river, as shown in the leftmost plot in the figure below. There is a drift in the river that pushes the canoeist towards a waterfall located at the top, with the drift being stronger on the right and weaker on the left. If the canoeist falls down the waterfall, he has to start again at the bottom of the river.

There is also turbulence in the river, which becomes stronger on the left and weaker on the right. The canoeist receives a higher reward the closer he is to the waterfall. Therefore, he will want to stay close to the waterfall, but not so close that he risks falling down it. This problem is called wetchicken because of its similarities with the game of chicken.

The turbulence and the waterfall make wetchicken a highly stochastic benchmark: the possibility of falling down the waterfall induces a bi-modality in the state transitions, whereas the varying turbulence introduces heteroskedasticity.
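To make these two noise patterns concrete, here is a toy sketch of a wetchicken-style transition in Python; the constants and the reset rule are illustrative approximations of the described behavior, not the exact benchmark equations:

```python
import numpy as np

def wetchicken_step(x, y, ax, ay, width=5.0, length=5.0, rng=np.random):
    """Toy sketch of a wetchicken-style transition (illustrative only;
    the exact benchmark dynamics differ in details)."""
    drift = 3.0 * x / width                  # drift: stronger on the right
    turbulence = 3.5 - drift                 # turbulence: stronger on the left
    s = turbulence * rng.uniform(-1.0, 1.0)  # heteroskedastic disturbance
    x_new = float(np.clip(x + ax, 0.0, width))
    y_new = y + ay + drift + s
    if y_new >= length:         # fell down the waterfall: restart at the
        return x_new, 0.0       # bottom, which makes transitions bi-modal
    return x_new, max(y_new, 0.0)
```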

The plot in the middle of the figure visualizes the policy found using our Bayesian neural networks with random inputs. This is a nearly optimal policy in which the canoeist paddles to try to stay at the location x ≃ 3.5 and y ≃ 2.5.

The plot on the right shows the policy found with a Gaussian process (GP) model that assumes only additive Gaussian noise. The resulting policy performs very poorly in practice because the GP cannot capture the complex noise patterns present in the data.

Results on the industrial benchmark

We also evaluate the performance of our Bayesian neural networks with random inputs in experiments with a simulator of industrial systems called the "industrial benchmark" [5]. According to the authors: "The 'industrial benchmark' aims at being realistic in the sense, that it includes a variety of aspects that we found to be vital in industrial applications."

The figure below shows, for a fixed sequence of actions, the roll-outs produced with three models: 1) a multilayer perceptron (MLP) which assumes additive Gaussian noise, and our Bayesian neural networks trained with 2) variational Bayes (VB) or 3) α-divergence minimization with α = 0.5. Simulated trajectories are shown in blue and the ground-truth trajectory generated by the "industrial benchmark" is shown in red.

This figure clearly shows how the roll-outs produced with our Bayesian neural networks with random inputs and α-divergence minimization are closer to the ground truth trajectory.

Conclusions

We have seen that it is important to account for complicated noise patterns in the transition dynamics when learning optimal policies. Our Bayesian neural networks with random inputs are state-of-the-art models for capturing such complex noise patterns. By minimizing α-divergences with α = 0.5, we are able to perform accurate approximate inference in such Bayesian neural networks. This allows us to produce realistic model-based simulations that can be used to learn better policies.

Further reading

In [6] we study the decomposition of uncertainty in the predictions of Bayesian neural networks with random inputs. Uncertainty originates from either a) the lack of knowledge about the network weights due to limited data (epistemic uncertainty), or b) the random inputs to the network (aleatoric uncertainty). In [6] we show how to separate these two types of uncertainty with applications to active learning and safe reinforcement learning.
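Concretely, one way to carry out this separation is through the law of total variance: the predictive variance splits into the variance across weight samples of the per-weight mean prediction (epistemic) plus the average across weight samples of the variance induced by z (aleatoric). A minimal sketch for a scalar output, assuming predictions arranged by weight sample and z sample:

```python
import torch

def decompose_uncertainty(preds):
    """Law-of-total-variance decomposition for a scalar prediction.
    preds: tensor of shape (K, M) with predictions for K weight samples
    W_k ~ q, each evaluated with M random-input samples z_m ~ N(0, 1).
    Var[y] = Var_W[ E_z[y|W] ]  (epistemic)
           + E_W[ Var_z[y|W] ]  (aleatoric)."""
    epistemic = preds.mean(dim=1).var()  # variance across weight samples
    aleatoric = preds.var(dim=1).mean()  # mean variance across z samples
    return epistemic, aleatoric
```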

We also recommend this excellent blog post by Alex Kendall about the aforementioned two types of uncertainty in deep neural networks for computer vision.

References

[1] Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F. and Udluft, S. Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks. In ICLR, 2017.

[2] Kappen, H. J. Path integrals and symmetry breaking for optimal control theory. Journal of Statistical Mechanics: Theory and Experiment, page P11011, 2005.

[3] Minka, T. P. Divergence measures and message passing. Technical report, Microsoft Research, 2005.

[4] Hernández-Lobato, J. M., Li, Y., Rowland, M., Bui, T. D., Hernández-Lobato, D. and Turner, R. E. Black-Box Alpha Divergence Minimization. In ICML, 2016.

[5] Hein, D., Hentschel, A., Sterzing, V., Tokic, M. and Udluft, S. Introduction to the "industrial benchmark". arXiv preprint arXiv:1610.03793, 2016.

[6] Depeweg, S., Hernández-Lobato, J. M., Doshi-Velez, F. and Udluft, S. Uncertainty Decomposition in Bayesian Neural Networks with Latent Variables. arXiv preprint arXiv:1706.08495, 2017.