We first formulate finite horizon probabilistic planning tasks with terminal rewards only. Later we will generalize to infinite horizons where rewards can be received at any point in time. Let $x_t$ denote the $d$-dimensional continuous state of a behaving agent at time $t$. The goal of the agent is to find the sequence of states $x_{0:T}$ that maximizes the total received reward (i.e., the return) at the end of the trial.

Such planning tasks can be modeled as inference problems1,2 where the joint distribution over state sequences and returns is given by

$$p(r, x_{0:T}) = p(r \mid x_{0:T})\, p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}).$$

The distribution $p(x_0)$ encodes an initial prior over states, $p(x_t \mid x_{t-1})$ corresponds to the state transition model and the distribution $p(r \mid x_{0:T})$ determines the probability of receiving the return $r$ given the trajectory of states $x_{0:T}$. As in related work1,2,24 we assume that $p(r \mid x_{0:T})$ can be factorized; for terminal rewards it reduces to $p(r \mid x_T)$. In this formulation, $r$ denotes a binary random variable, where, without loss of generality, maximizing the probability of observing such a binary return event is, modulo a rescaling of the rewards, equivalent to the original reward maximization problem1.
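The rescaling can be made concrete with a small sketch. The exponential mapping and all names below are illustrative assumptions rather than the paper's definition; the snippet merely shows one standard way to turn a bounded return into the probability of a binary reward event.

```python
import numpy as np

# Hypothetical sketch: map a real-valued return R to the probability of a
# binary reward event r. The exponential rescaling (an assumption here) is a
# common choice in planning-as-inference, shifted so that the probability
# stays in [0, 1] for returns bounded by R_max.
def reward_event_prob(R, R_max=1.0, beta=1.0):
    """p(r = 1 | x_{0:T}) = exp(beta * (R - R_max)) <= 1 for R <= R_max."""
    return np.exp(beta * (np.asarray(R) - R_max))

print(reward_event_prob([0.2, 0.9, 1.0]))  # higher return -> higher event probability
```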

An agent can use such an internal model of the environment to plan a sequence of movements by solving the inference problem

$$p(x_{0:T} \mid r = 1) = \frac{1}{Z}\, p(r = 1 \mid x_T)\, p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}), \qquad (1)$$

where $Z$ is a normalization term that guarantees that equation (1) is normalized. In our formulation of the planning problem, the actions in the probabilistic model are integrated out. It is assumed that the actions can subsequently be inferred from the posterior over state sequences.

The unconstrained process for planning models a freely moving agent by

$$q(x_{0:T}) = p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}). \qquad (2)$$

Sampling from this probability distribution can be implemented by a recurrent network of spiking neurons (e.g., using ideas from7,8,9). However, it is not straightforward for a recurrent network to solve the inference problem in equation (1), which requires integrating future returns backward in time; only local temporal information is available when sampling from the network. Such temporal models differ from model-based Markov decision process methods that encode global value or Q functions25.
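To illustrate what the network must improve upon, the following sketch draws trajectories from a discrete-state version of the unconstrained process in equation (2) by plain ancestral (forward) sampling; this alone cannot solve equation (1), since it ignores the future return. The transition matrix below is a hypothetical random walk, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ancestral sampling from the unconstrained process in equation (2) for a
# discrete state space. P[i, j] = p(x_t = j | x_{t-1} = i) is a row-stochastic
# transition matrix; p0 is the initial prior p(x_0).
def sample_trajectory(P, p0, T):
    x = [rng.choice(len(p0), p=p0)]
    for _ in range(T):
        x.append(rng.choice(P.shape[1], p=P[x[-1]]))
    return np.array(x)

# Illustrative random walk on a 9-state linear track (stay/left/right).
K = 9
P = np.zeros((K, K))
for i in range(K):
    nbrs = [j for j in (i - 1, i, i + 1) if 0 <= j < K]
    P[i, nbrs] = 1.0 / len(nbrs)

print(sample_trajectory(P, np.full(K, 1.0 / K), T=10))
```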

We propose here a solution to this problem that relies on replacing the true posterior with a model distribution $p_\theta(x_{0:T})$, where sampling from $p_\theta$ is implemented with an extended neural network architecture and $p_\theta(x_{0:T})$ is the neural approximation of $p(x_{0:T} \mid r = 1)$. The parameters $\theta$ are learned such that the Kullback-Leibler divergence between the true posterior for planning in equation (1) and the model distribution converges to zero.

Planning with recurrent neural networks

We propose a recurrent spiking neural network to implement planning. Our network consists of two populations of neurons, which we denote by Y and V (see Fig. 1A). V is a layer of K state neurons that control the state (e.g., the agent's spatial location) of a freely moving agent. These neurons receive lateral connections from neighboring state neurons with weights $w_{ki}$ and connections from all N neurons in a population of context neurons Y with weights $\theta_{kj}$. The context neurons produce spatiotemporal spike patterns that represent high-level goals and context information (e.g., the target state that should be reached after T time steps). We show that the probabilistic planning problems defined in equation (1) can be implemented in the network by training the synapses $\theta_{kj}$.

Figure 1 Illustration of the model for finite horizon planning. (A) The neural network architecture considered here for solving the probabilistic planning problem. A recurrent layer of state neurons (green) that control the behavior of the agent receives feedforward input from context neurons (blue), whose activity determines the desired goal. (B,C) A simple planning problem that requires passing through a passage at two specific points in time. The superposition of network activity averaged over 100 trial runs (B) and the decoded network states (C) are shown. Blue dots in (B) show one example spike train. (D) The accumulated number of rewards for the spiking network. (E) The Kullback-Leibler divergence between the learned distribution and the true posterior. (C,D) show averages over 100 trial runs.

We denote the activity of the state neurons at time t by the binary vector $v_t = (\nu_{t,1}, \ldots, \nu_{t,K})$, where $\nu_{t,k} = 1$ if neuron k spiked at time t and $\nu_{t,k} = 0$ otherwise. Discrete random variables $x_t$ can be encoded as a multinomial distribution, where one neuron maps to one state instance. For continuous variables a simple encoding scheme is used, i.e., $x_t = \sum_{k=1}^{K} \nu_{t,k}\, p_k$, where $p_k$ is the preferred position of state neuron k.
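A minimal sketch of this population code; the averaging over multiple simultaneous spikes is our assumption (with winner-take-all dynamics, introduced below, exactly one neuron is active anyway).

```python
import numpy as np

# Preferred positions of K = 9 state neurons on a unit-length track.
p_k = np.linspace(0.0, 1.0, 9)

def decode_state(nu_t, p_k):
    """Decode the position encoded by the binary spike vector nu_t."""
    nu_t = np.asarray(nu_t, dtype=float)
    return nu_t @ p_k / max(nu_t.sum(), 1.0)  # mean of active preferred positions

nu_t = np.zeros(9); nu_t[4] = 1.0             # neuron 5 fires
print(decode_state(nu_t, p_k))                # -> 0.5, the middle of the track
```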

Analogously, we define the spiking activity of the context neurons at time t by a binary vector $y_t = (y_{t,1}, \ldots, y_{t,N})$. Using these definitions of $v_t$ and $y_t$, we define the membrane potential $u_{t,k}$ and firing probability $\rho_{t,k}$ of state neuron k at time t by

$$u_{t,k} = \sum_{i=1}^{K} w_{ki}\, \nu_{t-1,i} + \sum_{j=1}^{N} \theta_{kj}\, y_{t,j}, \qquad \rho_{t,k} = f(u_{t,k}). \qquad (3)$$

The function $f(u_{t,k})$ denotes the activation function, where we only require that it is differentiable. The probability that the network generates a spike sequence of length T starting from a given initial state $v_0$ is thus

$$p_\theta(\nu_{1:T} \mid v_0) = \prod_{t=1}^{T} \prod_{k=1}^{K} \rho_{t,k}^{\nu_{t,k}}\, (1 - \rho_{t,k})^{1 - \nu_{t,k}}. \qquad (4)$$
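The sketch below implements one step of these dynamics as reconstructed in equation (3), with Bernoulli spike sampling as in equation (4); the weight values and dimensions are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# One step of the network dynamics: lateral input from the previous state
# spikes v_prev (weights w, K x K) plus feedforward input from the context
# spikes y_t (weights theta, K x N), passed through a differentiable
# activation f to give the firing probabilities rho_{t,k}.
def network_step(v_prev, y_t, w, theta, f=sigmoid):
    u_t = w @ v_prev + theta @ y_t                       # membrane potentials
    rho_t = f(u_t)                                       # firing probabilities
    v_t = (rng.random(len(u_t)) < rho_t).astype(float)   # Bernoulli spikes
    return v_t, rho_t, u_t

K, N = 9, 20
w = rng.normal(0.0, 0.5, (K, K)); theta = np.zeros((K, N))
v = np.zeros(K); v[4] = 1.0
y = np.zeros(N); y[0] = 1.0
v, rho, u = network_step(v, y, w, theta)
```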

We assume that the transition model (encoded in the synaptic weights $w_{ki}$) is known or was acquired in a pre-learning phase, e.g., using contrastive divergence learning26. Under this assumption, we define the goal of probabilistic planning as minimizing the Kullback-Leibler divergence between the true posterior for planning in equation (1) and the model distribution

$$\mathrm{KL}\!\left(p(x_{0:T} \mid r = 1)\, \|\, p_\theta(x_{0:T})\right) = -\mathbb{E}_{p(x_{0:T} \mid r = 1)}\!\left[\log p_\theta(x_{0:T})\right] - H(p), \qquad (5)$$

where $H(p)$ denotes the entropy of the true data distribution. Thus, solving the inference problem in equation (1) is equivalent to minimizing the Kullback-Leibler divergence in equation (5).

Typically, the true posterior is unknown and we cannot draw samples from it. However, we can draw samples from the model distribution and update the parameters such that the probability of receiving a reward event is maximized,

$$\Delta\theta = \eta\; \mathbb{E}_{p_\theta(x_{0:T})}\!\left[\, p(r = 1 \mid x_{0:T})\; \nabla_\theta \log p_\theta(x_{0:T}) \,\right], \qquad (6)$$

where $\eta$ denotes a small learning rate. Note that this general update rule is the result of a standard maximum likelihood formulation, where we exploited that $p(x_{0:T} \mid r = 1) \propto p(r = 1 \mid x_{0:T})\, q(x_{0:T})$ using equation (1) and equation (2). The update is an instance of the Expectation-Maximization (EM) algorithm27, where evaluating the expectation with respect to $p_\theta$ corresponds to the E-step and the parameter update realizes the M-step. The update is also related to policy gradient methods28,29,30, with the difference that we interpret the parameters $\theta$ as having the role of linear controls31.

To derive the update rule for the proposed neural network architecture, the network dynamics in equation (4) are used in equation (6); for a detailed derivation we refer to the supplement. The spiking network update rule reads

$$\Delta\theta_{kj} = \eta\; \mathbb{E}_{p_\theta}\!\left[\, r \sum_{t=1}^{T} \left(\nu_{t,k} - \rho_{t,k}\right) g'(u_{t,k})\; y_{t,j} \right], \qquad (7)$$

where $g_{t,k} = \log\!\left(\rho_{t,k} / (1 - \rho_{t,k})\right)$ are the log-odds of neuron k firing at time t and $g'$ denotes the derivative with respect to the membrane potential. Equation (7) is the general learning rule for arbitrary differentiable activation functions $f$. It adapts the weights $\theta_{kj}$ to maximize the return. For many relevant activation functions (e.g., exponential or sigmoid functions), equation (7) turns into a simple reward-modulated Hebbian-type update rule.
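For the sigmoid case, where the log-odds derivative $g'(u)$ equals one, a single application of equation (7) for one sampled trial could look as follows. This is a sketch under our reconstruction of the rule, with hypothetical array shapes.

```python
import numpy as np

# Sigmoid special case of equation (7): reward-modulated Hebbian update.
# nus, rhos, ys are the recorded spikes, firing probabilities, and context
# activity of one trial, with shapes (T, K), (T, K), and (T, N).
def update_theta(theta, nus, rhos, ys, r, eta=0.05):
    grad = np.einsum('tk,tn->kn', nus - rhos, ys)  # sum_t (nu_t - rho_t) y_t^T
    return theta + eta * r * grad
```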

We will compare online and offline updates of equation (7). In its stochastic online variant, the E-step is approximated by sampling a finite set of L samples to estimate the expectation32, or, in the simplest case, by a single sample (L = 1) as done in our experiments. We refer to this as the online approximation of equation (7). With offline updates, implemented as batch learning, the KL divergence between the true posterior for planning in equation (1) and the model distribution converges to zero for L → ∞ (assuming an exact encoding of the state and the transition model). This KL divergence establishes the relation between the inference problem for planning in equation (1) and the introduced problem of finding the network parameters that maximize the expected return.
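A sketch contrasting the two variants: the offline update averages the gradient estimate over L sampled trials before changing the parameters, while L = 1 recovers the online rule. Here run_trial is a hypothetical helper assumed to sample one trajectory and return the recorded quantities.

```python
import numpy as np

# Offline (batch) approximation of equation (7): average per-trial gradients
# over L samples, then apply one parameter update. run_trial(theta) is
# assumed to return (nus, rhos, ys, r) for one sampled trajectory.
def offline_update(theta, run_trial, L=50, eta=0.05):
    grad = np.zeros_like(theta)
    for _ in range(L):
        nus, rhos, ys, r = run_trial(theta)
        grad += r * np.einsum('tk,tn->kn', nus - rhos, ys)
    return theta + eta * grad / L
```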

A finite horizon planning task

To evaluate the spiking neural network model we consider a simple one-dimensional planning problem, where the agent moves on a linear track and the activity of the state neuron population V directly determines its position. K = 9 state neurons encode nine discrete locations. A final reward is only received if the agent passes through two obstacles, one at time T/2 and one at time T (see Fig. 1C). Furthermore, the agent is constrained not to jump to distant states within one time step. We model this constraint through the state transition model, i.e., $p(x_t \mid x_{t-1})$ is nonzero only for the current and directly neighboring locations and (close to) zero otherwise (see the supplement for further details).

Due to the limitation on the state transitions, this problem requires planning ahead in order to avoid the obstacles successfully, i.e., the agent has to start moving toward the passage before the obstacle actually appears. We show that the optimal planning policy can be learned using the reward-modulated update rule in equation (7) in a network where the state neurons follow (soft) winner-take-all (WTA) dynamics. The probability $\rho_{t,k}$ of neuron k to spike at time t is given by $\rho_{t,k} = \exp(u_{t,k}) / \sum_{k'} \exp(u_{t,k'})$. Thus, in each time step exactly one state neuron is active and encodes the current position of the agent.
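A minimal sketch of this soft-WTA step; the max-subtraction is a standard numerical stabilization, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Soft winner-take-all step: firing probabilities are the softmax of the
# membrane potentials, and exactly one state neuron spikes per time step.
def wta_step(u_t):
    rho_t = np.exp(u_t - u_t.max())
    rho_t /= rho_t.sum()                        # rho_{t,k} = softmax(u_t)_k
    k = rng.choice(len(u_t), p=rho_t)
    v_t = np.zeros(len(u_t)); v_t[k] = 1.0      # one-hot spike vector
    return v_t, rho_t
```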

The precise timing required to solve this task can be learned if the context neurons provide sufficient temporal structure. We study here the case of only one context neuron being active per time step, i.e., $y_{t,j} = 1$ for j = t and $y_{t,j} = 0$ otherwise. The weights $\theta_{kj}$ were adapted according to the online approximation of equation (7). Prior to learning, the agent performs a random walk according to the state transition model encoded in the weights $w_{kj}$, completing successful trials only occasionally. As learning proceeds, the activity of the context neurons shapes the behavior of the agent, leading to nearly optimal performance. Figure 1D shows the accumulated reward throughout learning. After 5000 training iterations, the network generates rewarded trajectories in 97.80 ± 4.64% of the trials. We also evaluated a more detailed spiking version of the network model, which produced similar results (success rate: 87.40 ± 15.08%, see Fig. 1B,C in the supplement).
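Putting the pieces together, the following self-contained sketch trains a soft-WTA network on a small version of the linear-track task with the online rule (L = 1). The passage times and positions, the start state, and all constants are illustrative assumptions; the paper's exact task parameters are given in the supplement.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative finite horizon task: K = 9 track positions, one context neuron
# per time step, soft-WTA state neurons, online reward-modulated updates.
K, T, eta = 9, 10, 0.2
passage = {T // 2: 2, T: 6}                    # time step -> required position

# Fixed transition weights: large negative values forbid distant jumps.
w = np.full((K, K), -8.0)
for i in range(K):
    for j in (i - 1, i, i + 1):
        if 0 <= j < K:
            w[j, i] = 0.0

theta = np.zeros((K, T + 1))                   # learned context weights
for trial in range(5000):
    v = np.zeros(K); v[4] = 1.0                # start in the middle
    nus, rhos, states = [], [], [4]
    for t in range(1, T + 1):
        y = np.zeros(T + 1); y[t] = 1.0        # context neuron t is active
        u = w @ v + theta @ y
        rho = np.exp(u - u.max()); rho /= rho.sum()
        k = rng.choice(K, p=rho)
        v = np.zeros(K); v[k] = 1.0
        nus.append(v); rhos.append(rho); states.append(k)
    r = float(all(states[t] == pos for t, pos in passage.items()))
    for t in range(1, T + 1):                  # online update (L = 1)
        theta[:, t] += eta * r * (nus[t - 1] - rhos[t - 1])
```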

In addition to the online learning rule, we also evaluated the offline update rule. The network draws samples from a fixed distribution, simulating random walks without any input (the initial state distribution p(v 0 ) was uniform). Offline updates are applied to the parameters θ, and the Kullback-Leibler divergence converges toward zero with an increasing number of updates, as shown in Fig. 1E.

Extension to the infinite horizon problem

Previously, we demonstrated how our network can model finite horizon planning tasks with terminal rewards (i.e., returns). Here we generalize to infinite horizon planning problems where rewards can be received at any point in time. The goal of the planning problem is to optimize the parameters θ in the neural network so that it generates infinite trajectories that maximize the expected total discounted reward $\mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]$, where γ is the discount factor. We can reformulate this planning problem as probabilistic inference in an infinite mixture of Markov chains of finite lengths T1. The corresponding mixture distribution over trajectories is given by

$$p_\theta(\nu, r) = \sum_{T=0}^{\infty} p(T)\; p_\theta(\nu_{1:T} \mid v_0)\; p(r \mid v_T),$$

where $p(T)$ is the prior distribution over trajectory lengths and $p_\theta(\nu_{1:T} \mid v_0)$ is the distribution over spike trains of length T according to the network dynamics in equation (4). The probability of getting a reward r at the end of the trajectory is given by $p(r \mid v_T)$. Intuitively, in infinite horizon planning tasks the agent seeks a solution that balances getting to the goal as fast as possible (imposed by the prior) against the cost of large state jumps (imposed by the state transition model).
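One way to make this mixture concrete is a geometric length prior $p(T) = (1 - \gamma)\gamma^{T}$, a standard choice in the planning-as-inference literature1; whether the paper uses exactly this form is our assumption. Sampling a horizon from it and rolling the network out for T steps then yields samples from the mixture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sketch of the mixture-of-finite-chains view: draw a trajectory length T
# from the (assumed) geometric prior p(T) = (1 - gamma) * gamma**T, then run
# the network dynamics for T steps and record the terminal reward.
def sample_horizon(gamma):
    return rng.geometric(1.0 - gamma) - 1       # support {0, 1, 2, ...}

# Sanity check: the mean horizon is gamma / (1 - gamma) = 49 for gamma = 0.98.
print(np.mean([sample_horizon(0.98) for _ in range(10000)]))
```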

For the infinite horizon model we consider network dynamics where each state neuron has a sigmoid activation function, i.e., $f(u_{t,k}) = \sigma(u_{t,k})$ with $\sigma(u) = 1/(1 + e^{-u})$. Using this activation function, we find that for learning the infinite horizon planning task the parameters $\theta_{kj}$ should undergo a change in each time step t where the reward is present, according to

$$\Delta\theta_{kj} = \eta\; r_t\; e_{t,kj}.$$

This synaptic weight update can be realized using an eligibility trace33 $e_{t,kj}$ associated with each synapse, with dynamics

$$e_{t,kj} = \gamma\; e_{t-1,kj} + \left(\nu_{t,k} - \rho_{t,k}\right) y_{t,j}.$$

The eligibility trace is updated in each time step, whereas the weight updates are only applied at time steps where a reward is present ($r_t \neq 0$). More details on the learning rule can be found in the supplement.
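A sketch of one time step of this trace-based update, under our reconstruction of the dynamics above (the γ-decay of the trace is part of that reconstruction):

```python
import numpy as np

# One time step of the eligibility-trace update: the trace decays with the
# discount factor and accumulates the Hebbian term; the weight change is
# gated by the (possibly zero) reward received at this step.
def trace_step(e, theta, nu_t, rho_t, y_t, r_t, gamma=0.98, eta=0.05):
    e = gamma * e + np.outer(nu_t - rho_t, y_t)   # e_t = gamma e_{t-1} + (nu - rho) y^T
    if r_t != 0.0:
        theta = theta + eta * r_t * e             # update only when reward is present
    return e, theta
```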

Note that the precise timing of attracting or repelling states cannot be modeled through explicit context neurons per time step as in the finite horizon model (since the trajectory length T is not fixed a priori). Therefore, we consider stationary activity patterns of the context neurons. This assumption implies that after convergence of the parameter updates an attractor cannot be visited twice.

An infinite horizon planning task

To test the infinite horizon model we consider a planning task where the goal of the agent is to navigate from a given initial state to a target state in a grid maze with obstacles (dimensions [15 × 20]). The network has 300 state neurons, one for each grid cell. The agent can perform only one-step moves (left, right, up, or down), with equally probable transitions in each direction, and receives a reward only at the target state. The sampling process is terminated either when the target state is reached or when the time step exceeds the maximum number of allowed steps (T = 300). The discount factor was γ = 0.98.
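An illustrative encoding of such a maze; the obstacle layout below is a placeholder, as the paper's maze is specified in the supplement.

```python
import numpy as np

# 15 x 20 grid with one state neuron per cell, uniform one-step transitions
# to free neighboring cells, and a single rewarded target cell.
H, W = 15, 20
free = np.ones((H, W), dtype=bool)
free[5:10, 8] = False                       # hypothetical wall segment
target = (14, 19)                           # hypothetical target cell

def neighbors(cell):
    r, c = cell
    steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(i, j) for i, j in steps if 0 <= i < H and 0 <= j < W and free[i, j]]

def reward(cell):
    return 1.0 if cell == target else 0.0

print(neighbors((0, 0)), reward(target))
```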

With the offline learning rule, the learned parameters θ set up a gradient towards the target state, which covers multiple solution trajectories that lead to high total received rewards (θ 0 is chosen such that the agent starts at the initial state). This gradient is indicated by the radii of the dots in the first row of Fig. 2A. Figure 2B illustrates 12 example trajectories generated with the weights obtained after 10000 trials of learning. Nine of the 12 shown trajectories reached the target state, which is denoted by the black horizontal lines.

Figure 2 Illustration of the model for infinite horizon planning. (A) The agent has to move from the red cross to the black cross. The radii of the dots are proportional to the log of the θ parameters. The results for the offline and the online learning rules are shown in the two rows, respectively. (B) Illustration of 12 sampled trajectories after 10000 trials of offline learning. (C) The mean of the received rewards over 20 experiments. We compare to Monte-Carlo policy evaluation (MC).

With the online learning rule, the learned parameters θ specialize on one locally optimal path through the maze, which is illustrated in the second row of Fig. 2A. In the evaluated example, there are two locally optimal trajectories, which are also the global optima; they are shown in the inset of Fig. 2C. For both the offline and the online updates, the average received reward converges to the maximum value (see Fig. 2C), where we compare to Monte-Carlo policy evaluation (MC)25.