The two-step task alone is not sufficient to determine whether meta-RL can learn to implement truly prospective planning. Akam et al. (2015) identified at least two non-prospective strategies ("reward-as-cue" and "latent-state") that can masquerade as prospective planning in the two-step task. Although we can rule out the possibility that meta-RL uses the reward-as-cue strategy (see Supplemental Figure 4), we cannot rule out the latent-state strategy. (Akam and colleagues, 2015, acknowledge that the latent-state strategy requires a form of model to infer the current latent state from observed reward outcomes (Kolling et al., Nature Neuroscience, 19, 1280, 2016); this is, however, a relatively impoverished notion of model-based control compared with fully prospective reasoning using a transition model.) Nonetheless, other studies hint that true prospective planning may arise when recurrent neural networks are trained by RL in a multi-task environment. Recurrent neural networks can be trained to select reward-maximizing actions when given inputs representing Markov decision problem (MDP) parameters (e.g., images of random mazes) (Tamar et al., Neural Information Processing Systems, 2146, 2016; Werbos et al., IEEE International Conference on Systems, Man and Cybernetics, 1764, 1996). A recent study by Duan and colleagues (arXiv, 1611.02779, 2016) also showed that recurrent networks trained on a series of random MDPs quickly adapt to unseen MDPs, outperforming standard model-free RL algorithms. Finally, recurrent networks can learn through RL to perform tree-search-like operations supporting action selection (Silver et al., arXiv, 1612.08810, 2016; Hamrick et al., Neural Information Processing Systems, 2016; Graves et al., Nature, 538, 471, 2016). We therefore sought to directly test prospective planning in meta-RL.
To this end, we designed a novel revaluation task in which the agent was given five steps to act in a 32-state MDP whose transition structure was fixed across training and testing (for experiments using related tasks, see Kurth-Nelson et al., Neuron, 91, 194, 2016; Huys et al., PNAS, 112, 3098, 2015; Keramati et al., PNAS, 113, 12868, 2016; Lee et al., Neuron, 81, 687, 2014). The reward function was randomly permuted from episode to episode and given as part of the input to the agent. Because planning requires iterative computations, the agent was given a ‘pondering’ (Graves, arXiv, 1603.08983, 2016) period of five steps at the start of each episode.
A) Transition structure of the revaluation task. From each state, two actions were available (red and green arrows). For visualization, nodes are placed to minimize distances between connected nodes. Rewards were sampled on each episode by randomly permuting a 32-element vector containing ten entries of +1, 21 entries of −1, and one entry of +5; node size in the diagram shows a single sampled reward function. There were over a billion possible distinct reward functions, orders of magnitude more than the number of training episodes. Episodes began in a random state. Meta-RL’s network architecture was identical to that of our other simulations, except that the LSTM had 128 units.
B) Meta-RL reached near-optimal performance when tested on reward functions not included in training. Notably, meta-RL outperformed a ‘greedy’ agent, which followed the shortest path toward the largest reward. Meta-RL also outperformed an agent using the successor representation (SR). Q(1): Q-learning with eligibility trace; Rand.: random action selection. Bars indicate standard error.
C) Correlation between the network’s value output (‘baseline’) and ground-truth future reward grows during the pondering period (steps 1 through 5).
This indicates that the network used the pondering period to perform calculations that steadily improved its accuracy in predicting future reward.
D) Canonical correlation between the LSTM hidden state and several task variables, computed independently at each time step. Because “last action” was given as part of the input to the network, we orthogonalized the hidden state against this variable before calculating the canonical correlation. Unsurprisingly, we found that a linear code for action appeared most robustly at the step when the action was taken (the first action occurred on step 6, after five steps of ‘pondering’). However, this signal also ramped up before the onset of action. (The network cannot have a perfect representation of which action it will take, because the network’s output is passed through a noisy softmax function to determine the actual action given to the environment.) The network also maintained a strong representation of which state it currently occupied.
We note that the network’s knowledge of the transition probabilities was acquired through RL. However, our theory does not exclude the existence of other neural learning mechanisms capable of identifying transition probabilities and other aspects of task structure. Indeed, there is overwhelming evidence that the brain identifies sequential and causal structure independent of reward (Glascher et al., Neuron, 66, 585, 2010; Tolman et al., Psychological Review, 55, 189, 1948). In subsequent work, it will be interesting to consider how such learning mechanisms might interact and synergize with meta-RL (e.g., Hamrick et al., Neural Information Processing Systems, 2016).
All analyses here were performed over 1000 evaluation episodes, using one fully trained network. Other simulations using slightly different hyperparameters were conducted but not reported, and yielded very similar results.
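The reward-function construction in panel A, and the kind of five-step lookahead against which "near-optimal" performance can be judged, can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the study: the transition table is a randomly generated stand-in (the actual 32-state structure is not reproduced here), transitions are treated as deterministic, reward is assumed to be collected on arrival in the successor state, and all function names are hypothetical. It is not the simulation code used for the reported results.

```python
import math
import random

N_STATES = 32
N_ACTIONS = 2
HORIZON = 5  # number of acting steps per episode (from the text)

# Reward multiset from the text: ten +1, twenty-one -1, one +5.
REWARD_VALUES = [1] * 10 + [-1] * 21 + [5]


def count_reward_functions():
    """Distinct permutations of the reward multiset: 32! / (10! * 21! * 1!)."""
    return math.factorial(32) // (
        math.factorial(10) * math.factorial(21) * math.factorial(1)
    )


def sample_reward_function(rng):
    """Randomly permute the 32-element reward vector (one draw per episode)."""
    rewards = list(REWARD_VALUES)
    rng.shuffle(rewards)
    return rewards


def make_hypothetical_transitions(rng):
    """ASSUMPTION: a random deterministic graph with two successors per state.
    This is a placeholder, not the transition structure used in the study."""
    return [
        [rng.randrange(N_STATES) for _ in range(N_ACTIONS)]
        for _ in range(N_STATES)
    ]


def plan_optimal(transitions, rewards, horizon):
    """Finite-horizon backward induction.
    Returns, for each start state, the best total reward over `horizon` steps.
    ASSUMPTION: reward is collected on arrival in the successor state."""
    values = [0.0] * N_STATES
    for _ in range(horizon):
        values = [
            max(rewards[nxt] + values[nxt] for nxt in transitions[s])
            for s in range(N_STATES)
        ]
    return values


def evaluate_random(transitions, rewards, horizon):
    """Expected total reward under uniformly random action selection."""
    values = [0.0] * N_STATES
    for _ in range(horizon):
        values = [
            sum(rewards[nxt] + values[nxt] for nxt in transitions[s]) / N_ACTIONS
            for s in range(N_STATES)
        ]
    return values


if __name__ == "__main__":
    rng = random.Random(0)
    transitions = make_hypothetical_transitions(rng)
    rewards = sample_reward_function(rng)
    optimal = plan_optimal(transitions, rewards, HORIZON)
    baseline = evaluate_random(transitions, rewards, HORIZON)
    print("distinct reward functions:", count_reward_functions())
    print("mean optimal return:", sum(optimal) / N_STATES)
    print("mean random return:", sum(baseline) / N_STATES)
```

Under these assumptions, `count_reward_functions()` evaluates to 1,419,269,280, consistent with the statement that there were over a billion distinct reward functions. The backward-induction planner stands in for an optimal reference agent only; the greedy, SR, and Q(1) comparison agents are not implemented here.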