The architecture of the supervised network (grid network, light blue dashed) was incorporated into a larger deep reinforcement learning network, including a visual module (green dashed) and an actor–critic learner (based on A3C41; dark blue dashed). In this case the supervised learner does not receive the ground truth \(\vec{c}_0\) and \(\vec{h}_0\) to signal its initial position, but uses input from the visual module to self-localize after placement at a random position within the environment. Visual module: since experimental evidence suggests that place cell input to grid cells functions to correct for drift and to anchor grids to environmental cues21,27, visual input was processed by a convolutional network to produce place cell (and head direction cell) activity patterns, which were used as input to the grid network. The output of the vision module was provided to the grid network only 5% of the time (see Methods for implementation details), akin to the occasional observations of salient environmental cues made by behaving animals27. The output of the vision module was concatenated with \(\vec{u}, \vec{v}, \vec{\sin\dot{\varphi}}, \vec{\cos\dot{\varphi}}\) to form the input to the grid LSTM, which is the same network as in the supervised case (see Methods and Extended Data Fig. 1). The actor–critic learner (dark blue dashed) receives as input the concatenation of \(\vec{e}'_t\), produced by a convolutional network, with the reward \(r_t\), the previous action \(a_{t-1}\), the linear layer activations of the grid cell network \(\vec{g}_t\) (current grid code), and the linear layer activations observed the last time the goal was reached, \(\vec{g}_*\) (goal grid code), which is set to zero if the goal has not been reached in the episode. This concatenated input was passed through a fully connected layer followed by an LSTM with 256 units. The LSTM has two different outputs.
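The input to the grid LSTM described above can be sketched as a concatenation of the self-motion signals with the vision module's place/head-direction output, where the vision channels are supplied on only ~5% of steps. This is a minimal numpy sketch under assumptions not stated in the caption: the population sizes (`N_PLACE`, `N_HD`) are illustrative, and withheld vision is modelled by zeroing those channels, which is one plausible reading of "provided 5% of the time".

```python
import numpy as np

rng = np.random.default_rng(0)

N_PLACE, N_HD = 256, 12  # sizes of place / head-direction outputs (assumed)
T = 100                  # steps in an episode (assumed)

def grid_lstm_input(u, v, sin_dphi, cos_dphi, place, hd, vision_mask):
    """Concatenate self-motion inputs with the occasionally provided vision output.

    When vision is withheld (vision_mask == 0.0) the place/HD channels are
    zeroed, so the grid LSTM must path-integrate from self-motion alone.
    """
    vis = np.concatenate([place, hd]) * vision_mask
    return np.concatenate([[u, v, sin_dphi, cos_dphi], vis])

# Vision is provided on roughly 5% of steps, mimicking occasional cue observations.
masks = (rng.random(T) < 0.05).astype(float)
x = grid_lstm_input(0.1, 0.0, np.sin(0.2), np.cos(0.2),
                    rng.random(N_PLACE), rng.random(N_HD), masks[0])
```

On steps without vision the grid code is driven purely by the velocity inputs, matching the self-localization role the caption assigns to the (intermittent) visual correction signal.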
The first output, the actor, is a linear layer with six units followed by a softmax activation function, which represents a categorical distribution over the agent's next action \(\vec{\pi}_t\). The second output, the critic, is a single linear unit that estimates the value function \(v_t\).
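The two output heads can be sketched directly from this description: a six-unit linear layer with a softmax giving the policy, and a single linear unit giving the value estimate. This is an illustrative numpy sketch; the weight initialization is arbitrary (in the actual agent these heads are trained jointly with the LSTM by A3C).

```python
import numpy as np

rng = np.random.default_rng(0)

H, N_ACTIONS = 256, 6  # LSTM hidden size (from the text) and action count

# Hypothetical head weights, for illustration only.
W_pi, b_pi = rng.standard_normal((N_ACTIONS, H)) * 0.01, np.zeros(N_ACTIONS)
W_v,  b_v  = rng.standard_normal(H) * 0.01, 0.0

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_critic_heads(h_t):
    """Actor: categorical distribution pi_t over 6 actions. Critic: scalar v_t."""
    pi_t = softmax(W_pi @ h_t + b_pi)
    v_t = float(W_v @ h_t + b_v)
    return pi_t, v_t

pi, v = actor_critic_heads(rng.standard_normal(H))
```

The next action is then sampled from `pi` (e.g. `rng.choice(N_ACTIONS, p=pi)`), while `v` serves as the baseline for the A3C advantage estimate.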