In the 1975 film Tommy, the “deaf, dumb, and blind” protagonist overcomes substantial sensory limitations to capture a pinball championship. While it’s difficult to imagine playing a video game without being able to see the screen, that was the challenge taken up by AI researchers from INESC-ID and Instituto Superior Técnico in Lisbon and Pittsburgh’s Carnegie Mellon University. Using cross-modality transfer techniques and reinforcement learning (RL), the researchers produced an agent that can play video games with only the game audio to guide it.

In some respects, an RL policy that is learned from image and sound inputs but still succeeds when only sound is available mirrors something that comes as second nature to humans: making the most of whatever sensory data is at hand. We use touch and hearing, for example, to navigate a dark room.

The new cross-modality transfer RL approach explores how latent representations built by advanced variational autoencoder (VAE) methods might enable RL agents to learn and transfer policies over different input modalities.

The researchers combined different input modalities in a latent space, enabling the RL agent to establish mappings between them. The trained RL agent was then directed to perform tasks with its access restricted to a particular subset of the available modalities (e.g., image). Finally, the RL agent was given access to a different subset of modalities (e.g., sound) and performed the task again. The researchers liken this “three-step pipeline” to learning a perceptual model of the world, learning policies to act in the world, and transferring policies. They used the pipeline to build the RL agent “AVAEs DQN.”
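To make the idea of a shared latent space more concrete, here is a minimal sketch of one common way to tie two VAE encoders together: a symmetric KL penalty that pulls the image and sound latent distributions toward each other. The article does not spell out the paper's exact loss, so the function names (`gaussian_kl`, `alignment_penalty`) and the diagonal-Gaussian parameterization are illustrative assumptions.

```python
import numpy as np


def gaussian_kl(mu_p, logvar_p, mu_q, logvar_q):
    """KL(p || q) between diagonal Gaussians, summed over the latent dimensions."""
    var_p, var_q = np.exp(logvar_p), np.exp(logvar_q)
    per_dim = 0.5 * (logvar_q - logvar_p + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)
    return per_dim.sum(axis=-1)


def alignment_penalty(mu_img, logvar_img, mu_snd, logvar_snd):
    """Hypothetical symmetric KL term: small when both encoders agree on the latent."""
    return (gaussian_kl(mu_img, logvar_img, mu_snd, logvar_snd)
            + gaussian_kl(mu_snd, logvar_snd, mu_img, logvar_img))


# Two encoders that describe the same frame slightly differently.
mu_img, logvar_img = np.array([1.0, 0.0]), np.zeros(2)
mu_snd, logvar_snd = np.array([0.0, 0.0]), np.zeros(2)
print(alignment_penalty(mu_img, logvar_img, mu_snd, logvar_snd))
```

In a full model, a term like this would be added to each encoder's usual reconstruction and prior losses, so that a policy reading the latent cannot easily tell which modality produced it.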

To test their model, the researchers developed a “hyper shot” scenario inspired by the Atari Space Invaders video game. As in the classic game, the RL agent must shoot waves of descending attackers while avoiding their return fire.

The researchers set the observations to include both image and sound components. The RL agent learned to play the game (act in the world) based on image observations and developed policies that mapped the latent space to actions. The RL agent’s performance was then evaluated with only the sound observation, reusing the policies developed in the previous step; the latent space serves as the mechanism that maps one input modality onto the other.
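The three steps can be illustrated with a deliberately tiny stand-in. The sketch below is not the researchers' implementation: it swaps the VAE for PCA plus a least-squares map, swaps the DQN for one-step linear Q-value regression, and uses a made-up two-action world. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy world: a hidden 2-D state seen through two modalities.
N, STATE_DIM, IMG_DIM, SND_DIM = 2000, 2, 16, 8
state = rng.normal(size=(N, STATE_DIM))
image = state @ rng.normal(size=(STATE_DIM, IMG_DIM)) + 0.05 * rng.normal(size=(N, IMG_DIM))
sound = state @ rng.normal(size=(STATE_DIM, SND_DIM)) + 0.05 * rng.normal(size=(N, SND_DIM))

# Step 1: perceptual model. PCA on images defines the latent space, and a
# least-squares map sends paired sound observations into that same space.
img_mean, snd_mean = image.mean(axis=0), sound.mean(axis=0)
_, _, vt = np.linalg.svd(image - img_mean, full_matrices=False)
pca_basis = vt[:STATE_DIM].T


def encode_image(x):
    return (x - img_mean) @ pca_basis


snd_to_latent, *_ = np.linalg.lstsq(sound - snd_mean, encode_image(image), rcond=None)


def encode_sound(x):
    return (x - snd_mean) @ snd_to_latent


# Step 2: learn a policy from image latents only. The agent tries random
# actions, earns reward 1 for the right one, and fits linear Q-values.
best_action = (state[:, 0] > 0).astype(int)
actions = rng.integers(0, 2, size=N)
rewards = (actions == best_action).astype(float)
features = np.hstack([encode_image(image), np.ones((N, 1))])
q_weights = np.stack(
    [np.linalg.lstsq(features[actions == a], rewards[actions == a], rcond=None)[0]
     for a in range(2)],
    axis=1,
)


def act(latent):
    feats = np.hstack([latent, np.ones((len(latent), 1))])
    return np.argmax(feats @ q_weights, axis=1)


# Step 3: transfer. The same policy is driven by sound latents alone.
acc_image = np.mean(act(encode_image(image)) == best_action)
acc_sound = np.mean(act(encode_sound(sound)) == best_action)
print(f"image-driven accuracy: {acc_image:.3f}")
print(f"sound-driven accuracy: {acc_sound:.3f}")
```

The point of the design is that the policy never sees raw pixels or audio, only latent vectors, so any encoder that lands in the same latent space can drive it.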

The AVAEs+ DQN cross-modality transfer approach improved algorithm efficiency and enabled effective policy transfer between different modalities. The method considerably outperforms an untrained agent (RANDOM) and achieves performance comparable to that of the Sound DQN, an RL agent both trained and tested on the sound modality.

The Image DQN, trained and tested on the image modality, showed that images are the most informative perceptual input, although this agent cannot transfer its policy to another modality.

The paper Playing Games in the Dark: An approach for cross-modality transfer in reinforcement learning is on arXiv.