We just released a new version of the ML-Agents toolkit (v0.4), and one of the new features we are excited to share with everyone is the ability to train agents with an additional curiosity-based intrinsic reward.

Since there is a lot to unpack in this feature, I wanted to write an additional blog post on it. In essence, there is now an easy way to encourage agents to explore the environment more effectively when the rewards are infrequent and sparsely distributed. These agents can do this using a reward they give themselves based on how surprised they are about the outcome of their actions. In this post, I will explain how this new system works, and then show how we can use it to help our agent solve a task that would otherwise be much more difficult for a vanilla Reinforcement Learning (RL) algorithm to solve.

Curiosity-driven exploration

When it comes to Reinforcement Learning, the primary learning signal comes in the form of the reward: a scalar value provided to the agent after every decision it makes. This reward is typically provided by the environment itself and specified by the creator of the environment. These rewards often correspond to things like +1.0 for reaching the goal, -1.0 for dying, etc. We can think of rewards of this kind as being extrinsic because they come from outside the agent. If there are extrinsic rewards, then there must be intrinsic ones too. Rather than being provided by the environment, intrinsic rewards are generated by the agent itself based on some criteria. Of course, not just any intrinsic reward will do. We want intrinsic rewards which ultimately serve some purpose, such as changing the agent’s behavior so that it will get even greater extrinsic rewards in the future, or so that it will explore the world more than it might have otherwise. In humans and other mammals, the pursuit of these intrinsic rewards is often referred to as intrinsic motivation and is tied closely to our feelings of agency.

Researchers in the field of Reinforcement Learning have put a lot of thought into developing good systems for providing intrinsic rewards to agents, giving them motivation similar to what we find in nature’s agents. One popular approach is to endow the agent with a sense of curiosity and to reward it based on how surprised it is by the world around it. If you think about how a young baby learns about the world, it isn’t pursuing any specific goal, but rather playing and exploring for the novelty of the experience. You can say that the child is curious. The idea behind curiosity-driven exploration is to instill this kind of motivation into our agents. If the agent is rewarded for reaching states which are surprising to it, then it will learn strategies to explore the environment to find more and more surprising states. Along the way, the agent will hopefully also discover the extrinsic reward, such as a distant goal position in a maze, or a sparse resource on a landscape.

We chose to implement one such approach from a paper released last year by Deepak Pathak and his colleagues at Berkeley. It is called Curiosity-driven Exploration by Self-supervised Prediction, and you can read the paper here if you are interested in the full details. In the paper, the authors formulate the idea of curiosity in a clever and generalizable way. They propose to train two separate neural networks: a forward model and an inverse model. The inverse model is trained to take the current and next observations received by the agent, encode them both using a single encoder, and use the result to predict the action that was taken between the two observations. The forward model is then trained to take the encoded current observation and the action, and predict the encoded next observation. The difference between the predicted and real encodings is then used as the intrinsic reward and fed to the agent. A bigger difference means a bigger surprise, which in turn means a bigger intrinsic reward.
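To make the data flow described above concrete, here is a minimal NumPy sketch. It is not the toolkit's implementation: the layer sizes, the random untrained weights, the helper names, and the choice of squared error as the "difference" are all assumptions made for illustration, and no training step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for this illustration.
OBS_SIZE, ENC_SIZE, NUM_ACTIONS = 8, 4, 3

# Untrained stand-ins for the encoder, forward model, and inverse model.
W_enc = rng.normal(scale=0.5, size=(OBS_SIZE, ENC_SIZE))
W_fwd = rng.normal(scale=0.5, size=(ENC_SIZE + NUM_ACTIONS, ENC_SIZE))
W_inv = rng.normal(scale=0.5, size=(2 * ENC_SIZE, NUM_ACTIONS))


def encode(obs):
    """Single encoder shared by the current and next observations."""
    return np.tanh(obs @ W_enc)


def one_hot(action):
    vec = np.zeros(NUM_ACTIONS)
    vec[action] = 1.0
    return vec


def prediction_error(predicted, actual):
    """Squared error between two encodings (an assumed choice of 'difference')."""
    return 0.5 * float(np.sum((predicted - actual) ** 2))


def inverse_loss(obs, next_obs, action):
    """Cross-entropy for predicting the action taken between two observations."""
    logits = np.concatenate([encode(obs), encode(next_obs)]) @ W_inv
    logits = logits - logits.max()  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[action])


def intrinsic_reward(obs, next_obs, action):
    """Forward-model error in encoding space: bigger surprise, bigger reward."""
    predicted = np.concatenate([encode(obs), one_hot(action)]) @ W_fwd
    return prediction_error(predicted, encode(next_obs))


obs = rng.normal(size=OBS_SIZE)
next_obs = rng.normal(size=OBS_SIZE)
reward = intrinsic_reward(obs, next_obs, action=1)
loss = inverse_loss(obs, next_obs, action=1)
```

In a real setup both models would be trained alongside the policy, so transitions the forward model has learned to predict stop producing reward, while unfamiliar ones keep producing it.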

By using these two models together, the reward not only captures surprising things, but specifically captures surprising things that the agent has control over, based on its actions. Their approach allows an agent trained without any extrinsic rewards to make progress in Super Mario Bros simply based on its intrinsic reward. See below for a diagram from the paper outlining the process.

Pyramids environment

In order to test out curiosity, no ordinary environment will do. Most of the example environments we’ve released through v0.3 of the ML-Agents toolkit contain rewards which are relatively dense and would not benefit much from curiosity or other exploration enhancement methods. So to put our agent’s newfound curiosity to the test, we created a new sparse-reward environment called Pyramids. In it, there is only a single reward, and random exploration will rarely allow the agent to encounter it. In this environment, our agent takes the form of the familiar blue cube from some of our previous environments. The agent can move forward or backward and turn left or right, and it has access to a view of the surrounding world via a series of ray-casts from the front of the cube.

This agent is dropped into an enclosed space containing nine rooms. One of these rooms contains a randomly positioned switch, while the others contain randomly placed immovable stone pyramids. When the agent interacts with the switch by colliding with it, the switch turns from red to green. Along with this change of color, a pyramid of movable sand bricks is spawned randomly in one of the many rooms of the environment. On top of this pyramid is a single golden brick. When the agent collides with this brick, the agent receives +2 extrinsic reward. The trick is that there are no intermediate rewards for moving to new rooms, flipping the switch, or knocking over the tower. The agent has to learn to perform this sequence without any intermediate help.
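The reward structure described above can be sketched as a tiny state machine. This is a toy model of the reward logic only, not the Unity environment: the class name and event names are hypothetical, and it leaves out anything the description does not specify, such as time penalties or episode limits.

```python
class PyramidsRewardSketch:
    """Toy model of the sparse reward logic; names here are hypothetical."""

    GOAL_REWARD = 2.0

    def __init__(self):
        self.switch_on = False
        self.done = False

    def step(self, event):
        """Return the extrinsic reward for a single (hypothetical) event."""
        if self.done:
            return 0.0
        if event == "touch_switch":
            # The switch turns green and the sand pyramid spawns: no reward.
            self.switch_on = True
            return 0.0
        if event == "touch_gold_brick" and self.switch_on:
            self.done = True
            return self.GOAL_REWARD
        # Entering a new room, toppling the pyramid, etc.: no reward.
        return 0.0


env = PyramidsRewardSketch()
events = ["enter_room", "touch_switch", "topple_pyramid", "touch_gold_brick"]
rewards = [env.step(event) for event in events]
# rewards == [0.0, 0.0, 0.0, 2.0]
```

Every step before the final one returns zero, which is why an agent driven only by this signal gets no hint that it is making progress.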

Agents trained using vanilla Proximal Policy Optimization (PPO, our default RL algorithm in ML-Agents) on this task do poorly, often failing to do better than chance (an average reward of -1), even after 200,000 steps.

In contrast, agents trained with PPO and the curiosity-driven intrinsic reward consistently solve it within 200,000 steps, and often in half that time.

We also looked at agents trained with the intrinsic reward signal only. While they don’t learn to solve the task, they learn a qualitatively more interesting policy which enables them to move between multiple rooms, compared to the extrinsic-only policy, which has the agent moving in small circles within a single room.

Using Curiosity with PPO

If you’d like to use curiosity to help train agents in your environments, enabling it is easy. First, grab the latest ML-Agents toolkit release, then add the following line to the hyperparameter file of the brain you are interested in training: use_curiosity: true. From there, you can start the training process as usual. If you use TensorBoard, you will notice that there are now a few new metrics being tracked. These include the forward and inverse model loss, along with the cumulative intrinsic reward per episode.
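As a rough illustration, the relevant part of the hyperparameter file might look like the fragment below. The brain name is a placeholder, and the layout assumes that settings are grouped under the name of each brain; check the file shipped with your release for the exact structure.

```yaml
# "MyBrain" is a hypothetical name; use the name of the brain you are training.
MyBrain:
    use_curiosity: true
```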

Giving your agent curiosity won’t help in all situations. In particular, if your environment already contains a dense reward function, as in our Crawler and Walker environments, where a non-zero reward is received after most actions, you may not see much improvement. If your environment contains only sparse rewards, then adding intrinsic rewards has the potential to turn these tasks from unsolvable to easily solvable using Reinforcement Learning. This is particularly useful for tasks where the most natural rewards are simple ones, such as win/lose or completed/failed.

—

If you do use the Curiosity feature, I’d love to hear about your experience. Feel free to reach out to us on our GitHub issues page, or email us directly at ml-agents@unity3d.com. Happy training!