“Choose whoever looks the coolest” — that suggestion might or might not help your Chun-Li character top a tournament in the popular video game Street Fighter. What you really need to factor in are attributes such as health, power, mobility, techniques, and range — and how these will affect your fighter when facing different challenges. Now AI researchers have created a similar evaluation system to assess the relative performance of reinforcement learning (RL) systems across different tasks.

RL researchers have made some impressive breakthroughs in recent years, with agents beating human players at the video games DOTA 2 and StarCraft 2, and defeating the world champion at the age-old perfect-information board game Go. But not all RL agents are created equal, and it is not obvious which type of RL agent should be deployed in a given scenario. We know that a particular self-learning RL agent can master Go through self-play, but will it be any good at driving a car? Aside from tedious trial and error, there are no practical ways to answer this question.

A new DeepMind paper, Behaviour Suite for Reinforcement Learning (Bsuite), introduces a set of experiments designed to assess the core capabilities of RL agents and help researchers better understand their pros and cons across different applications.

DeepMind researchers designed a series of experiments to evaluate the critical capabilities of RL agents. The aim was to use clear, informative, and scalable problems to study core issues across different learning algorithms. In the bsuite experiments, researchers evaluated RL agents by observing their behaviour on these benchmark tasks.

Each bsuite experiment had three components:

1. Environments: a fixed set of environments determined by some parameters.

2. Interaction: a fixed regime of agent/environment interaction (e.g. 100 episodes).

3. Analysis: a fixed procedure that maps agent behaviour to results and plots.
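The three components above can be sketched as a small Python program. This is an illustrative toy, not bsuite's actual API: the function names (`make_environments`, `interact`, `analyse`), the random-walk agent, and the grid-size parameters are all made up for the sketch.

```python
# Minimal sketch of the three-part experiment structure described above.
# All names here are illustrative, not bsuite's real API.
import random

def make_environments():
    """1. Environments: a fixed set, parameterised here by grid size."""
    return [{"size": n, "goal": n - 1} for n in (5, 10, 15)]

def interact(env, episodes):
    """2. Interaction: a fixed regime (e.g. a set number of episodes).
    Returns per-episode rewards for a toy random-walk agent."""
    rewards = []
    for _ in range(episodes):
        pos = 0
        for _ in range(env["size"]):
            pos += random.choice((0, 1))
        rewards.append(1.0 if pos >= env["goal"] else 0.0)
    return rewards

def analyse(rewards):
    """3. Analysis: a fixed mapping from observed behaviour to a [0, 1] score."""
    return sum(rewards) / len(rewards)

scores = [analyse(interact(env, episodes=100)) for env in make_environments()]
```

Because each component is fixed, two different agents (or codebases) can be compared simply by swapping the agent into `interact` and reading off the resulting scores.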

To visualize the analysis results and track them across different algorithms and codebases, the researchers defined a summary score that maps performance on each task to the interval [0,1]. The experiments examined only how the agents behaved in the different environments, rather than analyzing their internal workings. Each experiment had five key qualities:

• Targeted: performance in this task corresponds to a key issue in RL.

• Simple: strips away confounding/confusing factors in research.

• Challenging: pushes agents beyond the normal range.

• Scalable: provides insight on scalability, not performance on one environment.

• Fast: iteration from launch to results in under 30min on standard CPU.

In their “Deep Sea” experiment, for example, researchers examined how an agent’s actions could position it to learn more effectively in future time-steps. The problem is implemented as an N×N grid with a one-hot encoding for state. They ran the agent on sizes N = 10, 12, …, 50 and compared average regret to optimal after 10k episodes. The summary score computes “the percentage of runs for which the average regret drops below 0.9 faster than the 2^N episodes expected by dithering.”
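A simplified version of such an environment can be written in a few lines. This sketch follows the general shape described above (an N×N grid, one-hot state, a single rewarding cell that a dithering agent finds only with probability 2^-N per episode); the exact reward constants and the official bsuite implementation may differ.

```python
# Hedged sketch of a Deep Sea-style environment. Details (reward scale,
# step mechanics) are simplified; bsuite's implementation may differ.
import numpy as np

class DeepSea:
    def __init__(self, size):
        self.size = size
        self.reset()

    def reset(self):
        self.row, self.col = 0, 0
        return self._obs()

    def _obs(self):
        # One-hot encoding of the agent's position on the N x N grid.
        obs = np.zeros((self.size, self.size))
        if self.row < self.size:
            obs[self.row, self.col] = 1.0
        return obs

    def step(self, action):  # action: 0 = left, 1 = right
        # Small cost for moving right, so greedy agents are tempted left.
        reward = -0.01 / self.size if action == 1 else 0.0
        if action == 1:
            self.col = min(self.col + 1, self.size - 1)
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1  # the agent descends one row every step
        done = self.row == self.size
        if done and self.col == self.size - 1:
            reward += 1.0  # treasure in the bottom-right corner
        return self._obs(), reward, done

# A dithering (uniform random) agent reaches the treasure with probability
# 2^-N per episode, which is where the 2^N-episode baseline comes from.
env = DeepSea(size=10)
obs, done, rng = env.reset(), False, np.random.default_rng(0)
while not done:
    obs, r, done = env.step(int(rng.integers(2)))
```

Only an agent that keeps choosing the locally costly "right" action ever discovers the treasure, which is why this task isolates deep exploration rather than raw reward-chasing.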

DeepMind researchers stress that bsuite is neither a replacement for grand AI challenges nor a leaderboard, but rather a collection of diagnostic experiments designed to provide insight into key aspects of agent behaviour. The team notes that just as the MNIST dataset helped advance computer vision technologies, bsuite “can highlight key aspects of agent scalability as a stepping stone to advance RL capabilities.”

Bsuite has been open-sourced on GitHub, and DeepMind is inviting the RL research community to incorporate additional experiments to benefit future studies in the field.

The paper Behaviour Suite for Reinforcement Learning is on arXiv.