For the last few years leading AI research company DeepMind has been playing a lot of games — taking on top human players in Go, chess, StarCraft, and other contests. But why? The “killer moves” researchers are making in these arenas are part of a much bigger challenge to push the limits of artificial intelligence agents and reinforcement learning (RL).

DeepMind last month announced its latest breakthrough: trained AI agents that can beat 99.8 percent of human players in the complicated real-time strategy video game StarCraft II. AlphaStar is the first system to achieve such performance in StarCraft II. More importantly, the achievement opens a promising path for general-purpose RL.

Doina Precup is a Research Team Lead at DeepMind. In her keynote speech at the recent RE•WORK Deep Learning Summit in Montréal, Precup discussed the latest developments in the field: “Reinforcement learning allows autonomous agents to learn how to act in a stochastic, unknown environment, with which they can interact.” Precup also explained how RL could be used to build knowledge bases for AI agents.

As a tool for building knowledge representations in AI agents, reinforcement learning aims to perform continual learning. Early scientific experiments on animal learning were among the inspirations that prompted later researchers to explore reinforcement learning in artificial intelligence. The mechanism behind RL is simple: let an agent interact freely with an environment, assigning positive rewards when the agent succeeds at a task and negative rewards for failed attempts.
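The interaction loop described above can be sketched in a few lines. This is a minimal illustration, not any DeepMind system: the one-dimensional “corridor” environment, its reward values, and all function names here are hypothetical.

```python
# Toy sketch of the RL loop: an agent acts in a hypothetical 1-D corridor,
# earning +1 for reaching the goal and -1 for stepping off the start edge.
def step(state, action):
    """Move left (-1) or right (+1); return (next_state, reward, done)."""
    next_state = state + action
    if next_state >= 5:   # success: reached the goal -> positive reward
        return next_state, +1.0, True
    if next_state < 0:    # failed attempt -> negative reward
        return next_state, -1.0, True
    return next_state, 0.0, False

def run_episode(policy, start=2):
    """Let the agent interact with the environment until the episode ends."""
    state, total, done = start, 0.0, False
    while not done:
        action = policy(state)
        state, reward, done = step(state, action)
        total += reward
    return total

# A trivial policy that always moves toward the goal collects the +1 reward.
always_right = lambda s: +1
print(run_episode(always_right))  # -> 1.0
```

A learning agent would adjust its policy from these reward signals over many episodes; the loop structure stays the same.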

What does RL have to offer AI? Precup provided four answers:

Growing knowledge and abilities in an environment

Learning efficiently from one stream of data

Reasoning at multiple levels of abstraction

Adapting quickly to new situations

Currently most researchers are focusing on building two types of knowledge with RL: procedural knowledge, such as skills and goal-driven behavior; and predictive and empirical knowledge, which is analogous to the laws of physics and able to predict the effects of actions.

Precup spoke of “options” as a way to encode procedural knowledge, using robot navigation as an example with an initiation set, a policy, and a termination condition. If there is no obstacle in front of the robot during its navigation (the initiation set), then it will go forward (the policy) until it gets too close to another object (the termination condition).
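The three components of an option map naturally onto a small data structure. The sketch below mirrors the robot-navigation example; the observation fields and thresholds are hypothetical, chosen only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable

# An option bundles an initiation set (where it may start), an internal
# policy, and a termination condition, as in Precup's navigation example.
@dataclass
class Option:
    can_initiate: Callable[[dict], bool]      # initiation-set membership test
    policy: Callable[[dict], str]             # action to take while active
    should_terminate: Callable[[dict], bool]  # when to hand control back

# "No obstacle ahead -> go forward until too close to another object."
go_forward = Option(
    can_initiate=lambda obs: not obs["obstacle_ahead"],
    policy=lambda obs: "forward",
    should_terminate=lambda obs: obs["distance_to_object"] < 0.5,
)

obs = {"obstacle_ahead": False, "distance_to_object": 3.0}
if go_forward.can_initiate(obs):
    while not go_forward.should_terminate(obs):
        action = go_forward.policy(obs)   # -> "forward"
        obs["distance_to_object"] -= 1.0  # simulated motion toward the object
print(obs["distance_to_object"])  # -> 0.0, below the 0.5 stopping threshold
```

A higher-level agent can then treat `go_forward` as a single temporally extended action, choosing among options rather than primitive motor commands.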

Precup said current research has made significant progress in learning options through gradients.

She closed her talk with the thought that assessing the capability of a lifelong learning agent should no longer involve just a single task. Precup argued that returns are essential yet too simplistic, while qualitative analysis of behaviors is interesting but very difficult. She suggested researchers could instead formulate a hypothesis about what an agent should know or how it should behave given specific knowledge, and design an experiment to test that hypothesis using continual learning systems.

Finally, although RL can be compute-heavy and time consuming, Precup stressed that researchers need to be patient and “let the agent continue training without tinkering with the task or the algorithm!”