We're releasing two new OpenAI Baselines implementations: ACKTR and A2C. A2C is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C) which we've found gives equal performance. ACKTR is a more sample-efficient reinforcement learning algorithm than TRPO and A2C, and requires only slightly more computation than A2C per update.

ACKTR can learn continuous control tasks, like moving a robotic arm to a target location, purely from low-resolution pixel inputs (left).

ACKTR (pronounced "actor") — Actor Critic using Kronecker-factored Trust Region — was developed by researchers at the University of Toronto and New York University, and we at OpenAI have collaborated with them to release a Baselines implementation. The authors use ACKTR to learn control policies for simulated robots (with pixels as input, and continuous action spaces) and Atari agents (with pixels as input and discrete action spaces).

ACKTR combines three distinct techniques: actor-critic methods, trust region optimization for more consistent improvement, and distributed Kronecker factorization to improve sample efficiency and scalability.

Sample and Computational Efficiency

For machine learning algorithms, two costs are important to consider: sample complexity and computational complexity. Sample complexity refers to the number of timesteps of interaction between the agent and its environment, and computational complexity refers to the amount of numerical operations that must be performed.

ACKTR has better sample complexity than first-order methods such as A2C because it takes a step in the natural gradient direction, rather than the gradient direction (or a rescaled version as in ADAM). The natural gradient gives us the direction in parameter space that achieves the largest (instantaneous) improvement in the objective per unit of change in the output distribution of the network, as measured using the KL-divergence. By limiting the KL divergence, we ensure that the new policy does not behave radically differently than the old one, which could cause a collapse in performance.

As for computational complexity, the KFAC update used by ACKTR is only 10-25% more expensive per update step than a standard gradient update. This contrasts with methods like TRPO (i.e, Hessian-free optimization), which requires a more expensive conjugate-gradient computation.

In the following video you can see comparisons at different timesteps between agents trained with ACKTR to solve the game Q-Bert and those trained with A2C. ACKTR agents get higher scores than ones trained with A2C.

Agents trained with ACKTR (right) attain higher scores in a shorter amount of time than those trained with other algorithms, such as A2C (left).

Baseline and Benchmarks

This release includes an OpenAI baseline release of ACKTR, as well as a release of A2C.

We're also publishing benchmarks that evaluate ACKTR against A2C, PPO and ACER on a range of tasks. In the following plot we show performance of ACKTR on 49 Atari games compared to other algorithm: A2C, PPO, ACER. The hyperparameters of ACKTR were tuned by the author of ACKTR solely on one game, Breakout.

ACKTR performance also scales well with batch size because it not only derives a gradient estimate from the information in each batch, but also uses the information to approximate the local curvature in the parameter space. This feature is particularly favorable for large scale distributed training, in which large batch sizes are used.



A2C and A3C

The Asynchronous Advantage Actor Critic method (A3C) has been very influential since the paper was published. The algorithm combines a few key ideas:

An updating scheme that operates on fixed-length segments of experience (say, 20 timesteps) and uses these segments to compute estimators of the returns and advantage function.

Architectures that share layers between the policy and value function.

Asynchronous updates.

After reading the paper, AI researchers wondered whether the asynchrony led to improved performance (e.g. "perhaps the added noise would provide some regularization or exploration?"), or if it was just an implementation detail that allowed for faster training with a CPU-based implementation.

As an alternative to the asynchronous implementation, researchers found you can write a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before performing an update, averaging over all of the actors. One advantage of this method is that it can more effectively use of GPUs, which perform best with large batch sizes. This algorithm is naturally called A2C, short for advantage actor critic. (This term has been used in several papers.)

Our synchronous A2C implementation performs better than our asynchronous implementations — we have not seen any evidence that the noise introduced by asynchrony provides any performance benefit. This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.

We have included code in Baselines for training feedforward convnets and LSTMs on the Atari benchmark using A2C.