A guest post by Daniel Salvadori

The deep learning revolution has been responsible for many recent advances and breakthroughs in fields ranging from computer vision to natural language processing. One field that has seen extraordinary growth is deep reinforcement learning. In 2013, DeepMind published “Playing Atari with Deep Reinforcement Learning”, in which their model learned to play Atari games just by watching the pixels on the screen. Three years later, AlphaGo beat the Go world champion, captivating audiences across the globe. More recently, AlphaZero dispensed with the need to learn from human matches, generalizing self-play learning to any perfect-information game and effectively becoming the world champion in Go, Chess and Shogi.

A Modern Framework

Huskarl is a new open-source framework for deep reinforcement learning focused on modularity and fast prototyping. It’s built on TensorFlow 2.0 and uses the tf.keras API where possible for conciseness and readability. Huskarl recently won first place in the #PoweredByTF 2.0 Challenge. Its goal is to allow researchers to easily implement, test, tune and compare deep-RL algorithms. Just as TensorFlow abstracts away the management of computational graphs, and Keras the creation of high-level models, Huskarl abstracts away the agent-environment interaction. This lets users focus on developing and understanding algorithms, while also preventing data leakage. Huskarl works seamlessly with OpenAI Gym, including its Atari environments. Below is the entire code necessary to create and visualize a DQN agent that learns to balance a cartpole:

The Huskarl DQN agent learning to balance a cartpole.

Several algorithms are currently implemented, comprising three tunable agents. The DQN agent implements Deep Q-Learning along with multiple enhancements such as variable-step traces, Double DQN, and an adjustable dueling architecture. DQN is an off-policy algorithm, and our implementation uses prioritized experience replay by default. The DQN agent operates on problems with a discrete action space. The A2C agent implements a synchronous, multi-step version of Advantage Actor-Critic, an on-policy algorithm. (For more on the difference between A2C and the better-known A3C, please refer to this blog post by OpenAI.) Huskarl allows on-policy algorithms like A2C to easily sample experience from multiple environment instances at once, which helps decorrelate the data into a more stationary process and thereby aids learning. Finally, the DDPG agent implements Deep Deterministic Policy Gradient with variable-step traces, and also uses prioritized experience replay by default. The DDPG agent operates on problems with continuous action spaces.
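The variable-step traces mentioned above are n-step returns: the agent accumulates n discounted rewards before bootstrapping from a value estimate, instead of bootstrapping after a single step. This is not Huskarl’s code, but the core computation can be illustrated in a few lines of plain Python:

```python
def nstep_return(rewards, values, t, n, gamma=0.99):
    """n-step target: G_t = sum_{k=0}^{n-1} gamma^k * r_{t+k} + gamma^n * V(s_{t+n}).

    rewards[k] is the reward received after step k; values[k] is the value
    estimate of state k (len(values) == len(rewards) + 1, with 0 for a
    terminal state). The trace truncates at the end of the trajectory.
    """
    end = min(t + n, len(rewards))
    # Discounted sum of up to n rewards starting at step t
    g = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    # Bootstrap from the value estimate of the state where the trace stops
    g += gamma ** (end - t) * values[end]
    return g
```

With n=1 this reduces to the usual one-step TD target; larger n trades lower bias for higher variance.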

The Huskarl DDPG agent learning to raise a pendulum.

Huskarl makes it easy to parallelize computation of environment dynamics across multiple CPU cores. This is useful for speeding up on-policy learning algorithms that benefit from multiple concurrent sources of experience, such as A2C or PPO. First, to use multiple environment instances simultaneously, just provide the desired number of instances to both the on-policy agent and to the simulation. Then, to spread the environment instances over multiple processes, which are automatically parallelized over the available CPU cores, simply provide the desired value for the max_subprocesses parameter when calling sim.train() as shown in the snippet below. Additionally, note how straightforward it is to use a different policy for each environment instance — just provide a list of policies instead of a single policy object:

The Huskarl A2C agent learning to balance a cartpole using 16 environment instances simultaneously. The thicker blue line shows the reward obtained with the greedy target policy. A Gaussian epsilon-greedy policy is used when acting in the other 15 environments, with the epsilon mean varying from 0 to 1.

It’s worth noting that some environments, like cartpole, are so computationally cheap that using multiple processes actually slows down training due to interprocess communication overhead. Only computationally expensive environments benefit from being spread across processes.
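This trade-off is easy to observe with the standard library alone. The sketch below (the helper names are hypothetical and unrelated to Huskarl) times a trivially cheap stand-in for an environment step both serially and through a process pool; chunksize=1 mimics the step-by-step round trips of a reinforcement learning loop:

```python
import multiprocessing as mp
import time

def cheap_step(i):
    """Stand-in for a trivially cheap environment step (e.g. cartpole physics)."""
    return i * i

def time_serial(n):
    """Run n steps in the current process, returning (results, seconds)."""
    t0 = time.perf_counter()
    results = [cheap_step(i) for i in range(n)]
    return results, time.perf_counter() - t0

def time_parallel(n, workers=4):
    """Run n steps through a process pool, one step per IPC round trip."""
    t0 = time.perf_counter()
    with mp.Pool(workers) as pool:
        results = pool.map(cheap_step, range(n), chunksize=1)
    return results, time.perf_counter() - t0

if __name__ == "__main__":
    _, t_serial = time_serial(5000)
    _, t_parallel = time_parallel(5000)
    # For steps this cheap, pickling and pipe traffic dominate, so the
    # "parallel" version is typically slower
    print(f"serial: {t_serial:.3f}s  4 processes: {t_parallel:.3f}s")
```

Once each step is expensive enough to outweigh the pickling and pipe traffic, the balance flips, and spreading instances over subprocesses pays off.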

In all implemented agents, the neural networks are provided by the user, since they depend on the problem specification. They can be as simple and shallow or as complex and deep as desired. Internally, the agents often add one or more layers to the provided networks in order to perform their function correctly. Moreover, all the algorithms take full advantage of custom Keras losses to be as fast and as concise as possible. There are currently three examples included, one for each agent. The examples use tiny fully connected networks to showcase what the agents can do even with simple models.
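For example, a user-supplied model can be a tiny fully connected trunk like the one below (a plain tf.keras sketch; the observation size of 4 is an assumption matching cartpole). The agent then appends whatever output layers it needs, such as Q-value or dueling heads for DQN:

```python
import tensorflow as tf
from tensorflow.keras.layers import Dense

# A small fully connected trunk; depth and width are entirely up to the user.
# The observation size (4) matches cartpole; adjust it for your environment.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    Dense(16, activation='relu'),
    Dense(16, activation='relu'),
    Dense(16, activation='relu'),
])
```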

What’s Next

We plan to implement more recent deep reinforcement learning algorithms, such as Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic Policy Gradient (TD3). We also plan to introduce intrinsic reward methods such as curiosity and empowerment. The idea is that users will be able to easily swap and combine the different components of deep-RL algorithms (such as experience replay, auxiliary rewards and proxy tasks) much like LEGO bricks. In the future we also plan to support multi-agent environments and Unity3D environments out of the box. Huskarl is under active development, and contributions are welcome!

Check out the GitHub repository for more details: https://github.com/danaugrs/huskarl