Open RL Benchmark by CleanRL (https://github.com/vwxyzjn/cleanrl) provides benchmarks of popular Deep Reinforcement Learning algorithms in 34+ games with a new level of transparency, openness, and reproducibility.

CleanRL is a library that provides high-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features. All of our implementations are benchmarked to ensure quality. We log all of our experiments with Weights and Biases so that you can check the following information:

- hyper-parameters (check the Overview tab of a run)
- training metrics, e.g. episode reward and training losses (check the Charts tab of a run)
- videos of the agents playing the game (check the Charts tab of a run)
- system metrics, e.g. CPU and memory utilization (check the Systems tab of a run)
- stdout and stderr of the script (check the Logs tab of a run)
- all dependencies (check requirements.txt at the Files tab of a run)
- source code (check the Code tab of a run; this is especially helpful since we have single-file implementations, so we know exactly all of the code responsible for the run)
- the exact commands to reproduce it (check the Overview tab of a run; currently not working, as public access is blocked by https://github.com/wandb/client/issues/1177)

Additionally, we package our library with Docker, which allows us to leverage AWS Batch to run thousands of experiments concurrently. This is a poor man's Google scale. A tutorial is coming with the 0.4.0 release.
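To illustrate the idea, a containerized experiment image might look like the following. This is a hypothetical sketch under stated assumptions (base image, paths, and the default command are all made up), not CleanRL's actual Dockerfile:

```dockerfile
# Hypothetical sketch of a containerized experiment image (not CleanRL's actual Dockerfile).
FROM python:3.8-slim

WORKDIR /cleanrl

# Install pinned dependencies so every AWS Batch job runs in an identical environment.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the single-file implementations into the image.
COPY . .

# Each Batch job overrides this command with the script and hyper-parameters it needs,
# e.g. `python ppo.py --gym-id BreakoutNoFrameskip-v4 --seed 2`.
CMD ["python", "ppo.py"]
```

Because each algorithm lives in a single file, a job override only needs to name the script and its flags, which makes fanning out thousands of jobs straightforward.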

Atari Results

| gym_id | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual |
| --- | --- | --- | --- | --- |
| BeamRiderNoFrameskip-v4 | 2936.93 ± 362.18 | 13380.67 ± 0.00 | 7139.11 ± 479.11 | 2053.08 ± 83.37 |
| QbertNoFrameskip-v4 | 3565.00 ± 690.00 | 16286.11 ± 0.00 | 11586.11 ± 0.00 | 17919.44 ± 383.33 |
| SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94 | 1099.72 ± 14.72 | 935.40 ± 93.17 | 1089.44 ± 67.22 |
| PongNoFrameskip-v4 | 19.06 ± 0.83 | 18.00 ± 0.00 | 19.78 ± 0.22 | 20.72 ± 0.28 |
| BreakoutNoFrameskip-v4 | 364.97 ± 58.36 | 386.10 ± 21.77 | 353.39 ± 30.61 | 380.67 ± 35.29 |
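Each table entry has the form mean ± standard deviation, presumably aggregated over independent runs with different random seeds. A minimal stdlib sketch of that aggregation (the per-seed returns below are made up for illustration):

```python
import statistics

# Hypothetical final episodic returns from three seeds of one algorithm/environment pair.
returns = [10.0, 12.0, 14.0]

mean = statistics.mean(returns)     # average over seeds
spread = statistics.stdev(returns)  # sample standard deviation over seeds

print(f"{mean:.2f} \u00b1 {spread:.2f}")  # formatted like a table entry: "12.00 ± 2.00"
```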

Mujoco Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| Reacher-v2 | -6.25 ± 0.54 | -6.65 ± 0.04 | -7.86 ± 1.47 |
| Pusher-v2 | -44.84 ± 5.54 | -59.69 ± 3.84 | -44.10 ± 6.49 |
| Thrower-v2 | -137.18 ± 47.98 | -80.75 ± 12.92 | -58.76 ± 1.42 |
| Striker-v2 | -193.43 ± 27.22 | -269.63 ± 22.14 | -112.03 ± 9.43 |
| InvertedPendulum-v2 | 1000.00 ± 0.00 | 443.33 ± 249.78 | 968.33 ± 31.67 |
| HalfCheetah-v2 | 10386.46 ± 265.09 | 9265.25 ± 1290.73 | 1717.42 ± 20.25 |
| Hopper-v2 | 1128.75 ± 9.61 | 3095.89 ± 590.92 | 2276.30 ± 418.94 |
| Swimmer-v2 | 114.93 ± 29.09 | 103.89 ± 30.72 | 111.74 ± 7.06 |
| Walker2d-v2 | 1946.23 ± 223.65 | 3059.69 ± 1014.05 | 3142.06 ± 1041.17 |
| Ant-v2 | 243.25 ± 129.70 | 5586.91 ± 476.27 | 2785.98 ± 1265.03 |
| Humanoid-v2 | 877.90 ± 3.46 | 6342.99 ± 247.26 | 786.83 ± 95.66 |

Pybullet Results

| gym_id | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
| --- | --- | --- | --- |
| MinitaurBulletEnv-v0 | -0.17 ± 0.02 | 7.73 ± 5.13 | 23.20 ± 2.23 |
| MinitaurBulletDuckEnv-v0 | -0.31 ± 0.03 | 0.88 ± 0.34 | 11.09 ± 1.50 |
| InvertedPendulumBulletEnv-v0 | 742.22 ± 47.33 | 1000.00 ± 0.00 | 1000.00 ± 0.00 |
| InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53 | 5085.57 ± 4272.17 | 6970.72 ± 2386.46 |
| Walker2DBulletEnv-v0 | 567.61 ± 15.01 | 2177.57 ± 65.49 | 1377.68 ± 51.96 |
| HalfCheetahBulletEnv-v0 | 2847.63 ± 212.31 | 2537.34 ± 347.20 | 2347.64 ± 51.56 |
| AntBulletEnv-v0 | 2094.62 ± 952.21 | 3253.93 ± 106.96 | 1775.50 ± 50.19 |
| HopperBulletEnv-v0 | 1262.70 ± 424.95 | 2271.89 ± 24.26 | 2311.20 ± 45.28 |
| HumanoidBulletEnv-v0 | -54.45 ± 13.99 | 937.37 ± 161.05 | 204.47 ± 1.00 |
| BipedalWalker-v3 | 66.01 ± 127.82 | 78.91 ± 232.51 | 272.08 ± 10.29 |
| LunarLanderContinuous-v2 | 162.96 ± 65.60 | 281.88 ± 0.91 | 215.27 ± 10.17 |
| Pendulum-v0 | -238.65 ± 14.13 | -345.29 ± 47.40 | -1255.62 ± 28.37 |
| MountainCarContinuous-v0 | -1.01 ± 0.01 | -1.12 ± 0.12 | 93.89 ± 0.06 |

Other Results

| gym_id | ppo | dqn |
| --- | --- | --- |
| CartPole-v1 | 500.00 ± 0.00 | 182.93 ± 47.82 |
| Acrobot-v1 | -80.10 ± 6.77 | -81.50 ± 4.72 |
| MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
| LunarLander-v2 | 46.18 ± 53.04 | 144.52 ± 1.75 |

All training curves

Benchmarked learning curves, metrics, logs, and recorded videos:

- Atari: cleanrl.benchmark/reports/Atari
- Mujoco: cleanrl.benchmark/reports/Mujoco
- Pybullet: cleanrl.benchmark/reports/PyBullet-and-Other-Continuous-Action-Tasks
- Classic Control: cleanrl.benchmark/reports/Classic-Control

- Experimental Domains: cleanrl.benchmark/reports/Others

Notes on specific experimental domains:

- Some of these are rather challenging continuous action tasks that usually require 100M+ timesteps to solve.
- SlimeVolley is a self-play environment from https://github.com/hardmaru/slimevolleygym, so its episode reward should not steadily increase. Check out the recorded videos for the agent's actual performance (i.e. go check out cleanrl.benchmark/reports/Others).
- The MicroRTS environment's goal is to build as many combat units as possible; see https://github.com/vwxyzjn/gym-microrts. These runs are created by https://github.com/vwxyzjn/gym-microrts/blob/master/experiments/ppo.py, which additionally implements invalid action masking and handles the multi-discrete action space for PPO.
- MontezumaRevengeNoFrameskip-v4 is an experimental run of PPO with RND (Random Network Distillation) by @yooceii; see vwxyzjn/cleanrl#25 and runs/j00qhu7d. We plan to officially include this run soon.
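The core idea behind invalid action masking is to set the logits of invalid actions to negative infinity before the softmax, so those actions receive exactly zero probability. A minimal stdlib sketch of that idea (not the gym-microrts implementation; the logits and mask are made up):

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits where invalid actions (valid[i] == False) get probability 0."""
    # Invalid actions get a logit of -inf, so exp(-inf) == 0.0 removes them entirely.
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Three actions; the second is invalid in the current state.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
print(probs)  # the invalid action's probability is exactly 0.0
```

Because the invalid action's probability is exactly zero, its gradient contribution through the policy loss is also zero, which is what makes this approach preferable to simply penalizing invalid actions.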

Experimental Domains