+------------------------------------------+--------------------------+
| PPO from Spinning Up (BipedalWalker-v2)  | 1000 Epochs (s +/- s.d.) |
+------------------------------------------+--------------------------+
| 3960x (40 threads)                       | 1448.4 +/- 28.7          |
+------------------------------------------+--------------------------+
| 3960x (20 threads)                       | 1504.6 +/- 16.9          |
+------------------------------------------+--------------------------+
| 3960x (10 threads)                       | 2737.5 +/- 23.3          |
+------------------------------------------+--------------------------+
| n1-highcpu-64 (40 threads)               | 2315.3 +/- 30.6          |
+------------------------------------------+--------------------------+

As we can see in the table above, the 3960x build is substantially faster than the n1-highcpu-64, even when running on half as many cores. In fact, dropping from 40 down to 20 cores barely slows PPO down at all. That suggests a serial computation bottleneck, probably related to PPO’s update overhead, which includes computing the KL divergence to stabilize policy changes. Presumably that means you could run 2 separate PPO experiments simultaneously in much less than twice the time required for a single experiment.

Benchmark 2: Synaptolysis

Evolutionary Algorithms are sometimes referred to as “embarrassingly parallel” because they are so amenable to parallel implementations. Unlike the policy gradient-based learning algorithm in the first benchmark, evolutionary algorithms carry less overhead because they don’t have to compute quantities like the KL divergence between policies. Generally speaking, they don’t even have to compute gradients! The only place a genetic algorithm really needs a single-threaded bottleneck is in sorting the agent policies by fitness and updating the population. There is an interesting and somewhat counter-intuitive trade-off that occurs fairly often in reinforcement learning: sample-efficient methods like PPO (and, to a greater extent, imitation learning, model-based RL, inverse RL, etc.) don’t need as much simulator experience to reach a good solution, but in terms of wall time (and hence compute and energy expenditure) simple algorithms like EAs often do better. Where this trade-off breaks down, and how the two families compete, seems to me like a productive area to study. The embarrassingly parallel pattern is sketched below.
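
To make that concrete, here is a minimal sketch (in the spirit of, but not copied from, my Synaptolysis code; the quadratic fitness function and population details are stand-ins): fitness evaluation fans out across a process pool, while sorting and population updates stay serial in the coordinator.

import numpy as np
from multiprocessing import Pool

def evaluate_fitness(params):
    # Stand-in for rolling out a policy in an environment; in practice
    # this is the expensive, embarrassingly parallel part.
    return -np.sum((params - 1.0) ** 2)

def evolve(pop_size=64, dim=32, generations=100, elite_frac=0.25, workers=8):
    rng = np.random.default_rng(13)
    population = rng.normal(size=(pop_size, dim))
    n_elite = int(pop_size * elite_frac)
    with Pool(workers) as pool:
        for gen in range(generations):
            # Parallel: workers evaluate individuals independently.
            fitness = np.array(pool.map(evaluate_fitness, population))
            # Serial bottleneck: sort by fitness, select, and mutate.
            elite = population[np.argsort(fitness)[::-1][:n_elite]]
            parents = elite[rng.integers(n_elite, size=pop_size)]
            population = parents + 0.1 * rng.normal(size=(pop_size, dim))
            population[:n_elite] = elite  # keep elites unchanged
    return elite[0]

if __name__ == "__main__":
    print(evolve()[:4])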

In any case, in this benchmark the performance differences between the self-built PC and the GCP cloud instance were even greater, despite the fact that this implementation can take advantage of all 64 threads on an n1-highcpu-64 VM.

+--------------------------+-----------------------+-----------------------+
| Evolving Pruned Networks | 200 Gen. (s +/- s.d.) | 400 Gen. (s +/- s.d.) |
+--------------------------+-----------------------+-----------------------+
| 3960x PC                 | 3491.7 +/- 251.8      | 7070.3 +/- 414.9      |
+--------------------------+-----------------------+-----------------------+
| n1-highcpu-64            | 6666.4 +/- 714.0      | -                     |
+--------------------------+-----------------------+-----------------------+

You’ll notice that the fitness metrics are quite different between the 3960x build and the cloud instance. The fitness results seemed to disfavor the cloud instance enough that I spent some time double-checking that the commits were identical and I hadn’t overlooked a last-minute change in the code somewhere. As it turns out, repeatable, deterministic pseudo-random number generation in multithreaded applications can be tricky, something I didn’t consider when setting up the training. The coordinator process initializes the population and handles all population updates, pulling from one random number generator initialized with the experiment’s seed. Each worker process, however, instantiates its own copy of the environment, and the different environments all pull random numbers for things like initialization states, and they may call their random number generators in any order. Small differences early on lead to big differences later, a sort of butterfly effect. It’s easy to overlook this lack of determinism in multithreaded pseudo-random number generation, and it probably comes into play more often than is immediately obvious in RL/EA. Looking more closely at the first benchmark, Spinning Up’s PPO is also non-deterministic despite identical seeds.
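
One way to restore determinism (a sketch of the general remedy, not what either codebase currently does) is to derive one independent generator per worker from the experiment seed, for example with numpy’s SeedSequence.spawn, so each worker’s stream is reproducible regardless of scheduling order:

import numpy as np

experiment_seed = 13
n_workers = 8

# Derive one reproducible stream per worker from the experiment seed.
# Each child stream is stable across runs, no matter in what order the
# workers happen to draw numbers.
child_seeds = np.random.SeedSequence(experiment_seed).spawn(n_workers)
worker_rngs = [np.random.default_rng(s) for s in child_seeds]

# Worker i always seeds its environment copy from worker_rngs[i], so
# the i-th environment sees the same initialization states every run.
init_states = [rng.normal(size=4) for rng in worker_rngs]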

We can see in the second run of the experiment on GCP that training can differ significantly between different runs of the same experiment definition.

Conclusion

Saving money by building a high-end desktop with a modern multicore processor doesn’t mean sacrificing performance on reinforcement learning and evolutionary algorithms. In fact, for the benchmarks investigated here, comparable cloud compute can take more than 190% as long for some experiments. The benchmarking above was entirely a CPUs-to-vCPUs comparison: my pruned-network evolution algorithm is written entirely in numpy, and I didn’t find anywhere in Spinning Up’s new PyTorch backend that moves data to the GPU. Based on the benchmarks and cloud comparisons I’ve seen before, I would expect a CPU/GPU-dependent training algorithm to widen the gap in performance-per-dollar even further, if anything. In future projects I will investigate how best to leverage CPU and GPU (or other accelerators) together for more realistic RL problems.

OpenAI’s cloud bill as a non-profit was more than $7.9 million in 2017 according to their 990 tax form, and you can bet they’re not paying retail prices. From news like that it would be easy to conclude RL is only open to the deep-pocketed elite, but I hope this post will encourage a person or two somewhere, or maybe even a small team, to realize that you don’t necessarily need a huge cloud compute budget to learn and do good work in RL/EA. And, as I mentioned earlier, if anyone runs these benchmarks on their own build I’d be interested to read about the results. I’m especially curious as to how a “Baby Threadripper” 3950x CPU system stacks up. The 3950x is spec’ed a lot slimmer than the 3960x, but at half the price, and given the performance advantage of the 3960x over the n1-highcpu-64, I’d be surprised if it didn’t offer good value against general-use cloud VMs.

Addendum: Spinning Up with all the Cores

Earlier I mentioned that I was having trouble utilizing more than 40 cores for a PPO training run with 40,000 steps. Taking some advice from Spinning Up’s author/maintainer, I was later able to correct the problem and train with all 64 threads on n1-highcpu-64 and 48 threads on the home build. If your numerology is good you’ll notice the relationship between 40 cores and 40k steps, and that’s a pretty telling clue to the solution. As it turns out, Spinning Up’s PPO runs into problems when any worker fails to finish at least one episode, and an episode is about 1000 steps for many gym environments, including BipedalWalker-v2. Bumping the number of steps up to at least 1000 for each thread you want to utilize fixes the problem:

python -m spinup.run ppo --hid "[32,32]" --env BipedalWalker-v2 --exp_name gcp_benchmark_1ksteps --gamma 0.999 --epochs 1000 --steps_per_epoch 64000 --seed 13 42 1337 --cpu 64
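
For the record, the constraint is just arithmetic. Here’s a throwaway helper (hypothetical, not part of Spinning Up) that makes the relationship explicit, using the roughly 1000-step episode length mentioned above:

def min_steps_per_epoch(num_cpus, approx_ep_len=1000):
    # Every worker needs at least one full episode per epoch.
    return num_cpus * approx_ep_len

print(min_steps_per_epoch(64))  # 64000, matching the command above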

As you might surmise from the finish times for the original experiments with 40 and 20 threads on the Threadripper, it’s more efficient to give each thread a few more steps to manage in each rollout. This lets us gather an enormous rollout, perhaps more than 100k total steps per epoch, sampling more episodes in less time. Giving each worker several thousand steps per epoch means using more total environment interactions, but the training algorithm moves through those samples much more quickly, so you can end up solving the task sooner in wall time. On the other hand, more steps per epoch may stabilize training in a way that wastes PPO’s sample-efficiency enhancements. PPO updates the policy multiple times on each rollout until it reaches a threshold KL divergence (i.e. the policy is entering unknown, potentially catastrophic territory), and rolling out a huge number of steps per epoch often leads to cutting off updates early due to a high KL. The early-stopping mechanism looks roughly like the sketch below.
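
This is a condensed paraphrase of that mechanism, not the verbatim code (see spinup/algos/pytorch/ppo/ppo.py for the real loop; the MPI averaging of gradients and KL is omitted here for clarity):

for i in range(train_pi_iters):
    pi_optimizer.zero_grad()
    loss_pi, pi_info = compute_loss_pi(data)  # clipped surrogate loss + diagnostics
    if pi_info['kl'] > 1.5 * target_kl:
        # The updated policy has drifted too far from the policy that
        # collected the rollout, so stop updating on this batch.
        break
    loss_pi.backward()
    pi_optimizer.step()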

With all that in mind, I also used an n1-highcpu-64 instance for a training run with 2000 steps per thread per epoch and 250 epochs (half the total environment interactions of the original), and 2k steps per thread per epoch on the 3960x (keeping the total environment interactions the same).

python -m spinup.run ppo --hid "[32,32]" --env BipedalWalker-v2 --exp_name gcp_benchmark_1ksteps --gamma 0.999 --epochs 250 --steps_per_epoch 128000 --seed 13 42 1337 --cpu 64

and for the Threadripper:

python -m spinup.run ppo --hid "[32,32]" --env BipedalWalker-v2 --exp_name gcp_benchmark_1ksteps --gamma 0.999 --epochs 666 --steps_per_epoch 96000 --seed 13 42 1337 --cpu 48

+-------------------------+----------------------------+----------------------------+
| Full-thread PPO         | Full training (s +/- s.d.) | Half training (s +/- s.d.) |
+-------------------------+----------------------------+----------------------------+
| 3960x (64k steps/epoch) | 1912.2 +/- 76.4            | -                          |
| 3960x (96k steps/epoch) | 1914.5 +/- 54.3            | -                          |
| gcp (64k steps/epoch)   | 2703.4 +/- 26.6            | -                          |
| gcp (128k steps/epoch)  | -                          | 1530.8 +/- 3.9             |
+-------------------------+----------------------------+----------------------------+

With better thread utilization the GCP VM comes much closer to matching the 3960x’s finish time (though it’s still slower). It turned out that in this case 2k steps per worker per epoch is no faster than 1k, and for the n1-highcpu-64 cloud VM it’s actually a bit slower. The latter effect is probably because the cloud VM’s lower single-thread performance tightens the serial bottleneck relative to the Threadripper, in keeping with Amdahl’s law.
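
As a rough numeric illustration of that argument (the parallel fraction below is illustrative, not measured), Amdahl’s law says the speedup on N workers is 1 / ((1 - p) + p / N) when a fraction p of the work parallelizes perfectly:

def amdahl_speedup(p, n):
    # p: parallelizable fraction of the work, n: number of workers
    return 1.0 / ((1.0 - p) + p / n)

# With, say, 95% of an epoch parallelizable, doubling from 20 to 40
# workers helps surprisingly little -- consistent with the small gap
# between the 20- and 40-thread PPO runs in the first benchmark.
for n in (10, 20, 40, 64):
    print(n, round(amdahl_speedup(0.95, n), 2))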

I’ll add a few additional thoughts on these benchmarks. It probably would have been fairer to compare the Threadripper build’s performance to that of a virtual machine of GCP’s C2 instance type. C2 has a higher clock rate (base rate of 3.8 GHz) better suited to high-performance computing tasks and costs about 33% more than comparable n1 instances. However, GCP rejected my request to use C2 instances, so for now these benchmarks will stand. I should also mention that I spent between 3% and 4% of the total build budget on cloud compute, just for the sake of making the comparisons in this article. Running a pre-defined, formal experiment is one thing, but it’s a lot more comfortable iterating your projects up to signs of life when you aren’t paying cloud compute premiums to do so.

Build Guides and Resources for Building Your Own Deep Learning PC

There have been a lot of good blog posts about building deep learning computers with consumer hardware, and I’ve found their build notes, install guides, and cloud cost breakdowns useful over the years. A few of them are linked below, with my thanks to the authors. Not that my thanks are linked below; that’s more of an abstract concept not amenable to hypertext, but I’m sure you get the idea. Also thanks to you for reading. As a reward for reading the whole thing, the part you’ve all been waiting for is coming up next: bloopers.

Bloopers

Most mutations do not improve an organism’s fitness, and these genetic algorithm training excerpts are no exception.