Days 11–14 of the OpenAI Retro Contest

Digging into the PPO2 baseline code.

It didn’t take long for my eyes to start glazing over when trying to go through the Rainbow DQN baseline implementation for the OpenAI Retro Contest. The PPO2 code seems a bit easier to understand, so it looks like a better stepping stone on my journey toward understanding AI retro gaming.

I felt like my line-by-line approach was part of what did me in on the Rainbow code, so for this baseline code I am going to focus on the execution path, and less on things like import statements.

Background Reading

I started off by reading through the OpenAI paper on Proximal Policy Optimization Algorithms. The paper is pretty equation heavy with things like gradient estimators that seem more useful for theory than practice.

It’s pronounced g-hat right?

In fact, the equations seem to go on for the entire paper, so I set out to find something better to teach me what policy gradients were all about. Scholarpedia gave a definition of Policy Gradient Methods:

Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent.

The first thing I had to look up in that definition was gradient descent, but that seems pretty easy to understand: it is a method for reaching optimal values by repeatedly stepping in the steepest direction. To learn more about what a “policy” means in this context, I found these slides (and video) from David Silver’s course at University College London/Google. Policies seem to be generalized rules for obtaining rewards in a given system, and gradient descent is the strategy for finding the best policies/parameters most quickly.
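To make gradient descent a bit more concrete for myself, here is a minimal sketch on a toy function. Nothing here comes from the baseline code: the function, starting point, and learning rate are all made up for illustration. Since policy gradient methods want to maximize return, they step along the gradient rather than against it, but the idea is the same with the sign flipped.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient to shrink the function value."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * grad(x)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
minimum = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(minimum)  # lands very close to 3, the bottom of the bowl
```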

The Code

Now that I kind of know what the code is supposed to be doing, I am eager to see what it looks like, especially around policy discovery/creation. The main function in ppo2-agent.py is pretty concise:

"""Run PPO until the environment throws an exception.""" I am guessing that these exceptions could take the form of a traditional code exception or error, some kind of maximum number of runs, or perhaps even a model that is deemed “successful enough”.

config = tf.ConfigProto() From skimming the definition, this appears to configure how TensorFlow uses the current system hardware, and I imagine that it is a common piece of boilerplate for most TensorFlow code.

config.gpu_options.allow_growth = True Here the above config is overridden (or perhaps ensured to be set) to allow GPU memory to be allocated progressively over time instead of all at once in the beginning. I might have to configure these differently to run locally on my MacBook’s CPU.

with tf.Session(config=config): I wasn’t that familiar with the with statement, but this handy answer on StackOverflow cleared things up for me. It seems like a really useful language feature. Here, TensorFlow’s __enter__() code runs first, then the indented code, and when the indented code is finished TensorFlow’s __exit__() function runs. I think that code is here, and it mainly deals with establishing a session.
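To convince myself of the order in which things run, here is a small sketch of the with mechanics. FakeSession is a hypothetical stand-in that I made up; it is not TensorFlow’s session class and only records when each hook fires.

```python
class FakeSession:
    """Hypothetical stand-in for a session object; not TensorFlow's class."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        # Runs first, before the indented block.
        self.events.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs last, even if the indented block raised an exception.
        self.events.append("exit")
        return False  # do not swallow exceptions


session = FakeSession()
with session:
    session.events.append("body")

print(session.events)
```

The list comes out in the order enter, body, exit, which matches the description above.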

The rest of the code is configuration for one method called ppo2.learn() and all of the different arguments it takes. From the import statements at the top, the source code for the function signature can be found here. I’ll go through the arguments to try and understand what is going on.

policy=policies.CnnPolicy This is another piece of code mentioned in the imports from the same repo as the rest of the ppo2 implementation. Something to note about this object is the step() and value() functions. I imagine that they are necessary for the training as the other policies incorporate them as well.

env=DummyVecEnv([make_env]) Still from the same baseline code. I couldn’t find anything in OpenAI’s PPO paper that explained what “VecEnv” might stand for, so it must be more general (but it surely seems like VectorEnvironment). Looking back at the definition for ppo2.learn(), I came to the conclusion that it must be an “environment” that contains an “observation space” and an “action space”. Is this the same as the game environments that I have been using to watch and run the Sonic ROM? I took a peek at sonic_util.py, since that is where make_env comes from, and at the rest of the code for VecEnv, and discovered that it does indeed appear to be a “Vectorized Environment”, and that this dummy version initializes most of the data as a bunch of zeros.
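As I understand the idea, a vectorized environment wraps a list of environment-building functions and steps all of the environments together. Below is a rough sketch of that idea only. CountingEnv and ToyVecEnv are names I invented for illustration; they are not the baselines classes and they skip things like observation and action spaces.

```python
class CountingEnv:
    """Hypothetical toy environment: the observation is a running total."""

    def reset(self):
        self.count = 0
        return self.count

    def step(self, action):
        self.count += action
        reward = float(action)
        done = self.count >= 3
        return self.count, reward, done


class ToyVecEnv:
    """Steps a list of environments one after another in the same process."""

    def __init__(self, env_fns):
        # Each entry is a function (or class) that builds one environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        results = [env.step(a) for env, a in zip(self.envs, actions)]
        # Regroup per-environment tuples into one list per field.
        observations, rewards, dones = (list(col) for col in zip(*results))
        return observations, rewards, dones


vec_env = ToyVecEnv([CountingEnv, CountingEnv])
first_obs = vec_env.reset()
obs, rewards, dones = vec_env.step([1, 3])
print(first_obs, obs, rewards, dones)
```

With a single-element list, as in DummyVecEnv([make_env]), the same shape of code would just carry one environment.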

nsteps=4096 This code seems the most obvious yet! From this line, nsteps determines how many times the model runs its step() function, which appears to be how many times the policy’s step() function (in our case policies.CnnPolicy.step()) gets run.

nminibatches=8 Reading into the code, there seems to be some interplay between the number of steps, batches, and environments created. nminibatches seems to set how many pieces each batch of collected steps is split into for training, and it must divide evenly into the number of environments multiplied by nsteps.
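Working through the arithmetic with the values from this agent helped me. This is only a sketch of my reading of the code: I am assuming one environment because DummyVecEnv is given a single make_env, and the variable names are just labels I picked for the quantities.

```python
nenvs = 1          # assumption: DummyVecEnv([make_env]) wraps one environment
nsteps = 4096
nminibatches = 8

nbatch = nenvs * nsteps                 # steps collected per update
assert nbatch % nminibatches == 0       # the even-division requirement
nbatch_train = nbatch // nminibatches   # steps in each minibatch

print(nbatch, nbatch_train)  # 4096 512
```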

example hyperparameters from OpenAI’s paper

lam=0.95 I started out thinking that this value was the learning rate, until I saw the lr parameter further down. The paper refers to it as the GAE parameter, but I think the best description I found was from Tom Breloff:

The hyperparameter γ allows us to control our trust in the value estimation, while the hyperparameter λ allows us to assign more credit to recent actions.

gamma=0.99 Tom’s summary seems like a good explanation for gamma as well. I’ll have to play around with these to see what impact they have.
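To see how gamma and lam interact, here is a minimal sketch of generalized advantage estimation over one short trajectory. It is my own simplified version, not the baseline’s code: it assumes the episode does not end partway through, and the rewards and value estimates are made-up numbers.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory with no episode boundaries."""
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: how much better things went than predicted.
        delta = rewards[t] + gamma * next_value - values[t]
        # lam controls how much of the later TD errors is mixed in.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]
one_step = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=0.0)
full_return = gae_advantages(rewards, values, 0.0, gamma=1.0, lam=1.0)
print(one_step, full_return)
```

With lam at 0 each step only sees its own TD error, while with lam at 1 the first step is credited with everything that follows.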

noptepochs=3 Based on this line, I’m guessing that this is the number of epochs, where each epoch is some kind of larger iteration over the collected steps. It doesn’t seem to be as linked to other variables as the number of steps.

log_interval=1 Only found in one line, it looks like this controls how many updates to make before logging an output.

ent_coef=0.01 Looking here, this appears to be a factor that is multiplied by the entropy, a value computed in TensorFlow from the policy’s action distribution. I’m guessing that it affects how much randomness goes into the actions taken in the game.
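Here is a small sketch of how I picture the entropy term working. The function names are mine, and the value-loss weight of 0.5 is an assumption I put in for illustration. The point is that a more spread-out action distribution has higher entropy, and subtracting ent_coef times the entropy lowers the loss, which rewards the policy for staying a little random.

```python
import math


def entropy(probs):
    """Entropy of a discrete action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def total_loss(policy_loss, value_loss, probs, ent_coef=0.01, vf_coef=0.5):
    # vf_coef=0.5 is an assumed value used only for this illustration.
    return policy_loss - ent_coef * entropy(probs) + vf_coef * value_loss


uniform = [0.25, 0.25, 0.25, 0.25]   # undecided policy
peaked = [1.0, 0.0, 0.0, 0.0]        # fully committed policy

print(entropy(uniform), entropy(peaked))
print(total_loss(1.0, 1.0, uniform), total_loss(1.0, 1.0, peaked))
```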

lr=lambda _: 2e-4 This is surely the learning rate. It is interesting that it is represented as a function instead of the value directly; perhaps that offers more flexibility for dynamic learning rates?
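A sketch of why a function might be handy here: the caller can pass in some measure of training progress and get back a rate. I am assuming the argument is the fraction of training remaining (1.0 at the start, heading toward 0.0); the helper names below are hypothetical and not part of the agent.

```python
def constant(value):
    """Schedule that ignores progress, like lr=lambda _: 2e-4."""
    return lambda _: value


def linear_decay(initial):
    """Schedule that shrinks in proportion to the remaining fraction."""
    return lambda frac: initial * frac


fixed_lr = constant(2e-4)
decaying_lr = linear_decay(2e-4)

for frac in (1.0, 0.5, 0.1):
    print(frac, fixed_lr(frac), decaying_lr(frac))
```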

cliprange=lambda _: 0.1 Another function. The clip range limits how far the probability ratio between the new and old policy can move away from 1 in the objective (the ε in the paper’s clipped objective), rather than limiting the gradient itself.
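The clipped objective from the paper can be sketched for a single sample like this. It is a plain-Python illustration with made-up ratios and advantages, not the TensorFlow code from the baseline.

```python
def clipped_objective(ratio, advantage, cliprange=0.1):
    """Per-sample clipped surrogate objective from the PPO paper."""
    clipped_ratio = max(1.0 - cliprange, min(ratio, 1.0 + cliprange))
    # Take the more pessimistic of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action that the new policy already favors much more: capped.
print(clipped_objective(ratio=1.5, advantage=1.0))
# A bad action that the new policy already avoids much more: capped too.
print(clipped_objective(ratio=0.5, advantage=-1.0))
# No change in the policy: the objective is just the advantage.
print(clipped_objective(ratio=1.0, advantage=2.0))
```

Once the ratio leaves the range from 0.9 to 1.1 in the helpful direction, the objective stops improving, so there is no incentive to push the policy further in one update.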

total_timesteps=int(1e7) This looks like the total number of environment steps the network trains on, and it is the dividend when working out the number of batches (updates).
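As a quick check on that division, here is the calculation with this agent’s values. It is a sketch of my reading, again assuming a single environment, with variable names I chose for the quantities.

```python
total_timesteps = int(1e7)
nenvs, nsteps = 1, 4096   # assumption: one environment in the DummyVecEnv

nbatch = nenvs * nsteps
nupdates = total_timesteps // nbatch            # whole batches that fit
leftover = total_timesteps - nupdates * nbatch  # steps that do not fill a batch

print(nupdates, leftover)  # 2441 1664
```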

Well, that was quite the research into the function signature and the rest of the PPO2 agent code. I haven’t been able to get the code up and running yet (I’m on some slow wifi and can’t build my Docker containers). Next up for team Bobcats, I think we will try to get this agent up and running and then try out some hyperparameter tuning.