And here’s the longer version:

It is 2016. Deep learning is everywhere. Image recognition is considered more or less solved thanks to convolutional neural networks, and my research interests gravitate towards neural networks with memory and reinforcement learning.

Specifically, a paper published by Google DeepMind showed that it is possible to achieve human-level or even superhuman performance on a variety of Atari 2600 (a home game console released in 1977) games using a simple reinforcement learning algorithm called Deep Q-Network (DQN), all by just observing the gameplay. That caught my attention.

One of the Atari 2600 games, Breakout. Trained using a simple RL algorithm. After millions of iterations, the computer agent plays at a super-human level.

I started running experiments on Atari 2600 games. As impressive as it is, Breakout is not a complex game. One could define complexity by the degree of difficulty in connecting your actions (joystick) to results (score). A problem arises when one needs to wait a long time until an effect can be observed.

An illustration of the problem with more complex games. LEFT: Breakout (ATARI 2600) is very reactive; we get feedback very quickly. RIGHT: Mario Land (NINTENDO GAMEBOY) does not provide immediate information about the consequences of an action; there can be stretches of irrelevant observations between important events.

In order to make learning more efficient, you could imagine transferring some knowledge from simpler games. This remains unsolved and is a hot research topic. A recently published challenge by OpenAI tried to measure just that:

Not only would having transfer learning capability make training faster, but I would even argue that some problems cannot be solved unless there is some prior knowledge present. What you need is data efficiency. Let’s take a look at Prince of Persia:

A short clip from Prince of Persia

There is no obvious score

If no action is performed it takes 60 minutes to end the game (58 minutes left in the animation).

Random actions

Could you try the exact same approach as in the Atari 2600 paper? If you think about it, how likely is it that you'll get to the end by pressing keys at random?

This motivated me to contribute to the community by addressing this problem. So, in essence, we have a chicken-and-egg problem: you need a better algorithm which allows you to transfer knowledge, but that requires research, and experiments take a long time because we don't have a more efficient algorithm.

Transfer learning example: imagine first learning a very simple game, such as the one on the left-hand side; then you preserve concepts such as 'race', 'car', 'track', and 'win', and learn colors or 3D models. We say that the common concepts are 'transferable' between those games. One could define similarity by the amount of transferable knowledge between two problems; e.g., Tetris and F1 racing will not be similar.

I decided to do the next best thing: avoid the initial slowdown by making the environments much, much faster. My goals were to have:

A faster environment (imagine you could finish Prince of Persia in 1/100th of the time) that can run 100,000 games at the same time

A better environment for research (focus on the problems, not on preprocessing; have a variety of games)

Initially, I thought that the performance bottleneck might be to some extent related to the complexity of the emulator code (for example, Stella's codebase is quite big, plus it relies on C++ abstractions which are not the best choice when it comes to emulators).

Consoles

Arcade Space Invaders

In total, I worked on several platforms: the arcade Space Invaders (probably one of the first games ever, along with Pong), the Atari 2600, the NES, and the Gameboy. All of it written in C.

The maximum frame rate I could observe was about 2000–3000 FPS. To start seeing results of experiments, we need millions or billions of frames, so the gap was huge.

My Space Invaders running inside an FPGA in low-speed debug mode; the counter in the FPGA shows the clock cycles elapsed

I thought: what if we could hardware-accelerate those environments? For example, the original Space Invaders ran on an 8080 CPU operating at 1MHz. I was able to emulate a 40MHz 8080 CPU on a 3GHz Xeon. Not bad, but as a proof of concept, once put inside an FPGA, this went up to 400MHz. This means 24,000 FPS from a single instance, the equivalent of a 30GHz Xeon CPU! Did I mention that you can fit around 100 8080 CPUs in a mid-tier FPGA? That's 2.4M FPS.
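For the curious, here is the back-of-the-envelope arithmetic behind those numbers, assuming the frame rate scales linearly with the clock speed:

```c
/* Space Invaders draws 60 frames per second on a 1MHz 8080,
   i.e. ~16,667 CPU cycles per frame, so FPS scales with the clock. */
#include <stdio.h>

int main(void) {
    const double base_hz = 1e6, base_fps = 60.0;
    const double cycles_per_frame = base_hz / base_fps; /* ~16,667 */
    double fpga_fps = 400e6 / cycles_per_frame;         /* ~24,000 */
    printf("1 core  @ 400MHz: %.0f FPS\n", fpga_fps);
    printf("100 cores       : %.1fM FPS\n", 100.0 * fpga_fps / 1e6); /* 2.4M */
    return 0;
}
```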

Hardware-accelerated Space Invaders, 100MHz, 1/4 of the full speed

Over one hundred tiny cores inside one Xilinx Kintex 7045 FPGA (bright colors, the blue patch in the middle is shared logic for display)

Irregular Execution Path

At this point, you might ask, what about GPUs? The short answer is you need MIMD parallelism, not SIMD. Back in my student years, I spent some time working on a GPU implementation of Monte Carlo Tree Search (MCTS was used in AlphaGo).

http://olab.is.s.u-tokyo.ac.jp/~kamil.rocki/rocki_ipdps11.pdf

At that time I spent countless hours trying to make GPUs and other kinds of SIMD hardware (IBM Cell, Xeon Phi, AVX CPUs) run this kind of code efficiently, and failed. A few years ago I started to believe that it would be really nice to be able to design my own piece of hardware, aimed at RL-type problems.

Multiple Instruction Multiple Data (MIMD) Parallelism
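To make the divergence problem concrete, here is a minimal sketch of what every emulator core spends its time doing: a fetch-decode-execute loop over a few real 8080 opcodes (the `cpu_t` layout is illustrative). Each instance takes a different, data-dependent path through the switch, so thousands of them cannot march in SIMD lockstep without serializing on every divergent branch.

```c
#include <stdint.h>

typedef struct { uint16_t pc; uint8_t a; uint8_t mem[65536]; } cpu_t;

static void step(cpu_t *c) {
    uint8_t op = c->mem[c->pc++];   /* fetch */
    switch (op) {                   /* decode: the path depends on the data */
    case 0x00: /* NOP   */ break;
    case 0x3C: /* INR A */ c->a++; break;
    case 0xC3: /* JMP   */ c->pc = c->mem[c->pc] | (c->mem[c->pc + 1] << 8); break;
    /* ... hundreds more cases, each a different execution path ... */
    default:   /* unimplemented opcode */ break;
    }
}
```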

ATARI 2600, NES or GAMEBOY?

In total, I implemented the 8080 with Space Invaders, the NES, the 2600, and the Gameboy. Here are some facts about them and their individual advantages.

NES Pac-Man

Space Invaders was just a warmup. We got it to work, but it's just that one game, so it is not very useful on its own.

The Atari 2600 is the de facto standard in reinforcement learning research. Its CPU (MOS 6507) is a simplified version of the famous 6502, more elegant in design and more efficient than the 8080. Still, I did not choose the 2600, because I think there are certain limitations regarding its games and graphics.

I implemented the NES (Nintendo Entertainment System) as well; it shares its CPU with the 2600. The games are much, much better than the ones on the 2600. However, both the NES and the 2600 suffer from an overcomplicated graphics pipeline and from the many cartridge formats which need to be supported.

In the meantime, I rediscovered the Nintendo Gameboy. This was what I had been looking for.

Why is the Gameboy so awesome?

1049 Classic games + 576 for Gameboy Color

Over 1000 games in total (Classic + Color), covering a very wide range, all high quality, some of them very challenging (Prince of Persia). The games can be grouped and assigned a difficulty for research on transfer learning and curriculum learning (for example, there are variants of Tetris, racing games, and Mario titles). Solving Prince of Persia may require transferring knowledge from some other similar game which has an explicit score (Prince of Persia doesn't!).

The Nintendo Gameboy is my transfer learning research platform of choice. In this chart I tried to group the games and plot them according to their difficulty (subjective judgement) and their similarity (concepts such as racing, jumping, shooting, the variety of Tetris games; has anyone ever played HATRIS?).

The classic Gameboy has a very simple screen (160 by 144 pixels, 2-bit color), which makes preprocessing simpler; we can focus on the important things. In the case of the 2600, even simple games use various colors. In addition, the Gameboy has a much better way of displaying objects, so no deflickering or taking the max over a few consecutive frames is required.
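To give a sense of how little preprocessing is needed: at 2 bits per pixel, a full frame is only 160*144/4 = 5760 bytes, and expanding it to one byte per pixel for a network input is a few lines of C. A minimal sketch, assuming the emulator exposes the frame as a packed linear 2bpp buffer (the real hardware is tile-based, and the names here are illustrative):

```c
#include <stdint.h>

#define GB_W 160
#define GB_H 144

/* packed: 4 pixels per byte, 2 bits each; out: 1 byte per pixel (0..3) */
void gb_unpack_frame(const uint8_t *packed, uint8_t *out) {
    for (int i = 0; i < GB_W * GB_H; i++)
        out[i] = (packed[i >> 2] >> ((i & 3) * 2)) & 0x3;
}
```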

IBM Boot ROM + Super Mario Land

There are no crazy memory mappers as in the NES or 2600 case. You can get most of the games to work with 2–3 mappers in total.
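As an illustration of how manageable this is, here is a sketch of the core of MBC1, the most common Gameboy mapper (simplified: RAM enable, the upper bank bits, and mode select are omitted, and the struct layout is mine, not the emulator's):

```c
#include <stdint.h>

typedef struct { const uint8_t *rom; uint32_t bank; } mbc1_t;

/* writes into ROM address space select which 16KB bank appears at 0x4000-0x7FFF */
static void mbc1_write(mbc1_t *m, uint16_t addr, uint8_t val) {
    if (addr >= 0x2000 && addr <= 0x3FFF) {
        m->bank = val & 0x1F;           /* 5-bit ROM bank number */
        if (m->bank == 0) m->bank = 1;  /* bank 0 is remapped to 1 */
    }
}

static uint8_t mbc1_read(const mbc1_t *m, uint16_t addr) {
    if (addr < 0x4000) return m->rom[addr];             /* fixed bank 0 */
    return m->rom[m->bank * 0x4000 + (addr - 0x4000)];  /* switchable bank */
}
```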

It turned out to be quite compact: overall, I was able to complete the C emulator in under 700 lines of code and my Verilog implementation in approximately 500.

It comes with the same simple version of Space Invaders as the arcade.

Gameboy version of Space Invaders — Basically no preprocessing needed!

So, here it is: my Dot-Matrix Gameboy from 1989 and an FPGA version running through HDMI on a 4K screen.

And this is something that my old Gameboy cannot do:

Hardware-accelerated Tetris; this is a real-time screen recording at 1/4 of the max speed.

Is it actually useful?

Yes, it is. So far, I have tested it in a simple setting where there is an external policy network which interacts with the Gameboy workers. More specifically, I used a distributed A3C (Asynchronous Advantage Actor-Critic) algorithm, and I'll describe that part in a separate post. A colleague of mine also connected it to an FPGA convnet and it works quite well. More to come in an upcoming post.
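The per-worker interaction loop is conceptually tiny. A minimal sketch, with a hypothetical API (`gb_create`, `gb_step`, and `policy_act` are illustrative names, not the actual interface):

```c
#include <stdint.h>

typedef struct gb gb_t;  /* opaque emulator instance */

extern gb_t *gb_create(const char *rom);
extern int   gb_step(gb_t *g, int action, uint8_t *frame); /* one step, returns reward */
extern int   policy_act(const uint8_t *frame);             /* forward pass on the NN side */

void worker(const char *rom, long steps) {
    uint8_t frame[160 * 144] = {0};  /* initial observation (blank) */
    gb_t *g = gb_create(rom);
    for (long t = 0; t < steps; t++) {
        int action = policy_act(frame);          /* policy picks a button */
        int reward = gb_step(g, action, frame);  /* emulator advances, fills frame */
        (void)reward;  /* in A3C, (frame, action, reward) feed the gradient update */
    }
}
```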

FPGA-NN communication

Distributed A3C setup

Mario Land: initial state. Pressing buttons at random does not get us far. The top-right corner shows the remaining time. If we get lucky and touch a Goomba, we finish quickly; if not, it takes 400 seconds to 'lose'.

Mario Land: after less than 1 hour of gameplay, Mario has learned to run and jump, and even discovers a secret room by going into a pipe.

Pac-Man: after about one hour of training, it can even finish the entire game once (eating all the dots), after which the game starts all over again.

Conclusion

I like to think of the next decade as a period when HPC and AI finally come together. I would like hardware which allows some degree of customization, depending on the AI algorithm of choice.

The Next Decade

Here is the GB C code: https://github.com/krocki/gb

Bonus A — Debugging

This part probably deserves an article on its own, but for now, I am too tired.

I often get asked: what was the hardest part? Everything… it was very painful. There is no such thing as a specification of the Gameboy to begin with. Everything we know is the result of reverse engineering: running some proxy task, such as a game or any other snippet, and observing whether it executes correctly. This is very different from standard software debugging, because here it is the hardware executing the code that is being debugged. I had to come up with ways of getting through the process. Did I mention that it's hard to see anything when it runs at 100MHz? Oh, and there's no printf.

One approach to implementing the CPU is to group instructions into clusters which do more or less the same thing. This is much easier with the 6502. The LR35902 has much more `random` stuff and many edge cases. Here is the chart which I used while I worked on the Gameboy CPU. I adopted a greedy strategy: take the largest chunk of instructions, implement it, cross it out, then repeat. In this case, 1/4 of the instructions are ALU operations and 1/4 are register loads, which can be implemented relatively quickly. On the other side of the spectrum, there are some outliers, such as `LD HL, SP+r8` (load SP plus a signed offset into HL), which need to be handled separately.
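As a sketch of what this clustering buys you: on the LR35902, opcodes 0x40-0x7F are all `LD r,r'` and 0x80-0xBF are all ALU operations, with the operands encoded in bit fields, so half of the 256-entry opcode table collapses into two small handlers. (The `cpu_t` here is illustrative; flags and the memory access behind (HL) are omitted.)

```c
#include <stdint.h>

/* registers in encoding order: B,C,D,E,H,L,(HL) placeholder, A */
typedef struct { uint8_t r[8]; } cpu_t;

static void exec(cpu_t *c, uint8_t op) {
    if (op >= 0x40 && op <= 0x7F && op != 0x76) { /* LD r,r' (0x76 is HALT, an edge case) */
        c->r[(op >> 3) & 7] = c->r[op & 7];
    } else if (op >= 0x80 && op <= 0xBF) {        /* ALU A,r */
        uint8_t x = c->r[op & 7];
        switch ((op >> 3) & 7) {  /* ADD,ADC,SUB,SBC,AND,XOR,OR,CP */
        case 0: c->r[7] += x; break;  /* ADD (flag updates omitted) */
        case 4: c->r[7] &= x; break;  /* AND */
        case 5: c->r[7] ^= x; break;  /* XOR */
        /* ... remaining ALU cases ... */
        }
    }
    /* everything else: the long tail of one-off instructions */
}
```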

Debugging: you run a piece of code on the hardware you are debugging and record a log for your implementation and for some reference implementation (here I compared my Verilog code, left, with my C emulator, right). Then you run diff on the logs to identify discrepancies (blue). One reason for automating this is that in many cases I found problems only after millions of cycles of execution, where a single CPU flag caused a snowball effect. I tried many approaches and this seemed to be one of the most effective ones.
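The whole trick is that both implementations dump one line of architectural state per executed instruction in an identical format. A minimal sketch (the choice of fields is illustrative):

```c
#include <stdio.h>
#include <stdint.h>

/* called once per executed instruction, by the C emulator and by the
   Verilog testbench alike; identical format strings make the logs diffable */
void trace(FILE *log, uint64_t cycle, uint16_t pc, uint8_t op,
           uint8_t a, uint8_t f, uint16_t sp) {
    fprintf(log, "%012llu PC=%04x OP=%02x A=%02x F=%02x SP=%04x\n",
            (unsigned long long)cycle, pc, op, a, f, sp);
}
```

Then `diff c.log verilog.log | head` points straight at the first divergent instruction, even if it happens millions of cycles in.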

You need coffee! Lots of it.

These books are 40 years old. It's incredible to go through them and see the world of computers through the eyes of the users back then. I felt like a visitor from the future.

Bonus B — OpenAI request for research

Initially, I wanted to approach the games from the memory viewpoint, as described in this post by OpenAI:

Surprisingly, getting Q-learning to work well when the inputs are RAM states has been unexpectedly challenging. This project might not be solvable. It would be surprising if it were to turn out that Q-learning would never succeed on the RAM variants of Atari, but there is some chance that it will turn out to be challenging.

Given that Atari games use only 128B of RAM, it was very appealing to process those 128 bytes instead of entire screen frames. I was getting mixed results, so I started digging into this.

While I cannot prove that it is impossible to learn directly from memory, I can show that the assumption that the memory captures the entire state of the game is wrong. The Atari 2600 CPU (6507) uses 128B of RAM, but it also has access to extra registers which 'live' in a separate circuit (the Television Interface Adapter, a kind of GPU). Those registers store and process information about objects (paddle, missile, ball, collisions). In other words, this information will NOT be available when only RAM is considered. Similarly, the NES and Gameboy have extra registers which are used for screen manipulation and scrolling. RAM alone does not reflect the full state of the game.
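In code terms, the difference looks roughly like this (the field names and sizes are illustrative, not an exact register map):

```c
#include <stdint.h>

typedef struct {
    uint8_t ram[128];   /* all that a RAM-variant agent gets to observe */
} ram_view_t;

typedef struct {
    uint8_t ram[128];
    uint8_t tia[64];    /* player/missile/ball positions, collision latches:
                           held in the TIA, never mirrored into RAM */
    /* + CPU registers, timers, ... */
} full_state_t;
```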

Only the 8080 stores data into VRAM in a direct way, which would allow retrieval of the entire state. In the other cases, the 'GPU' registers sit between the CPU and the screen buffer, outside RAM.

Trivia: if you're doing research on the history of GPUs, this might be the very first 'graphics accelerator': the Space Invaders board pairs the 8080 with an external hardware shift register that moves the invaders using a single command, offloading the CPU.

References

And many more…

Gameboy and NES are registered trademarks of Nintendo.

EOF