If you haven’t seen the video, check it out.

Reinforcement learning has seen a lot of success in recent years. AIs have beaten pro players in Go, Dota, and StarCraft, all through reinforcement learning. The former Go champion Lee Se-dol even quit the game altogether after his loss against the superior AI.

The Problem with Reinforcement Learning

But reinforcement learning has a huge problem: its success is limited to virtual environments. No AI manages to navigate the real world the way a human does; even a two-year-old is better at this than our most sophisticated AIs. Of course, it is a really complicated problem. It took evolution four billion years to create humans, and I still manage to look for my keys for ten minutes only to realize they were in my pocket all along, so it's not like intelligent life has reached its peak either.

Nonetheless, humans somehow manage to learn complicated tasks without dying in the process. If humans learned exactly the way our algorithms do, they would have to drive off a cliff thousands of times before realizing that staying on the road might not be such a bad idea in the first place.

And it gets worse: these algorithms are so data-hungry that a human lifespan would not be enough time to learn even a handful of somewhat difficult tasks. The Dota AI, for example, played the equivalent of 40,000 years of the game before it was able to beat a pro. Still impressive, of course, but speeding up time only works in virtual environments.

So how did OpenAI manage to control a physical arm using reinforcement learning without the benefits of a virtual environment?

Well, they didn’t. They used a virtual simulation. But can a simulation be so accurate that its results can simply be transferred to the real world? OpenAI concluded it cannot: the real world is too complicated, and factors like friction and elasticity are too hard to measure and simulate accurately.

This is the core problem they are trying to tackle with this experiment: the sim2real transfer problem, the challenge of applying knowledge learned in a simulation to the real world.

Automatic Domain Randomization (and why it’s easier to understand than it sounds)

Their solution to this problem is a method called Automatic Domain Randomization (ADR). The idea of ADR is to randomly generate more and more difficult environments with constantly changing factors as the AI's performance improves. This forces the AI to learn a general strategy that works across all the randomly generated environments, which in theory produces an AI so robust that its results transfer to the real world.

Let’s go into a bit more detail. It sounds very complicated, but it is pretty simple. Let me introduce someone to you: the ADR algorithm. He’s like an evil teacher.

Great illustration

He likes to see his students struggle, and the AI, in this case, is his student. In front of him are a few knobs. The first one controls the size of the Rubik’s Cube, the next the friction of the hand, and the last one how much the cube weighs. There are a few more parameters, but those aren't the important part.

As the AI begins to train, it is still struggling, so ADR does nothing; the knobs stay in a fixed position. Then, at some point, the AI becomes quite good and manages to solve the cube most of the time. Since ADR does not like to see the AI succeed, he begins turning the knobs randomly. But even though ADR is evil, his heart is in the right place, so he doesn't go all out: he turns the knobs just a little, enough to make the AI struggle again, but not so much that learning becomes impossible. This goes on forever: as the AI improves, ADR increases its range of randomness.
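To make the knob-turning loop concrete, here is a minimal sketch in Python. Everything in it (the `EnvParameter` class, the `adr_update` function, the 0.7 success threshold, the step sizes) is my own illustrative assumption, not OpenAI's actual code or numbers.

```python
import random

# Hypothetical sketch of the ADR "knob" loop described above; names and
# numbers are illustrative assumptions, not OpenAI's implementation.

class EnvParameter:
    """One knob, e.g. cube size. Values are sampled uniformly from [low, high]."""

    def __init__(self, name, default, step):
        self.name = name
        self.low = default    # the range starts collapsed to a single fixed value
        self.high = default
        self.step = step      # how far one "turn of the knob" widens the range

    def sample(self):
        return random.uniform(self.low, self.high)

    def widen(self):
        self.low -= self.step
        self.high += self.step


KNOBS = [
    EnvParameter("cube_size_scale", default=1.0, step=0.01),
    EnvParameter("hand_friction",   default=1.0, step=0.01),
    EnvParameter("cube_mass_scale", default=1.0, step=0.01),
]

SUCCESS_THRESHOLD = 0.7  # assumed stand-in for "solves the cube most of the time"


def make_randomized_env():
    """Sample one training environment from the current randomization ranges."""
    return {knob.name: knob.sample() for knob in KNOBS}


def adr_update(recent_success_rate):
    """The evil teacher: only once the student succeeds, widen a knob's range a bit."""
    if recent_success_rate >= SUCCESS_THRESHOLD:
        random.choice(KNOBS).widen()
```

The real ADR described in OpenAI's paper is more careful than this: it tracks performance separately at the low and high boundary of each parameter's range and widens or narrows that specific boundary, rather than nudging a random knob. But the core idea is the same: the ranges only grow while the AI is doing well.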

How the size of the Rubik’s Cube changes over time.

Okay, now we understand how ADR works, but what about the actual AI controlling the robot arm? I’ll give you a short summary of their setup.

Under the hood

The secret sauce that makes this work is a combination of: