01. May 2017

Self-driving cars in the browser

The goal of this project was to create a fully self-learning agent that uses reinforcement learning to control a car in a 2D top-down environment. It is written solely in JavaScript.

reinforcement learning, simulation, ddpg

Note: this works only in modern browsers, so make sure you are on the newest version 🤘

This is a project I have been working on for quite some time now. These cars learned how to drive by themselves. They received feedback on which actions were good and which were bad in the form of a reward based on their current speed. Each car is powered by a neural network.

You can drag the mouse to draw obstacles, which the cars must avoid. Play around with this demo and get excited about machine learning!

The following is a more detailed description of how this works. You may stop reading here and just play with the demo if you're not interested in the technical background!


Concepts

A short introduction to a few reinforcement learning concepts.

Agent: The agent, in this case, is the driver of the car.

Action: At each timestep $t$ the agent has to take an action $a_t$. This may, for example, be steering left, going faster or braking.

State: The state $s_t$ is a vector that describes the environment the agent is currently in. It contains the information the agent uses to make a reasonable decision on what to do. After each action the state is updated to reflect the changes in the environment.

Reward: After taking an action in a state, the agent receives a reward $r_t$, which describes how good the action it took was. The goal of the agent is to maximise this reward.

Reinforcement learning: Learning what actions to take in order to maximise a given reward function. In math words: learning a policy $a_t = \pi(s_t)$ that maximises the cumulative future reward.
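To make these terms concrete, here is a minimal sketch of the interaction loop in JavaScript. It is not the project's code: `runEpisode`, `discountedReturn` and `toyEnv` are hypothetical names, the toy environment is invented for this example, and the discount factor `gamma` is an assumption (with `gamma = 1` the function is a plain sum of rewards).

```javascript
// Hypothetical helper: sums rewards r_t, r_{t+1}, ... with a discount factor
// gamma, so that rewards further in the future count for less.
function discountedReturn(rewards, gamma) {
  let total = 0;
  // Iterate backwards so each step is a single multiply-add.
  for (let t = rewards.length - 1; t >= 0; t--) {
    total = rewards[t] + gamma * total;
  }
  return total;
}

// Generic interaction loop: observe s_t, choose a_t = policy(s_t), receive r_t.
// `env` and `policy` are placeholders for any environment and decision function.
function runEpisode(env, policy, steps) {
  const rewards = [];
  let state = env.reset();
  for (let t = 0; t < steps; t++) {
    const action = policy(state);
    const result = env.step(action); // expected shape: { state, reward }
    rewards.push(result.reward);
    state = result.state;
  }
  return rewards;
}

// Toy environment (made up for this example): the state is the current speed,
// the action is an acceleration, and the reward is the new speed.
const toyEnv = {
  speed: 0,
  reset() {
    this.speed = 0;
    return [this.speed];
  },
  step(action) {
    this.speed = Math.max(0, this.speed + action);
    return { state: [this.speed], reward: this.speed };
  },
};

// A fixed policy that always accelerates by 1, run for three timesteps.
const episodeRewards = runEpisode(toyEnv, () => 1, 3);
const episodeReturn = discountedReturn(episodeRewards, 0.5);
```

A learning agent would replace the fixed policy with one whose parameters are adjusted to make this return as large as possible.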

Neural networks

The agents learn by adjusting the weights of their neural networks, which act as function approximators. In this case there are two networks: one maps a state to an action (3-layer, 150 neurons), and one maps a state plus an action to a q-value (2-layer, 200 neurons). The q-value describes how good an action is. By learning the second network, "the value network", you obtain policy gradients, which you can then use to train the first network. The first network, "the actor network", is the decision maker. This algorithm is called "Deep Deterministic Policy Gradient", or DDPG for short.

Combining this with state-of-the-art techniques, such as prioritised experience replay buffers, ReLU non-linearities and the Adam optimiser, results in the cars you can see above.

Even though this might seem straightforward at first, a lot of the trouble with neural networks these days lies in the hyper-parameter search. There are at least a dozen parameters you need to tune in order to achieve optimal results, which is a drawback. In the future this might be overcome by automatic hyper-parameter search, which iterates over a set of hyper-parameter configurations and picks the best one.

Note, however, that the demo above does not train the neural networks, so the cars are not learning while you watch. Training is done offline.

Figure: Deep deterministic policy gradients (DDPG).
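How a value network yields a gradient for the actor can be illustrated with a deliberately tiny example. The sketch below is an assumption-laden stand-in, not the networks described above: the actor is a single linear layer, and the critic is a hand-written function (hypothetical, not learned) that rates an action highest when it equals twice the first state entry. The update applies the chain rule: the gradient of Q with respect to the action, times the gradient of the action with respect to the actor's weights.

```javascript
// Stand-in critic (an assumption for illustration, not a learned network):
// it rates action a in state s highest when a equals 2 * s[0].
const toyCritic = {
  q(s, a) {
    return -Math.pow(a - 2 * s[0], 2);
  },
  // Gradient of Q with respect to the action.
  dqda(s, a) {
    return -2 * (a - 2 * s[0]);
  },
};

// Linear deterministic actor: a = theta . s
function act(theta, s) {
  return theta.reduce((sum, w, i) => sum + w * s[i], 0);
}

// One deterministic policy gradient step:
// dQ/dtheta_i = dQ/da * da/dtheta_i, and da/dtheta_i = s[i] for a linear actor.
function actorStep(theta, s, learningRate) {
  const a = act(theta, s);
  const g = toyCritic.dqda(s, a);
  // Gradient ascent on Q: move the weights in the direction that raises Q.
  return theta.map((w, i) => w + learningRate * g * s[i]);
}

// Repeatedly update the actor on a handful of example states.
let theta = [0];
const exampleStates = [[1], [0.5], [-1], [2]];
for (let epoch = 0; epoch < 200; epoch++) {
  for (const s of exampleStates) {
    theta = actorStep(theta, s, 0.05);
  }
}
```

After these updates the single weight ends up close to 2, the value the toy critic prefers. In full DDPG the critic is itself a network that is trained at the same time, so the actor is chasing a moving target.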

Sensors

The state of the agent (that is, the input to the neural nets) consists of two time-steps: the current one and the previous one. This helps the agent make decisions based on how things moved over time. For each time-step the agent receives information about its environment. This includes 19 distance sensors, which are arranged at different angles. You can think of these sensors as beams that stop when they hit an object. The shorter the beam, the higher the input to the agent (0 for no hit, 1 for a very short beam). In addition, a time-step contains the current speed of the agent. In total, the input to the neural networks is 158-dimensional.

Imagine sitting in a room with a computer, looking at 158 numbers on the screen and having to press left or right in order to increase some other number, namely the reward. That is what this agent is doing. Isn’t that crazy?
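The way sensor beams and speed are packed into a state vector can be sketched as follows. The function names, the linear mapping from beam length to input, and the `maxRange` parameter are all assumptions made for illustration; the text above only fixes the end points (0 for no hit, 1 for a very short beam). The example uses three beams instead of 19, and it does not try to reproduce the full 158-dimensional input.

```javascript
// Maps a beam length to a network input: 0 when nothing is hit within range,
// approaching 1 as the obstacle gets closer. `maxRange` is an assumed sensor
// range, and `null` is used here to mean "the beam hit nothing".
function beamToInput(hitDistance, maxRange) {
  if (hitDistance === null || hitDistance >= maxRange) return 0; // no hit
  return 1 - Math.max(0, hitDistance) / maxRange;
}

// Features for one time-step: all sensor inputs followed by the current speed.
function timestepFeatures(hitDistances, speed, maxRange) {
  return hitDistances.map((d) => beamToInput(d, maxRange)).concat([speed]);
}

// The state stacks the current and the previous time-step, which lets the
// network see how the readings changed between the two.
function buildState(current, previous) {
  return current.concat(previous);
}

// Three beams per time-step: the first hits nothing, the other two hit an
// obstacle that is closer in the current time-step than in the previous one.
const previousStep = timestepFeatures([null, 8, 2], 0.4, 10);
const currentStep = timestepFeatures([null, 6, 1], 0.5, 10);
const sensorState = buildState(currentStep, previousStep);
```

With three beams and one speed value per time-step, `sensorState` has eight entries. Comparing the two halves of the vector shows the obstacle getting closer.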

Exploration

A major issue with DDPG is exploration. In regular DQN (deep Q-networks) you have discrete actions to choose from, so you can easily cover the state-action space by randomising actions epsilon-greedily. In continuous action spaces (as is the case with DDPG) this is not as easy. In this project I used dropout as a way to explore. This means randomly dropping some neurons of the last layer of the actor network and thereby obtaining some variation in the actions.
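Dropout as a source of exploration can be sketched like this. The names are hypothetical, and several details are assumptions rather than facts about the project: the dropped units are taken to be the activations feeding the actor's output, surviving activations are rescaled ("inverted dropout"), and the output is squashed with tanh.

```javascript
// Randomly zeroes entries of `activations` with probability `dropProbability`.
// Survivors are scaled by 1 / (1 - p) so the expected value is unchanged.
// `rng` can be swapped for a deterministic function to make runs repeatable.
function dropout(activations, dropProbability, rng = Math.random) {
  const keepScale = 1 / (1 - dropProbability);
  return activations.map((x) => (rng() < dropProbability ? 0 : x * keepScale));
}

// Output of a hypothetical final actor layer: weighted sum of the last hidden
// activations, squashed to [-1, 1] with tanh (for example a steering command).
function actorOutput(hidden, weights, bias) {
  const sum = hidden.reduce((acc, h, i) => acc + h * weights[i], bias);
  return Math.tanh(sum);
}

// Exploratory action: drop some hidden units before the output is computed,
// so the same state can produce slightly different actions on each call.
function exploreAction(hidden, weights, bias, dropProbability, rng = Math.random) {
  return actorOutput(dropout(hidden, dropProbability, rng), weights, bias);
}

// Example call with made-up activations and weights and a 20% drop rate.
const exploratoryAction = exploreAction([0.2, 0.7, 0.1], [1.5, -0.4, 0.9], 0, 0.2);
```

Unlike epsilon-greedy, which replaces an action outright, this perturbs the network itself, so the resulting actions stay continuous and remain within the range the actor can produce.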

Multi-agent learning

In addition to applying dropout to the actor network, I put 4 agents into the virtual environment at the same time. All these agents share the same value network but have different actors, and therefore have different approaches to different states, so every agent explores a different area of the state-action space. All in all, this resulted in better and faster convergence.
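Structurally, sharing one value network between several actors might look like the sketch below. Everything here is hypothetical: the critic is a stub that only counts how often it is asked to learn, and the actor parameters are placeholder objects. The point is only that each agent holds its own actor but a reference to the same critic.

```javascript
// One value network object shared by every agent. This stand-in does no
// learning; it only counts how many transitions it has been given.
function createSharedCritic() {
  return {
    updates: 0,
    learn(transition) {
      this.updates += 1;
    },
  };
}

// Each agent gets its own actor parameters but a reference to the same critic.
function createAgents(count, critic, initActor) {
  return Array.from({ length: count }, (_, i) => ({
    id: i,
    actor: initActor(i), // independent parameters per agent
    critic, // the same object for all agents
  }));
}

const sharedCritic = createSharedCritic();
const agents = createAgents(4, sharedCritic, (i) => ({ weights: [0.1 * (i + 1)] }));

// Every agent's experience is fed to the same critic.
for (const agent of agents) {
  agent.critic.learn({ agent: agent.id, reward: 0 });
}
```

Because the critic sees transitions from all four agents, it is trained on a wider range of states and actions than any single agent would visit alone.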

If you want to hear more about the progress of the project as I add new features, I encourage you to follow me on Twitter @janhuenermann! Additionally, feel free to share the project on social media, so more people can get excited about AI!