I can’t shake the hunch that Tesla is going to use reinforcement learning to train its neural networks on path planning/driving policy tasks. This seems to be the only way you could hope to get 10x superhuman performance on these tasks, at least the complex ones that can’t be handled by a relatively simple heuristic.*

*(I think the theory of many engineers going into self-driving cars circa 2010 was that when humans drive, we approximate heuristics. For example, we approximate a lane keeping heuristic that keeps the midpoint of the car aligned with the midpoint between the lane lines, but since the human brain isn’t a computer we eventually lose attention or fall asleep and stop implementing the heuristic. Robot cars that implement the heuristics with computers will therefore be superior.

In recent years, it has become more evident that certain driving subtasks require predicting other road users’ behaviour and interacting with other road users, such as signalling intent by nudging into a lane. The more crowded and chaotic an urban setting is, the more predictive and interactive ability it requires.

This realization has led engineers to explore neural network-based alternatives to heuristics for path planning/driving policy tasks, namely imitation learning and reinforcement learning.

Some tasks like highway lane keeping, though, might genuinely be solvable with a heuristic like people originally thought.)
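The kind of lane keeping heuristic described in the footnote above could be sketched as a simple proportional controller; the gain, units, and lane geometry here are made up for illustration, not any real vehicle's control law:

```python
# Illustrative sketch of a lane keeping heuristic: a proportional controller
# that steers the car's midpoint toward the midpoint between the lane lines.
# Gain and coordinates are arbitrary, for illustration only.

def steering_command(car_midpoint, lane_left, lane_right, gain=0.5):
    """Return a steering correction that nudges the car toward lane center."""
    lane_center = (lane_left + lane_right) / 2.0
    error = lane_center - car_midpoint  # positive => car is left of center
    return gain * error                 # steer proportionally toward center

print(steering_command(car_midpoint=2.0, lane_left=0.0, lane_right=4.0))  # 0.0 (centered)
print(steering_command(car_midpoint=1.5, lane_left=0.0, lane_right=4.0))  # 0.25 (steer right)
```

The point of the footnote is that a computer never gets tired of running this loop, whereas a human approximating it eventually loses attention.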

Karpathy said at Autonomy Day: “We’ve also even used it [simulation] for training, quite successfully so.” This could refer to either supervised learning or reinforcement learning (or both).

But what is the point of using reinforcement learning in a simulator? One possible reason is to train on counterfactual scenarios, or purely hypothetical ones. Another reason is to avoid real world risk. But the main reason is that simulation is a way to get a volume of experience that wouldn’t be possible in the real world.

For instance, OpenAI Five trained on 45,000 years of play. AlphaStar (I believe) had 60,000 years of simulated experience across all the different versions of its agent. Simulation is one way to cheaply and quickly accrue that amount of experience.

Illustration by Notfruit

Another way is having lots and lots of robots. Currently, Tesla has 460,000 Hardware 2 and 3 vehicles each driving 1 hour per day and accruing a collective 1,575 years of driving experience per month. That’s an annualized rate of about 19,000 years of experience per year.

A fleet of 1 million vehicles will accrue 3,425 years of experience per month, which annualizes to about 41,000 years of experience per year. You can get comparable scale to simulation by just having lots of robots.
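The fleet arithmetic above can be checked in a few lines; the 1 hour of driving per vehicle per day is the estimate used in the text:

```python
# Checking the fleet-experience arithmetic: collective driving experience
# accrued by a fleet, assuming 1 hour of driving per vehicle per day.
HOURS_PER_YEAR = 24 * 365  # 8,760

def fleet_years_per_month(vehicles, hours_per_day=1, days_per_month=30):
    """Collective years of driving experience the fleet accrues in a month."""
    return vehicles * hours_per_day * days_per_month / HOURS_PER_YEAR

print(round(fleet_years_per_month(460_000)))         # ≈ 1,575 years/month
print(round(fleet_years_per_month(1_000_000)))       # ≈ 3,425 years/month
print(round(12 * fleet_years_per_month(1_000_000)))  # ≈ 41,000 years annualized
```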

To train on counterfactual or hypothetical scenarios, real world reinforcement learning can always be supplemented with simulated reinforcement learning (or even simulated supervised learning, à la ChauffeurNet). Doing it in the real world doesn’t preclude also doing it in simulation.

If the system is already somewhat safer than the median and mean human driver (thanks to some combination of imitation learning and heuristics), and if the system is constantly being monitored by a vigilant human in the driver’s seat, then the risk of doing real world reinforcement learning is minimized.

Researchers have also explored ideas for doing safe reinforcement learning with robots and cars. For example, Mobileye has proposed a system of hard constraints around what a vehicle is allowed to do. The system is called RSS (Responsibility-Sensitive Safety). Nvidia recently announced a similar system called SFF (Safety Force Field).
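The general idea behind systems like RSS and SFF is a safety layer that vetoes unsafe actions regardless of what the learned policy proposes. Here is a heavily simplified sketch of that pattern; the braking-distance formula and constants are illustrative, not Mobileye's or Nvidia's actual rules:

```python
# Hedged sketch of a hard-constraint safety filter: veto any proposed
# action that would violate a minimum safe following distance.
# The formula and numbers are illustrative only, not RSS or SFF.

def safe_following_distance(speed_mps, reaction_time_s=1.0, decel_mps2=4.0):
    """Distance needed to react and then brake to a stop (simplified)."""
    return speed_mps * reaction_time_s + speed_mps**2 / (2.0 * decel_mps2)

def filter_action(proposed_accel, gap_m, speed_mps):
    """Override the policy's proposed acceleration if the gap is unsafe."""
    if gap_m < safe_following_distance(speed_mps):
        return min(proposed_accel, -4.0)  # force braking instead
    return proposed_accel

print(filter_action(proposed_accel=1.0, gap_m=100.0, speed_mps=20.0))  # 1.0 (gap is safe)
print(filter_action(proposed_accel=1.0, gap_m=30.0, speed_mps=20.0))   # -4.0 (gap too small)
```

A constraint layer like this is what would let a reinforcement learning policy explore in the real world without being allowed to do anything catastrophic.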

To make the leap from supervised learning/imitation learning to reinforcement learning, the biggest problem I foresee is reward engineering or reward learning. That is, what constitutes success or failure? What outcomes prompt reward or punishment? The simplest way to do it would be to just treat every human intervention as a punishment. Elon may have alluded to this in his interview with Lex Fridman, when he said “view all input as error”.
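The simplest version, treating every intervention as a punishment, might look something like this; the reward values and field names are hypothetical, not Tesla's actual design:

```python
# Hypothetical sketch of "view all input as error": any human takeover
# yields a large negative reward, uneventful driving a small positive one.
# All numbers and names are illustrative.

def reward(step):
    """step describes one timestep of driving."""
    if step["human_intervened"]:
        return -100.0  # punishment: the human driver took over
    return 1.0         # small reward per intervention-free timestep

episode = [
    {"human_intervened": False},
    {"human_intervened": False},
    {"human_intervened": True},   # driver grabbed the wheel
]
print(sum(reward(s) for s in episode))  # -98.0
```

Even this trivial scheme encodes the key trade-off: the policy is paid a little for every timestep it keeps the human's hands off the wheel, and charged a lot when it fails.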

A more complicated way to determine the reward function would be to do reward learning from demonstration (e.g. T-REX). You produce ranked examples of better and worse driving, and use those to learn a reward function rather than try to exactly specify one manually.
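The core of the T-REX idea can be sketched as a pairwise ranking loss: fit a reward function so that trajectories ranked better accrue more total reward than trajectories ranked worse. The toy features, trajectories, and linear reward model below are illustrative, not T-REX's actual architecture:

```python
import math

# Toy sketch of reward learning from ranked demonstrations: given
# (worse, better) trajectory pairs, learn a reward function under which
# the better trajectory has higher return. Everything here is illustrative.

def total_reward(w, traj):
    # Trajectory return = sum of per-state rewards, linear in 2-D features.
    return sum(w[0] * s[0] + w[1] * s[1] for s in traj)

good = [(1.0, 0.1), (0.9, 0.0)]   # e.g. smooth, centered driving
bad  = [(0.1, 0.9), (0.0, 1.0)]   # e.g. jerky, lane-weaving driving
pairs = [(bad, good)]             # ranked pairs: (worse, better)

w, lr = [0.0, 0.0], 0.5
for _ in range(200):
    for worse, better in pairs:
        # Bradley-Terry probability that `better` outranks `worse`.
        diff = total_reward(w, better) - total_reward(w, worse)
        p = 1.0 / (1.0 + math.exp(-diff))
        # Gradient ascent on the log-likelihood of the observed ranking.
        for i in range(2):
            grad = sum(s[i] for s in better) - sum(s[i] for s in worse)
            w[i] += lr * (1.0 - p) * grad

print(total_reward(w, good) > total_reward(w, bad))  # True
```

The learned reward function, rather than a hand-specified one, would then be what the reinforcement learning policy optimizes against.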

So, I suspect Tesla’s approach to each individual driving task — such as changing lanes on the highway or making unprotected left turns at urban intersections — is going to be something like this: