With approximately 500,000 vehicles on the road equipped with what Tesla claims is full self-driving hardware, Tesla’s fleet is driving about as many miles each day – around 15 million – as Waymo’s fleet has driven in its entire existence. At 15 million miles a day, that extrapolates to 5.4 billion miles a year, roughly 200x Waymo’s expected total a year from now. The fleet is also growing by approximately 5,000 cars per week.

There are three key areas where data makes a difference:

Computer vision

Prediction

Path planning/driving policy

Computer vision

One important computer vision task is object detection. Some objects, such as horses, only appear on the road rarely. Whenever a Tesla encounters what the neural network thinks might be a horse (or perhaps just an unrecognized object obstructing a patch of road), the cameras will take a snapshot, which will be uploaded later over wifi. It helps to have vehicles driving billions of miles per year because you can source many examples of rare objects. It stands to reason that, over time, Teslas will become better at recognizing rare objects than Waymo vehicles.
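In heavily simplified form, a snapshot trigger of the sort described above might look like the sketch below. The rare-class list and the confidence threshold are assumptions for illustration, not Tesla’s actual values:

```python
# Hypothetical snapshot trigger (illustrative only, not Tesla's code).
RARE_CLASSES = {"horse", "deer", "debris"}   # assumed rare-object labels
CONFIDENCE_FLOOR = 0.6                       # assumed uncertainty threshold

def should_snapshot(class_probs: dict) -> bool:
    """Fire a camera snapshot when the detector is unsure what it sees,
    or when it is confident it sees a rare object."""
    top_class = max(class_probs, key=class_probs.get)
    if class_probs[top_class] < CONFIDENCE_FLOOR:   # possibly unrecognized object
        return True
    return top_class in RARE_CLASSES                # recognized, but rare

print(should_snapshot({"car": 0.95, "horse": 0.05}))   # False: common, confident
print(should_snapshot({"car": 0.45, "horse": 0.40}))   # True: low confidence
print(should_snapshot({"horse": 0.90, "car": 0.10}))   # True: rare object
```

On the car, a gate like this decides which frames ever leave the vehicle – which is what makes billions of miles of driving searchable for rare cases without uploading everything.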

For common objects, the bottleneck for Waymo and Tesla is most likely paying people to manually label the images. It’s easy to capture more images than you can pay people to label. But for rare objects, the bottleneck for Waymo is likely collecting images in the first place, whereas for Tesla the bottlenecks are likely just labelling and developing the software to trigger snapshots at the right time. This is a much better position to be in.

Tesla’s Director of AI, Andrej Karpathy, explains in this clip (taken from his Autonomy Day presentation) how Tesla sources images to train object detection:

Prediction

Prediction is the ability to anticipate the movements and actions of cars, pedestrians, and cyclists a few seconds ahead of time. Anthony Levandowski, who for years was one of the top engineers at Waymo, recently wrote that “the reason why nobody has achieved” full autonomy “is because today’s software is not good enough to predict the future.” Levandowski claims the main category of failures for autonomous vehicles is incorrectly predicting the behaviour of nearby cars and pedestrians.

Tesla’s fleet of approximately 500,000 vehicles is a fantastic resource here. Any time a Tesla makes an incorrect prediction about a car or pedestrian, the Tesla can save a data snapshot to later upload and add to Tesla’s training set. Tesla may be able to upload an abstract representation of the scene (wherein objects are visualized as colour-coded cuboid shapes and pixel-level information is thrown away) produced by its computer vision neural networks, rather than upload video. This would radically reduce the bandwidth and storage requirements of uploading this data.
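Some back-of-the-envelope arithmetic shows why. All the numbers below – cuboid fields, video bitrate, camera count – are assumptions chosen for illustration, not Tesla’s figures:

```python
# Rough size comparison: abstracted scene replay vs. compressed video.
FLOATS_PER_CUBOID = 9        # x, y, z, length, width, height, heading, vx, vy
BYTES_PER_FLOAT = 4
BYTES_PER_CUBOID = FLOATS_PER_CUBOID * BYTES_PER_FLOAT + 2   # + uint16 class id

def scene_bytes(n_objects: int, fps: int, seconds: int) -> int:
    """Bytes for an abstracted replay: one cuboid per object per frame."""
    return n_objects * BYTES_PER_CUBOID * fps * seconds

# A 10-second replay with 20 tracked objects at 30 frames per second:
abstract = scene_bytes(20, 30, 10)             # 228,000 bytes (~0.2 MB)

# The same 10 seconds of compressed 720p video at an assumed 2 Mbit/s,
# multiplied across the car's 8 cameras:
video = 2_000_000 // 8 * 10 * 8                # 20,000,000 bytes (20 MB)

print(video // abstract)   # 87 – nearly two orders of magnitude smaller
```

Even with generous assumptions for the video codec, the abstracted replay wins by a wide margin, and the gap grows with video quality.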

Whereas images used to train object detection require human labelling, a prediction neural network can learn correlations between past and future just from temporal sequences of events. What behaviour precedes what behaviour is inherent in any recording (video or abstracted). Andrej Karpathy explains the process in the clip below:

Since there is no need for humans to label the data, Tesla can train its neural networks on as much useful data as it can collect. This means the size of its training dataset will correlate with its overall mileage. As with object detection, the advantage over Waymo isn’t just more data for predicting common behaviours, but the ability to collect data on rare behaviours seen in rare situations in order to predict those as well.
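The “labels for free” property is easy to see in miniature. The sketch below slices one recorded trajectory into (past, future) training pairs; the window lengths and the toy track are illustrative:

```python
# Sketch: self-supervised (past, future) pairs from one recorded track.
def make_prediction_pairs(track, past_len=5, future_len=3):
    """Slice a recorded sequence into (past, future) training pairs.
    The 'label' is just the future of the same recording - no human needed."""
    pairs = []
    for t in range(past_len, len(track) - future_len + 1):
        pairs.append((track[t - past_len:t], track[t:t + future_len]))
    return pairs

track = list(range(10))            # e.g. an object's logged x-positions
pairs = make_prediction_pairs(track)
print(len(pairs))                  # 3 free training examples from one track
print(pairs[0])                    # ([0, 1, 2, 3, 4], [5, 6, 7])
```

Every additional mile of driving yields more tracks, and every track yields training pairs with no labelling cost – which is why dataset size tracks fleet mileage.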

Path planning/driving policy

Path planning and driving policy refer to the actions that a car takes: staying centred in its lane at the speed limit, changing lanes, passing a slow car, making a left turn on a green light, nudging around a parked car, stopping for a jaywalker, and so on. It seems fiendishly difficult to specify a set of rules that encompass every action a car might ever need to take under any circumstance. One way around this fiendish difficulty is to get a neural network to copy what humans do. This is known as imitation learning (also sometimes called apprenticeship learning, or learning from demonstration).

The training process is similar to how a neural network learns to predict the behaviour of other road users by drawing correlations between past and future. In imitation learning, a neural network learns to predict what a human driver would do by drawing correlations between what it sees (via the computer vision neural networks) and the actions taken by human drivers.
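A toy version makes the mechanics concrete. The sketch below fits a one-parameter policy to synthetic (state, human action) pairs; the data and the linear model are stand-ins for illustration, not Tesla’s setup:

```python
# Minimal behaviour cloning: fit a policy to copy logged human actions.
# Hypothetical logged pairs: lane offset (m) -> human steering command.
states  = [-1.0, -0.5, 0.0, 0.5, 1.0]
actions = [ 0.8,  0.4, 0.0, -0.4, -0.8]   # humans steer back toward centre

# Fit action = w * state by gradient descent on the squared imitation error.
w, lr = 0.0, 0.1
for _ in range(200):
    grad = sum(2 * (w * s - a) * s for s, a in zip(states, actions)) / len(states)
    w -= lr * grad

policy = lambda offset: w * offset        # the learned "driving policy"
print(round(w, 3))                        # -0.8: the policy copies the humans
```

The real networks map a far richer world state to a far richer action space, but the training signal is the same: minimize the gap between what the network would do and what the human actually did.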

Still frame from Tesla’s autonomous driving demo. Courtesy of Tesla.

Imitation learning recently met with arguably its greatest success yet: AlphaStar. DeepMind used examples from a database of millions of human-played games of StarCraft to train a neural network to play like a human. The network learned the correlations between the game state and human players’ actions, and thereby learned to predict what a human would do when presented with a game state. Using only this training, AlphaStar reached a level of ability that DeepMind estimates would put it roughly in the middle of StarCraft’s competitive rankings. (Afterward, AlphaStar was augmented using reinforcement learning, which is what allowed it to ascend to pro-level ability. A similar augmentation may or may not be possible with self-driving cars – that’s another topic.)

Tesla is applying imitation learning to driving tasks, such as how to handle the steep curves of a highway cloverleaf, or how to make a left turn at an intersection. It sounds like Tesla plans to extend imitation learning to more tasks over time, like how and when to change lanes on the highway. Karpathy describes how Tesla uses imitation learning in this clip:

As with prediction, it may be sufficient to upload an abstract representation of the scene surrounding the car, rather than upload video. This would imply much lower bandwidth and storage requirements.

Also as with prediction, no human labelling is needed once the data is uploaded. Since the neural network is predicting what a human driver would do given a world state, all it needs are the world state and the driver’s actions. Imitation learning is, in essence, predicting Tesla drivers’ behaviour, rather than predicting the behaviour of other road users that Teslas see around them. As with AlphaStar, all the information needed is contained within the replay of what happened.

Based on Karpathy’s comments about predicting cut-ins, Tesla can trigger a car to save a replay when it fails to correctly predict whether a vehicle ahead will cut into the Tesla’s lane. Similarly, Tesla may capture replay data when a neural network involved in path planning or driving policy fails to correctly predict the Tesla driver’s actions. Elon Musk has alluded to this capability (or something similar) in the past, although it’s not clear if it’s currently running in Tesla cars.

The inverse would be when a Tesla is on Autopilot or in the upcoming urban semi-autonomous mode and the human driver takes over. This could be a rich source of examples where the system does something incorrectly and the human driver promptly demonstrates how to do it correctly.
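Taken together, these two signals – a misprediction and a driver takeover – suggest a simple capture rule along the following lines; the function, signal names, and buffer mechanism are invented for illustration:

```python
# Hypothetical sketch of the two replay triggers just described: a wrong
# behaviour prediction, or a human takeover right after the system acts.
def should_save_replay(predicted_cut_in: bool, observed_cut_in: bool,
                       driver_took_over: bool) -> bool:
    """Save the rolling replay buffer on a misprediction or a takeover."""
    misprediction = predicted_cut_in != observed_cut_in
    return misprediction or driver_took_over

uploads = []
replay_buffer = ["frame_t-2", "frame_t-1", "frame_t"]

for pred, obs, takeover in [(True, True, False),    # correct, no takeover: skip
                            (False, True, False),   # missed cut-in: save
                            (True, True, True)]:    # driver intervened: save
    if should_save_replay(pred, obs, takeover):
        uploads.append(list(replay_buffer))

print(len(uploads))   # 2
```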

Other ways to capture interesting replays include sudden braking or swerving, automatic emergency braking, crashes or collision warnings, and more sophisticated machine learning techniques known as anomaly detection and novelty detection. (These same conditions could also be used to trigger replay captures for prediction, or camera snapshots for object detection.) If Tesla already knows what it wants to capture, such as left turns at intersections, it can set up a trigger to capture a replay whenever the vision neural networks see a traffic light and the left turn signal is activated, or the steering wheel turns left.
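A “known target” trigger like that left-turn example could be as simple as a boolean over signals the car already has. The signal names and steering threshold below are invented for illustration:

```python
# Hypothetical declarative trigger for capturing left turns at intersections.
def left_turn_trigger(signals: dict) -> bool:
    """Capture a replay when vision sees a traffic light AND the driver
    either signals left or steers left past an assumed threshold."""
    return bool(signals.get("traffic_light_visible")) and (
        bool(signals.get("left_turn_signal"))
        or signals.get("steering_angle_deg", 0.0) < -15.0   # assumed threshold
    )

print(left_turn_trigger({"traffic_light_visible": True,
                         "left_turn_signal": True}))          # True
print(left_turn_trigger({"traffic_light_visible": False,
                         "steering_angle_deg": -30.0}))       # False: no light
```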

Conclusion

Tesla has an advantage over Waymo (and other competitors) in three key areas thanks to its fleet of roughly 500,000 vehicles:

Computer vision

Prediction

Path planning/driving policy

Concerns about collecting the right data, paying people to label it, or paying for bandwidth and storage don’t negate these advantages. They are addressed by designing good triggers, using data that doesn’t need human labelling, and uploading abstracted representations (replays) instead of raw video.

The majority view among business analysts, journalists, and the general public appears to be that Waymo is far in the lead with autonomous driving, and Tesla isn’t close. This view doesn’t make sense when you look at the first principles of neural networks.

What’s more, AlphaStar is a proof of concept of large-scale imitation learning for complex tasks. If you are skeptical that Tesla’s approach is the right one, or that path planning/driving policy is a tractable problem, you have to explain why imitation learning worked for StarCraft but won’t work for driving.

I predict that – barring a radical move by Waymo to increase the size of its fleet – the view that Waymo is far in the lead and Tesla is far behind will be widely abandoned within the next 1-3 years. People have been focusing too much on demos that reveal little about system robustness, on deeply limited disengagement metrics, and on Google/Waymo’s access to top machine learning engineers and researchers. They have been focusing too little on training data, particularly for the rare objects and behaviours where Waymo doesn’t have enough data to do machine learning well, or at all.