Tesla’s Computer Vision Master Plan

Creating a fully self-driving car with cameras, deep neural networks, and data from customers’ cars

I became a Tesla investor shortly after the mind-blowing Hardware 2 announcement on October 19, 2016. From that day forward, Tesla started equipping every car it produced with a sensor suite designed for full self-driving. Along with cameras, radar, and ultrasonic sensors, Tesla also included an upgradable onboard computer dedicated to the perception and planning tasks that full self-driving requires.

Combined with the long-standing ability to deliver over-the-air software updates to its cars, Tesla created something that took my breath away: a turnkey self-driving car. Around 150,000 Hardware 2 Teslas are currently driving on roads around the world. These otherwise conventional vehicles are a software update away from becoming fully self-driving cars.

The truly brilliant part of this plan is not the ability to instantly create an enormous fleet of self-driving taxis, although that is certainly a virtue. It’s the unprecedented flood of data that is rushing in through the sensors of these 150,000 or so vehicles, coupled with other data like GPS coordinates and driver input. This offers a scale of real world testing and training that is new in the history of computer science.

For comparison, Alphabet subsidiary Waymo has a computer simulation that contains 25,000 virtual cars. What this really means is that Waymo has 25,000 simulation scenarios running in parallel at any given time. These scenarios generate data from 8 million miles of simulated driving per day. I estimate that Tesla’s 150,000 real cars drive 5.5 million miles per day. In a few months, Tesla’s fleet will grow enough that its daily real world miles will match and exceed Waymo’s simulated miles.
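The back-of-envelope math here is simple enough to sketch. The per-car daily mileage below is my own assumption, chosen to be consistent with the 5.5-million-mile fleet estimate above; it is an illustration, not a figure from Tesla.

```python
# Back-of-envelope comparison of Tesla's real-world fleet miles
# versus Waymo's simulated miles. The per-car daily mileage is an
# illustrative assumption, not a number published by Tesla.

WAYMO_SIMULATED_MILES_PER_DAY = 8_000_000
TESLA_FLEET_SIZE = 150_000
MILES_PER_CAR_PER_DAY = 37  # assumed average for a personal vehicle

tesla_miles_per_day = TESLA_FLEET_SIZE * MILES_PER_CAR_PER_DAY
print(f"Tesla fleet miles/day: {tesla_miles_per_day:,}")  # 5,550,000

# Fleet size at which real-world miles overtake Waymo's simulation:
break_even_fleet = WAYMO_SIMULATED_MILES_PER_DAY / MILES_PER_CAR_PER_DAY
print(f"Break-even fleet size: {break_even_fleet:,.0f} cars")
```

On these assumptions, a fleet of roughly 216,000 cars matches Waymo's simulated daily mileage, which is why a few more months of production closes the gap.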

Real world data is of course incomparably more valuable than simulation data. Indeed, real world data is the bottleneck that prevents companies from running an arbitrarily large number of useful simulation scenarios. What Tesla has created is a technological wonder: “a large, distributed, mobile data centre” for collecting real world driving data on a scale that can otherwise only be approximated in simulation.

The reason to collect so much data is to feed the deep neural networks that underpin a car’s ability to perceive. Here’s a brief background. Deep neural networks started to gain popularity in 2012, after a deep neural network won the ImageNet Challenge, a computer vision contest focused on image classification. For the first time in 2015, a deep neural network slightly outperformed the human benchmark for the ImageNet Challenge. (In an interesting twist, the human benchmark is AI researcher Andrej Karpathy, who is now Director of AI at Tesla.) The ImageNet Challenge tests for only a narrow subset of the capabilities of human visual perception. Still, the fact that computers can outperform humans on even some visual tasks is exciting for anyone who wants computers to do things better than humans can. Things like driving.

More data is better

The performance of deep neural networks tends to improve the more data that a network is trained on. As ARK Invest writes in its report on deep learning (a term synonymous with deep neural networks):

The performance of deep learning programs is correlated highly to the amount of data used for training. While the performance of other machine learning algorithms plateau with more data, those associated with deep learning continue to scale with more training data, as shown below. Thanks to the internet’s size and scale, deep learning has thrived with access to very large datasets at a minimal cost.

The Internet is one place to get very large datasets. For real world robotics applications like self-driving cars, that data has to be collected in the field. That’s why it’s so important that Tesla equipped all its cars with the hardware necessary to collect that data out in the world.

To put this into more concrete terms, Google conducted a study where it trained deep neural networks on the ImageNet dataset of 1 million images and its own dataset of 300 million images. The study found “a logarithmic relationship between performance on vision tasks and the amount of training data”. So, performance continues to increase even when the dataset grows 300x, although with “decreasing marginal performance as the dataset grows.”

Crucially, the Google researchers argue that “it is highly likely that these results are not the best ones you can obtain when using this scale of data.” That’s because the hyper-parameters used to train these deep neural networks are based on prior experience with the ImageNet dataset, which is 1/300th the size. With new hyper-parameters optimized for massively larger datasets, better performance might be possible. If a company like Tesla ever acquired 300 million pieces of data to train its deep neural networks, it would be incentivized to invest the “considerable computational effort” required to optimize its hyper-parameters.

Also of importance is the Google researchers’ finding that “to fully exploit 300M images, one needs higher capacity (deeper) models.” A deeper neural network has more layers of artificial neurons. This finding is particularly interesting given that Tesla hired Jim Keller, a microprocessor engineer who previously worked at AMD and Apple, to design custom microprocessors for running deep neural networks. Just as GPUs allow cars, drones, and robots to run deep neural networks that would be infeasible to run in real time using CPUs, custom microprocessors can run much deeper neural networks that would be infeasible to run on GPUs. These deeper neural networks could, in theory, fully exploit much larger datasets.

As mentioned above, Tesla designed its Hardware 2 cars to allow the onboard computer to be easily upgraded at a service appointment. This means that all existing Hardware 2 cars could be given the custom microprocessors. While the upgrade would be costly to implement, the return on investment would likely be excellent, given the immense financial opportunity that self-driving cars represent.

Breaking lidar’s mystique

The number one criticism of Tesla’s self-driving strategy is that the Hardware 2 sensor suite doesn’t include lidar, and therefore (in the eyes of critics) is insufficient for full self-driving. Autonomy-grade lidar is prohibitively expensive, so it’s not possible for Tesla to include it in its production cars. As far as I’m aware, no affordable autonomy-grade lidar product has yet been announced. It looks like that is still years away.

Lidar has accrued an aura of magic in the popular imagination. Lidar is thought of as the secret sauce of self-driving cars, and its prohibitive cost is seen as one of the main obstacles to making self-driving cars affordable and widespread. Part of this is just sampling bias. Waymo is the oldest and best-known self-driving car company, and it strongly believes in lidar. Most people have only heard one side of the argument.

Perhaps part of lidar’s aura is also psychological. It is easier to swallow the new and hard-to-believe idea of self-driving cars if you tell the story that they are largely enabled by a cool, futuristic laser technology that gives cars a complete awareness of their environment. It is harder to swallow the idea that if you plug some regular ol’ cameras into a bunch of deep neural networks, somehow that makes a car capable of driving itself through complicated city streets. And yet this must be at least partially true, since we’ve seen Tesla demo full self-driving using only cameras:

My strong hunch is that this is the crux of the matter: it’s tempting to believe that hardware is the secret sauce, but really it’s software. Primarily, it’s deep neural networks.

Lidar is lauded for its high spatial precision. But cameras can provide high spatial precision too, given the right software. Lidar has a precision of 1.5 centimetres (0.6 inches), about the width of a finger. In one study, cameras equipped to a car had a precision of 10 centimetres (3.9 inches), which is about the width of a finger plus the length of a credit card. That seems like an allowable amount of imprecision, and it isn’t even the theoretical best cameras can do. It’s just one good stab at the problem by a small team of academic researchers.

Lidar also has a perceptual weakness. It functions poorly in heavy rain, snow, or fog. When the laser pulse hits a raindrop, snowflake, or fog droplet, the water refracts it, and the pulse gets scattered. It may be possible to solve or mitigate this problem in software, but I’ve seen very little research on this front. If lidar really is the secret sauce for self-driving cars, we may simply have to accept that self-driving cars won’t work in heavy rain, snow, or fog. On the other hand, if cameras are sufficient for self-driving cars to drive in all the same weather conditions that humans can, then lidar isn’t really necessary after all.

Other sensors do better in inclement weather. Like other companies, Tesla pairs its cameras with radar, which can see through rain, snow, and fog, as well as dust and other occlusions. Radar waves also bounce, allowing a self-driving car to see past an occluding vehicle by bouncing a radar signal underneath it. This gives Tesla’s vehicles a complementary sensor capability and sensor redundancy. If a car’s cameras don’t pick up on an obstacle ahead, either because it’s visually occluded or the cameras simply fail in that instance, the radar input might be enough for the car to hit the brakes.
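The redundancy idea reduces to a simple “either sensor suffices” rule. As a toy sketch (this is a hypothetical illustration of the principle, not Tesla’s actual fusion logic, and the threshold is made up):

```python
# Toy sketch of sensor redundancy: brake if *either* modality is
# confident about an obstacle ahead. Hypothetical illustration of
# the principle; not Tesla's actual fusion logic.

def should_brake(camera_confidence, radar_confidence, threshold=0.8):
    """Return True if either sensor is confident an obstacle is ahead.

    camera_confidence may be low because the obstacle is visually
    occluded (fog, an intervening vehicle) even while radar, which
    penetrates those occlusions, still detects it.
    """
    return camera_confidence >= threshold or radar_confidence >= threshold

# Camera blinded by fog, but radar bounced under the car ahead:
print(should_brake(camera_confidence=0.1, radar_confidence=0.95))  # True
```

The design choice worth noting is the OR: combining sensors this way lowers the odds of missing an obstacle, at the cost of more false-positive braking events.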

Lidar’s perceptual acuity has been widely celebrated and marvelled at. If you follow self-driving cars, you’ve probably seen images like this:

We see a rich, detailed 3D representation of a car’s environment using lidar data. It makes intuitive sense how a car could use this to navigate through streets, stop in the right places, turn in the right directions, and not hit anything. But it turns out that deep neural networks can take input from a regular ol’ monocular camera and produce a representation of the world that is equally impressive, if not more so:

Using just a camera, the car can detect and classify individual objects like cars and people (bottom right quadrant). It can estimate depth (bottom left quadrant). And it can semantically segment its field of view into regions like road, sidewalk, median, traffic, crowd, and sky (top right quadrant). To my non-expert eyes, cameras with the right software can do the same perceptual heavy-lifting as lidar.
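At its core, semantic segmentation assigns every pixel a class label: the network emits a score per class for each pixel, and the label map is the per-pixel argmax. A minimal sketch of that final step, with illustrative class names and a tiny hand-made “image” of scores:

```python
# Minimal sketch of the last step of semantic segmentation: a
# network emits per-class scores for every pixel, and the label
# map is the per-pixel argmax. Scores and class names here are
# illustrative, not from any real model.

CLASSES = ["road", "sidewalk", "vehicle", "pedestrian", "sky"]

def label_map(scores):
    """scores: an H x W grid where each cell is a list of per-class scores."""
    return [
        [CLASSES[max(range(len(CLASSES)), key=cell.__getitem__)]
         for cell in row]
        for row in scores
    ]

# A 1x2 "image": the left pixel scores highest for road,
# the right pixel for sky.
scores = [[[0.7, 0.1, 0.1, 0.05, 0.05],
           [0.1, 0.1, 0.1, 0.1, 0.6]]]
print(label_map(scores))  # [['road', 'sky']]
```

The hard part, of course, is producing good per-pixel scores in the first place; that is what the deep neural network is trained for, and what all that fleet data feeds.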

HD maps

HD maps add a layer of redundancy to cameras’ real time perceptual capabilities. HD maps are pre-compiled 3D representations of the fixed elements of the driving environment. Kind of like Google Street View, but way more hardcore. lvl5, a startup making HD maps with cameras, claims its best maps can localize objects to within 10 centimetres.

Here’s a 3D point cloud map of a street corner, created by stitching together some iPhone photos. It isn’t a full HD map, but it’s a handy visual aid.

An HD map would look similar, but everything in the whole environment would be at the same resolution as the stop sign and the hydrant, or better.

Tesla can compile HD maps by using the 360-degree cameras on its Hardware 2 cars to capture images of everywhere they drive. This would cover a lot of ground, and allow maps to be updated on a frequent basis, perhaps daily. Changes to the environment would get mapped quickly.

HD maps provide redundancy to a car’s real time sensor data. The closest counterpart for a human driver would be your memory of an area you’re familiar with. You’re less likely to make a mistake when driving in a familiar area because you’re relying on both real time perception and memory, not just real time perception. So too with self-driving cars. HD maps are the cars’ collective memory of what the world looks like.

This applies to fixed objects like lane dividers, but also in theory to unfixed objects like parked cars and moving objects like pedestrians. Wherever the car’s real time sensor data deviates from the HD maps’ expectation of what it should see, it can assign a higher probability that there is an obstacle like a parked car or a pedestrian in that spot.
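The deviation check described above can be sketched as a comparison between the map’s expectation and the live sensor reading, cell by cell. Everything below (the grid values, the threshold, the probabilities) is an illustrative assumption, not Tesla’s unpublished algorithm:

```python
# Toy sketch of using an HD map as a prior: grid cells where the
# live sensor reading deviates from the map's expectation get a
# raised obstacle probability. All values here are illustrative
# assumptions, not Tesla's actual (unpublished) algorithm.

def obstacle_probabilities(map_heights, observed_heights, threshold=0.3):
    """Compare observed heights (metres) against the HD map's
    expectation for each grid cell; a large deviation suggests an
    unmapped object such as a parked car or a pedestrian."""
    probs = []
    for expected, observed in zip(map_heights, observed_heights):
        deviation = abs(observed - expected)
        probs.append(0.9 if deviation > threshold else 0.1)
    return probs

map_row      = [0.0, 0.0, 0.0, 0.15]  # empty road, then a low kerb
observed_row = [0.0, 1.4, 0.0, 0.15]  # something ~1.4 m tall in cell 1
print(obstacle_probabilities(map_row, observed_row))
# [0.1, 0.9, 0.1, 0.1] -- cell 1 likely holds a parked car
```

This is exactly the memory analogy from above: the map supplies the expectation, the sensors supply the observation, and attention goes wherever the two disagree.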

Enhanced Autopilot’s evolving capabilities

Tesla’s full self-driving technology is not yet available to the public, but customers have access to Enhanced Autopilot, its advanced driver assistance software for Hardware 2 cars. Enhanced Autopilot partially automates certain elements of the driving task. It requires driver supervision and intervention if it starts making a mistake.

Enhanced Autopilot serves as a real world test of the perceptual capabilities of Tesla’s deep neural networks. For example, this video of Enhanced Autopilot navigating an improvised lane in a construction zone suggests to me the capability to semantically segment driveable roadways without the use of lane lines:

In my mind, progress on Enhanced Autopilot will serve as a proxy for Tesla’s progress on full self-driving. That’s because Enhanced Autopilot is, in theory, a subset of full self-driving, using the same hardware and relying on the same fundamental perceptual capabilities.

The master plan, in brief

In brief, Tesla’s computer vision master plan is:

1. Equip all production cars with cameras and radar.
2. Train deep neural networks with the flood of data from production cars.
3. Supplement real time sensing with HD maps, compiled from images taken by production cars.
4. Deliver incremental Enhanced Autopilot updates over the air to Tesla customers.
5. Eventually, deliver updates that enable full self-driving.
6. At some point, produce a custom microprocessor for deep neural networks that all production cars can be upgraded to. Possibly use this hardware to run deeper neural networks than would otherwise be practical in a car.

In my mind, this is the most ambitious plan to develop self-driving cars that anyone on Earth is pursuing. I am watching with bated breath to see if it works. I’ve put money and hope on the line.

If successful, Tesla will catapult forward a technology that could save over 1 million lives per year and create a global industry the size of the entire economy of China. Cities will reclaim millions of acres of land devoted to parking. Electric vehicle adoption will radically accelerate, mitigating climate change and making cities cleaner and quieter. Point-to-point automotive transportation will become accessible to everyone, regardless of age, income, or disability. Drivers will free up hundreds of hours per year – time that was previously spent driving.

That’s a future I want to live in. I hope to see it come alive.