By: Shuyang Cheng and Gabriel Bender*

At Waymo, machine learning plays a key role in nearly every part of our self-driving system. It helps our cars see their surroundings, make sense of the world, predict how others will behave, and decide their next best move.

Take perception: our system employs a combination of neural nets that enables our vehicles to interpret sensor data, identify objects, and track them over time so that they can build a deep understanding of the world around them. Creating these neural nets is often time-consuming; optimizing neural net architectures to achieve both the quality and the speed needed to run on our self-driving cars is a complex process of fine-tuning that can take our engineers months for a new task.

Now, through a collaboration with Google AI researchers from the Brain team, we’re putting cutting-edge research into practice to automatically generate neural nets. What’s more, these state-of-the-art neural nets are of higher quality and run faster than the ones manually fine-tuned by engineers.

To bring our self-driving technology to different cities and environments, we will need to optimize our models for different scenarios quickly. AutoML enables us to do just that, providing a large set of ML solutions efficiently and continuously.

Transfer Learning: Using existing AutoML architectures

Our collaboration started out with a simple question: could AutoML generate a high-quality, low-latency neural net for the car?

Quality measures the accuracy of the answers produced by the neural net. Latency, also called inference time, measures how fast the net provides its answers. Because driving requires real-time answers and our system is safety-critical, our neural nets need to operate with low latency. Most of our nets that run directly on our vehicles provide results in less than 10 ms, which is quicker than many nets deployed in data centers that run on thousands of servers.
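To make the latency definition concrete, here is a minimal sketch of how inference time could be measured against a budget. It is not our production tooling: `measure_latency_ms`, `dummy_predict`, and the warm-up and run counts are hypothetical names and values chosen for illustration, and only the 10 ms budget comes from the text above.

```python
import statistics
import time


def measure_latency_ms(predict, example, warmup=10, runs=100):
    """Return (median, p99) wall-clock latency in ms for predict(example)."""
    for _ in range(warmup):  # warm up caches before timing
        predict(example)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(example)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Nearest-rank style 99th percentile, clamped to the last sample.
    p99 = samples[min(len(samples) - 1, int(0.99 * len(samples)))]
    return statistics.median(samples), p99


def dummy_predict(points):
    # Hypothetical stand-in for a neural net forward pass.
    return [p * 0.5 for p in points]


LATENCY_BUDGET_MS = 10.0  # the "less than 10 ms" figure quoted above

median_ms, p99_ms = measure_latency_ms(dummy_predict, list(range(1000)))
print(f"median={median_ms:.3f} ms, p99={p99_ms:.3f} ms, "
      f"within budget: {p99_ms < LATENCY_BUDGET_MS}")
```

Reporting a tail percentile alongside the median is a design choice in this sketch: for a real-time system, the slow cases tend to matter more than the typical one.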

In their original AutoML paper[1], our Google AI colleagues were able to automatically explore more than 12,000 architectures to solve the classic image recognition task of CIFAR-10: classify a small image into one of ten categories, such as a car, a plane, or a dog. In a follow-up paper[2], they discovered a family of neural net building blocks, called NAS cells, that could be composed automatically to build nets that outperform hand-crafted ones on CIFAR-10 and similar tasks. With this collaboration, our researchers decided to use these cells to automatically build new models for tasks specific to self-driving, thus transferring what was learned on CIFAR-10 to our field. Our first experiment was with a semantic segmentation task: label each point in a LiDAR point cloud as a car, a pedestrian, a tree, and so on.

One example of a NAS cell. This cell processes inputs from the two previous layers in a neural net.

To do this, our researchers set up an automatic search algorithm to explore hundreds of different NAS cell combinations within a convolutional neural net (CNN) architecture, training and evaluating models for our LiDAR segmentation task. When our engineers fine-tune these nets by hand, they can explore only a limited number of architectures, but with this method, we automatically explored hundreds. We found models that improved on the previously hand-crafted ones in two ways:

Some had a significantly lower latency with a similar quality.

Others had an even higher quality with a similar latency.
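The two kinds of improvement listed above correspond to the idea of a Pareto front over quality and latency: a model is worth keeping if no other model is at least as good on both axes and strictly better on one. The sketch below illustrates that filter. The `Candidate` type, the model names, and all numbers are made up for illustration and are not results from our search.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    quality: float      # higher is better
    latency_ms: float   # lower is better


def dominates(a, b):
    """True if a is at least as good as b on both axes and better on one."""
    return (a.quality >= b.quality and a.latency_ms <= b.latency_ms
            and (a.quality > b.quality or a.latency_ms < b.latency_ms))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Made-up numbers, for illustration only.
candidates = [
    Candidate("hand_crafted", 0.80, 8.0),
    Candidate("faster_same_quality", 0.80, 5.0),
    Candidate("better_same_latency", 0.85, 8.0),
    Candidate("worse_on_both", 0.75, 9.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy data, the hand-crafted baseline drops off the front because one candidate matches its quality at lower latency, which mirrors the first kind of improvement described above.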

Given this initial success, we applied the same search algorithm to two additional tasks related to the detection and localization of traffic lanes. The transfer learning technique also worked for these tasks, and we were able to deploy three newly trained and improved neural nets on the car.

End-to-End Search: Searching for new architectures from scratch

We were encouraged by these first results, so we decided to go one step further by looking more widely for completely new architectures that could provide even better results. By not limiting ourselves to combining the already discovered NAS cells, we could look more directly for architectures that took into account our strict latency requirements.

Conducting an end-to-end search ordinarily requires exploring thousands of architectures manually, which carries large computational costs. Exploring a single architecture requires several days of training on a data center computer with multiple GPU cards, meaning it would take thousands of days of computation to search for a single task. Instead, we designed a proxy task: a scaled-down LiDAR segmentation task that could be solved in just a matter of hours.

One challenge that the team had to overcome was finding a proxy task similar enough to our original segmentation task. We experimented with several proxy task designs before we could establish a good correlation between the quality of architectures on the proxy task and their quality on the original task. We then launched a search similar to the one from the original AutoML paper but now on the proxy task: a proxy end-to-end search. This was the first time this concept had been applied to LiDAR data.
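One common way to check whether a proxy task ranks architectures the same way as the original task is a rank correlation such as Spearman's. The text does not say which statistic the team used, so treat the choice here as an assumption. The sketch below computes it with the standard library only; the quality scores are invented for illustration.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented quality scores for five architectures on each task.
proxy_quality = [0.61, 0.70, 0.55, 0.80, 0.66]
full_quality = [0.72, 0.78, 0.70, 0.88, 0.79]
rho = spearman(proxy_quality, full_quality)
print(f"rank correlation: {rho:.2f}")
```

A value close to 1 would suggest that picking the best architectures on the proxy task is a reasonable shortcut for picking them on the original task.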

Proxy end-to-end search: Explore thousands of architectures on a scaled-down proxy task, apply the 100 best ones to the original task, validate, and deploy the best of the best architectures on the car.
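The two-stage flow in the caption above can be sketched as a random search followed by a shortlist evaluation. Everything in this block is a toy stand-in: the search space, `proxy_score`, and `full_score` are hypothetical placeholders for training and evaluating real models, and latency is left out to keep the sketch short.

```python
import random


def sample_architecture(rng):
    # Hypothetical search space: depth and width of a small CNN.
    return {"layers": rng.randint(2, 12),
            "width": rng.choice([16, 32, 64, 128])}


def proxy_score(arch, rng):
    # Placeholder for "train on the scaled-down task and measure quality".
    return arch["layers"] * 0.02 + arch["width"] * 0.001 + rng.gauss(0, 0.01)


def full_score(arch, rng):
    # Placeholder for the expensive evaluation on the original task.
    return arch["layers"] * 0.02 + arch["width"] * 0.001 + rng.gauss(0, 0.005)


def proxy_search(num_candidates, top_k, seed=0):
    """Random search on the proxy task, then re-evaluate the top_k in full."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng) for _ in range(num_candidates)]
    # Stage 1: cheap evaluation of every candidate on the proxy task.
    scored = [(proxy_score(arch, rng), arch) for arch in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    shortlist = [arch for _, arch in scored[:top_k]]
    # Stage 2: expensive evaluation of the shortlist only.
    full = [(full_score(arch, rng), arch) for arch in shortlist]
    return max(full, key=lambda pair: pair[0])[1]


best = proxy_search(num_candidates=1000, top_k=100)
print(best)
```

The saving comes from stage 2 touching only `top_k` architectures, so the expensive evaluation runs 100 times here instead of 1,000.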

We used several search algorithms, optimizing for both quality and latency, since low latency is critical on the vehicle. Looking at different types of CNN architectures and using different search strategies, such as random search and reinforcement learning, we were able to explore more than 10,000 different architectures for the proxy task. By using the proxy task, what would have taken over a year of computational time on a Google TPU cluster took only two weeks. We found even better nets than we had before when we had just transferred the NAS cells: