Predicting the future

Predicting the future is an important part of designing an artificial driving intelligence. It allows us to evaluate the quality of a given action and reason about scene dynamics: predicting how pedestrians and other cars will act around us. Doing this accurately and reliably is one of the hardest challenges in the autonomous driving industry today.

Current approaches are typically unable to accurately estimate multi-modal uncertainty and predict the stochasticity of the future. Systems that can are often not differentiable or end-to-end trainable, meaning that they reason over hand-coded representations. Perhaps most importantly, they often predict the behaviour of each actor in the scene independently, and rely on an HD-map to predict the static scene. This fails in many multi-agent interaction situations, such as traffic merging, and when the HD-map contains errors, for example due to roadworks.

In this blog we describe a new paradigm that is able to overcome these shortcomings. It allows our end-to-end deep driving policy to explicitly reason about the future.

Consider this urban driving scene (five video frames covering one second of context):

There are many possible futures as we approach this four-way intersection. Here are a few predictions from our probabilistic deep learning model, demonstrating diverse and plausible future prediction. These samples predict 10 frames, or two seconds into the future. The image on the right shows semantic segmentation of the scene, labelling pixels as belonging to classes such as road (mauve), pavement (pink), vehicle (blue), building (grey), etc.

You can see our probabilistic model is able to predict different modes: turning left, turning right or continuing straight through this intersection.

Our model can also predict the behaviour of dynamic agents in the scene, showing moving traffic. Note the second car in the oncoming lane is occluded in the past context, but is predicted accurately in the future. Our representation can understand semantics, geometry and motion.

From left: input sequence, semantic segmentation, segmentation uncertainty, monocular depth and optical flow predictions.

We have released full details of this work in our arXiv preprint here. This work also follows our previous blog, Dreaming to Drive, where we wrote about using predictive models for episodic replay in model-based reinforcement learning.

Learning to align the present distribution with the future distribution

The core idea behind this work is that the future is uncertain, given our limited observations at the present time. By collecting driving data, we can observe one unique instance of this future distribution. However, this is only one instance; we need to understand the distribution of possible futures. This distribution may be multi-modal and complex. For example, the car in front of us may stop or continue driving. Or when we turn a corner, the static environment may have a clear road or contain road works.

To solve this, we construct a model which learns the probability of future events. As the event space is very high-dimensional, we learn a compressed latent space in which to calculate the probability distribution, called the future distribution. This future distribution is trained from privileged information, as it must be trained from the observed future sequences. Therefore we learn a second distribution, the present distribution, which only has access to past data, and is trained to match the future distribution through a Kullback-Leibler divergence loss.
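As a rough illustration of the matching loss, the sketch below computes a Kullback-Leibler divergence between two diagonal Gaussians in plain Python. The Gaussian parameterisation, the direction of the divergence (future relative to present) and the function name kl_diag_gaussians are assumptions made for this example; the preprint describes the actual formulation.

```python
import math

def kl_diag_gaussians(mu_f, sigma_f, mu_p, sigma_p):
    """KL(F || P) for diagonal Gaussians F = N(mu_f, sigma_f^2), P = N(mu_p, sigma_p^2).

    Assumption: both distributions are diagonal Gaussians over the latent space.
    """
    total = 0.0
    for mf, sf, mp, sp in zip(mu_f, sigma_f, mu_p, sigma_p):
        # Closed-form KL between two 1-D Gaussians, summed over latent dimensions.
        total += math.log(sp / sf) + (sf ** 2 + (mf - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

# Hypothetical 2-D latent: the future distribution has seen the outcome and is
# narrow, while the present distribution only saw the past and is broad.
future_mu, future_sigma = [0.5, -1.0], [0.3, 0.2]
present_mu, present_sigma = [0.0, 0.0], [1.0, 1.0]
loss = kl_diag_gaussians(future_mu, future_sigma, present_mu, present_sigma)
```

Minimising a loss of this kind with respect to the present distribution's parameters pulls it towards the future distribution, even though it never sees the future frames directly.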

We can then sample from the present distribution during inference, when we do not have access to the future. We observe that this paradigm allows the model to learn accurate and diverse probabilistic future prediction outputs.
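To make the inference step concrete, here is a minimal sketch of drawing several latent samples from a present distribution. It assumes a diagonal Gaussian with hypothetical parameters; sample_latent and the numbers used are illustrative only and are not taken from our model.

```python
import random

def sample_latent(mu, sigma, rng):
    """Draw one latent vector from a diagonal Gaussian N(mu, sigma^2)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

# Hypothetical present distribution, computed from past frames only.
present_mu, present_sigma = [0.0, 1.0], [1.0, 0.5]

rng = random.Random(0)  # seeded so the sketch is repeatable
# Each sample would be decoded into a different possible future.
samples = [sample_latent(present_mu, present_sigma, rng) for _ in range(3)]
```

Each latent sample corresponds to one candidate future, so drawing more samples is how different modes, such as turning left or continuing straight, can be explored.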

Deep learning architecture

Our model learns a spatio-temporal feature to jointly predict future scene representation (semantic segmentation, depth, optical flow) and train an autonomous driving policy. The neural network architecture contains five components:

Perception: an image scene understanding model
Dynamics: a temporal model
Present/Future Distribution: our probabilistic framework
Future Prediction: a module that predicts future video scene representation
Control: a module which trains a driving policy using expert driving demonstrations
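One way to picture how these components could fit together at inference time is the toy data-flow sketch below. Every function is a hypothetical stub operating on plain numbers, and the wiring between components is an assumption made for illustration; it is not our network.

```python
import random

def perception(frame):
    # Stub: reduce a "frame" (list of pixel values) to one scene feature.
    return sum(frame) / len(frame)

def dynamics(features):
    # Stub temporal model: summarise per-frame features as (mean, change over time).
    return (sum(features) / len(features), features[-1] - features[0])

def present_distribution(state):
    # Stub: map the past-only state to a 1-D Gaussian (mean, std).
    return (state[1], 1.0)

def future_prediction(state, latent, horizon):
    # Stub decoder: roll the feature forward, steered by the sampled latent.
    return [state[0] + latent * (step + 1) for step in range(horizon)]

def control(state):
    # Stub policy head: a hypothetical (speed, steering) pair.
    return (max(0.0, state[0]), state[1])

def run_inference(frames, horizon, rng):
    features = [perception(frame) for frame in frames]
    state = dynamics(features)
    mu, sigma = present_distribution(state)
    latent = mu + sigma * rng.gauss(0.0, 1.0)  # sample one possible future
    return future_prediction(state, latent, horizon), control(state)

# Five past frames in, ten future steps out, mirroring the example above.
frames = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.4, 0.5], [0.5, 0.6]]
futures, action = run_inference(frames, horizon=10, rng=random.Random(0))
```

The future distribution is omitted here because it needs the observed future frames, which are only available during training.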

More details on the architecture can be found in the arXiv preprint.

Results: what our model predicts

We trained our model on over 200 hours of driving data, collected in London, UK. The model was trained in PyTorch for 5 days on a server with eight NVIDIA 2080Ti GPUs. Here are our results.

Predicting plausible, multi-modal futures. Our model is able to generate diverse, multi-modal futures. This is important for high dimensional and complex urban driving scenes. Here we show our model predicting different modes, driving different routes through a four-way intersection.

Next, if we advance the video a little to the point where our vehicle has started to make a left turn, we see different futures when we sample. Now that it is no longer plausible for our vehicle to take any other turn, all samples show the left turning mode.

Understanding scene semantics, geometry and motion. Here are examples that show our system is able to reason about different representations in a multi-task fashion. We show our representation decoding to semantic segmentation, depth and optical flow.

This scene shows future prediction in a complex unprotected right turn.

Predicting multi-agent interactions. Our model is also able to predict multi-agent behaviour in urban driving scenes. Here are some examples of predicting the future behaviour of various other dynamic agents. In the first example, our model predicts stationary traffic in front of us. In the second, we predict pulling away with traffic.

This demonstrates our model can jointly predict interactions between ourselves and third-party agents.

Predicting pedestrian behaviour. Our representation also understands how to predict basic pedestrian behaviour (pedestrians are predicted in red in this semantic segmentation image). However, there is work to be done to better represent individual instances.

Multi-lane road driving. In our final example, we show the model predicting different behaviour on a multi-lane road. The first sample shows the vehicle driving straight. In the second, it veers around the parked vehicle before moving towards the left turning lane.

Multi frame, multi future

There has been a lot of great work in this area which has inspired this research. However, we’re excited about the scale of this new model, both in terms of input-state dimensionality (relatively large resolution videos) and the size of the training dataset and model.

Previous work on probabilistic future prediction focused on trajectory forecasting [DESIRE, Lee et al. 2017, Bhattacharyya et al. 2018, PRECOG, Rhinehart et al. 2019] or was restricted to single-frame image generation and low resolution (64x64) datasets that are either simulated (Moving MNIST) or with static scenes and limited dynamics (KTH actions, Robot Pushing dataset) [Lee et al. 2018, Denton et al. 2018]. Our new framework reasons in a more complete scene representation with segmentation, depth and flow and can generate entire video sequences on complex real-world urban driving data with ego-motion and complex interactions.

Further, our variational approach to modelling the stochastic nature of the future does not use multi-modal training data, i.e. data where multiple output labels are provided for a given input. We instead learn directly from real-world driving data, where we can only observe one future reality, and demonstrate that we generate diverse and plausible futures.

How this benefits autonomous driving

Improves control performance. We observe improvement in control performance metrics when using this future-aware representation, compared to a baseline. This is particularly true for multi-agent interaction scenarios.

Understanding uncertainty. Awareness of the size of the future distribution can illustrate how multi-modal or uncertain a scene is for our model. This is useful for understanding edge-cases and when the model needs to ‘pay more attention’.
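One possible way to put a number on the "size" of a distribution is its differential entropy. The sketch below assumes a diagonal Gaussian latent distribution and uses entropy as the measure; both are assumptions made for illustration, and other measures of spread would work too.

```python
import math

def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a diagonal Gaussian with std devs sigma."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s ** 2) for s in sigma)

# Hypothetical scenes: a quiet straight road versus a busy intersection.
quiet_road = gaussian_entropy([0.5, 0.5])
busy_intersection = gaussian_entropy([2.0, 2.0])
needs_more_attention = busy_intersection > quiet_road
```

A broader distribution gives a larger entropy, which could be used to flag scenes where many different futures remain plausible.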

Injecting prior knowledge. We strongly advocate for end-to-end learning, but with finite data and compute we need to be clever about injecting domain knowledge into our models. We do this with neural network architectural choices and auxiliary learning signals, while still training end-to-end. This model is an excellent example of this.

What is the future of this future prediction work?

More structured representation. Right now, the only structure we enforce in our latent state is perspective projection, with spatial dimensions inferred from the input image. We know that perspective projection is sub-optimal in many ways, compared to other scene representations. We also know a lot about scene dynamics from object semantics and structure from motion, for example. There are many ways our representation could leverage these ideas to predict more accurately and further into the future.

Model-based reinforcement learning (MBRL). One of the hardest challenges in scaling MBRL systems to real-world robots is building high-dimensional dynamics models. Our future prediction model is a great starting point for use in an MBRL system. Further analysis will be required to understand how well the future prediction can explore and generalise across the state space, and to construct a reward model from the future latent states.

For more details, here is a video explanation and the research paper: