GQN builds upon a large literature of recent related work in multi-view geometry, generative modelling, unsupervised learning and predictive learning, which we discuss here, in the Science paper and the Open Access version. It illustrates a novel way to learn compact, grounded representations of physical scenes. Crucially, the proposed approach does not require domain-specific engineering or time-consuming labelling of the contents of scenes, allowing the same model to be applied to a range of different environments. It also learns a powerful neural renderer that is capable of producing accurate images of scenes from new viewpoints.

Our method still has many limitations when compared to more traditional computer vision techniques, and has currently only been trained to work on synthetic scenes. However, as new sources of data become available and advances are made in our hardware capabilities, we expect to be able to investigate the application of the GQN framework to higher resolution images of real scenes. In future work, it will also be important to explore the application of GQNs to broader aspects of scene understanding, for example by querying across space and time to learn a common sense notion of physics and movement, as well as applications in virtual and augmented reality.

While there is still much more research to be done before our approach is ready to be deployed in practice, we believe this work is a sizeable step towards fully autonomous scene understanding.