A key challenge in developing artificial intelligence systems with the flexibility and efficiency of human cognition is giving them a similar ability - to reason about entities and their relations from unstructured data. Solving this would allow these systems to generalize to new combinations of entities, making infinite use of finite means.

Modern deep learning methods have made tremendous progress in solving problems directly from unstructured data, but they tend to do so without explicitly modelling the relations between objects.

In two new papers, we explore the ability of deep neural networks to perform complicated relational reasoning with unstructured data. In the first paper - A simple neural network module for relational reasoning - we describe a Relation Network (RN) and show that it can perform at superhuman levels on a challenging task. In the second paper - Visual Interaction Networks - we describe a general-purpose model that can predict the future state of a physical object based purely on visual observations.

A simple neural network module for relational reasoning

To explore the idea of relational reasoning more deeply and to test whether it is an ability that can be easily added to existing systems, we created a simple-to-use, plug-and-play RN module that can be added to existing neural network architectures. An RN-augmented network is able to take an unstructured input - say, an image or a series of sentences - and implicitly reason about the relations of objects contained within it.

For example, a network using an RN may be presented with a scene consisting of various shapes (spheres, cubes, etc.) sitting on a table. To work out the relations between them (e.g. the sphere is bigger than the cube), the network must take the unstructured stream of pixels from the image and figure out what counts as an object in the scene. The network is not explicitly told what counts as an object and must figure this out for itself. The representations of these objects are then grouped into pairs (e.g. the sphere and the cube) and passed through the RN module, which compares them to establish a “relation” (e.g. the sphere is bigger than the cube). These relations are not hardcoded, but must be learnt by the RN as it compares each possible pair. Finally, it sums all of these relations to produce a single output covering every pair of shapes in the scene.
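The pair-and-sum computation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the functions `g` and `f` stand in for the learned relation and readout networks, and the random weights and two-dimensional "object" vectors are placeholders chosen for the example.

```python
import numpy as np

def relation_network(objects, g, f):
    """Minimal Relation Network sketch: apply g to every ordered pair of
    object representations, sum the resulting relation vectors, then
    apply f to produce the final output."""
    pair_sum = sum(g(o_i, o_j) for o_i in objects for o_j in objects)
    return f(pair_sum)

# Toy instantiation: g concatenates a pair of objects and applies one
# ReLU layer; f is a linear readout. Weights are random placeholders,
# not trained parameters.
rng = np.random.default_rng(0)
W_g = rng.standard_normal((8, 4))   # pair of 2-dim objects (4 dims) -> 8-dim relation
W_f = rng.standard_normal((1, 8))   # summed relations -> scalar output

g = lambda a, b: np.maximum(W_g @ np.concatenate([a, b]), 0)
f = lambda x: W_f @ x

objects = [rng.standard_normal(2) for _ in range(3)]  # three "object" vectors
out = relation_network(objects, g, f)
```

Because the relations are summed, the output is invariant to the order in which the objects are presented - the module reasons about the set of pairs, not a sequence.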

We tested this model on several tasks including CLEVR - a visual question answering task designed to explicitly explore a model’s ability to perform different types of reasoning, such as counting, comparing, and querying. CLEVR consists of images like this: