Bridging the Deep Semantic Gap in SingularityNET

How SingularityNET is being designed to allow for more effective models of higher cognitive functions.

Introduction

In the midst of this AI spring, it may seem strange that the forecasts for the future of AI have grown cold. Warnings that the end is nigh have multiplied, and to some it may seem that winter is coming for AI.

After expectations were raised that Deep Neural Networks might make it possible to implement higher cognitive functions, such as reasoning or problem-solving, many AI researchers and developers now feel misled.

Although Deep Neural Networks have made some progress in that direction, it is far less impressive than the progress seen in object recognition.

Consider, for example, this simple question:

“Where is the giraffe?”

When the Deep Neural Networks (DNNs) used in state-of-the-art solutions for Visual Question Answering (VQA) are asked that question, the answer will most likely be “zoo,” even if the image shows a toy giraffe lying on the carpet of a child’s room or a T-shirt with a picture of a giraffe.

You can use this online demo to ask some questions of your own. As you read the answers, you may begin to realize why the warnings are increasing.

However, despite the attention and funding it has received, deep learning does not constitute the whole field of AI. In fact, researchers have tried modeling higher-level cognitive functions long before the emergence of deep learning.

These models of higher-level cognitive functions were, however, divorced from sensory data. Such a separation significantly limited the usability of these models, as it was difficult for them to acquire new knowledge automatically.

Cognitive Architectures were conceived as integrated systems, with Emergent Cognitive Architectures being essentially similar to Deep Neural Networks and Symbolic Cognitive Architectures inheriting the limitations of Good Old-Fashioned AI (GOFAI).

If we are to overcome the difficulties currently faced in modeling higher cognitive functions, bridging the symbolic/subsymbolic gap — also known as the Semantic Gap — is necessary for both Deep Neural Networks and Symbolic Cognitive Architectures.

Therefore, SingularityNET will be populated by different types of nodes that will not only facilitate the integration of Deep Neural Networks with Symbolic Cognitive Architectures but also help in overcoming their limitations.

But how exactly do we combine Deep Neural Networks with symbolic reasoning? While Hybrid Cognitive Architectures are trying to bridge this gap, the problem is far from being solved.

Consider the example of OpenCog. Although it has a hybrid architecture, its core (the most developed part) is much better at solving symbolic problems than processing raw sensory data.

Image Retrieval

When we try to recall the context in which we saw a specific image, our brain does not enumerate all seen images — or even their high-level features — but reduces the possible contexts by “deducing” the relevant places and events.

These types of queries can be naturally represented in OpenCog. Our recent paper described the first (simplified yet instructive) experiment with such semantic image retrieval.

In this experiment, each frame of a video was processed with the YOLOv2 object detector. The detected Bounding Boxes (BBs) were inserted into AtomSpace — OpenCog’s hypergraph knowledge base.

A node for each video frame was added together with nodes for each Bounding Box (BB). Then, these BB nodes were connected to the frame nodes by member links (indicating that the bounding boxes are part of a particular frame).

Lastly, each BB node was linked both to a concept node corresponding to its assigned label and to the coordinates of its corners.
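The population step above can be sketched with a toy, in-memory stand-in for AtomSpace. The class and method names here (`ToyAtomSpace`, `add_bounding_box`) are illustrative inventions, not OpenCog’s actual API:

```python
# Toy stand-in for populating AtomSpace with frame nodes and BB nodes.
# Frame membership and label/coordinate links become plain dict entries.

class ToyAtomSpace:
    def __init__(self):
        self.frames = {}  # frame_id -> list of bounding-box records

    def add_bounding_box(self, frame_id, label, corners):
        """Attach a detection (label + corner coordinates) to a frame node."""
        box = {"label": label, "corners": corners}
        self.frames.setdefault(frame_id, []).append(box)
        return box

space = ToyAtomSpace()
# YOLOv2-style detections: (label, (x1, y1, x2, y2))
space.add_bounding_box("frame_0001", "car", (50, 120, 300, 260))
space.add_bounding_box("frame_0001", "person", (90, 80, 140, 250))
```

In real OpenCog the same information is expressed as atoms and links in AtomSpace; the dictionary above only mirrors that structure.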

A fragment of the knowledge graph

Given the output of the object detector, it is quite straightforward to populate AtomSpace with such nodes and links. After that, it is possible to execute simple queries, such as: “find all frames containing a car and a helicopter.”

A corresponding query can be expressed using GetLink (or BindLink, if we want the results put into AtomSpace) which uses OpenCog’s Pattern Matcher to find a subgraph in AtomSpace that corresponds to a given pattern (i.e., a graph with unspecified, or variable, nodes).
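For illustration, the shape of this query can be written in plain Python, with AtomSpace and GetLink replaced by an ordinary dictionary. This is a sketch of the idea, not OpenCog’s Pattern Matcher:

```python
# Minimal sketch of the "find all frames containing a car and a helicopter"
# query; frame membership links are modeled as lists of concept labels.

frames = {
    "frame_0001": ["car", "person"],
    "frame_0002": ["car", "helicopter"],
    "frame_0003": ["helicopter"],
}

def find_frames(required_labels, frames):
    """Return frames whose member links cover all required concept labels."""
    return [f for f, labels in frames.items()
            if set(required_labels) <= set(labels)]

print(find_frames({"car", "helicopter"}, frames))  # ['frame_0002']
```

The Pattern Matcher generalizes this: the query is itself a (hyper)graph with variable nodes, matched against subgraphs of AtomSpace.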

A (hyper)graph of a query

To be able to execute more interesting queries, it is necessary to represent some relations between objects. The relative locations of the Bounding Boxes can be used for that purpose.

So, for example, queries like “a vase on a table” or “a painting with a person” can be formalized using easy-to-implement predicates such as Higher and Inside. As a result, we successfully retrieved frames like the ones shown below:
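Such predicates are straightforward to express over corner coordinates. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) with y growing downwards, as is usual for images (the exact thresholds and conventions in the experiment may differ):

```python
# Spatial predicates over bounding boxes in image coordinates.

def higher(bb_a, bb_b):
    """True if box A sits entirely above box B (y grows downwards)."""
    return bb_a[3] < bb_b[1]  # A's bottom edge is above B's top edge

def inside(bb_a, bb_b):
    """True if box A is fully contained within box B."""
    return (bb_b[0] <= bb_a[0] and bb_b[1] <= bb_a[1] and
            bb_a[2] <= bb_b[2] and bb_a[3] <= bb_b[3])

vase = (120, 40, 180, 100)
table = (80, 110, 400, 300)
painting = (50, 20, 250, 200)
person_in_painting = (100, 60, 160, 180)

print(higher(vase, table))                   # True: "a vase on a table"
print(inside(person_in_painting, painting))  # True: "a painting with a person"
```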

Found frame with a person in a car

Found frame with a person wearing a backpack

Benefits and Limitations: Comparison with Pure Deep Neural Network Solutions

Storing a sequence of video frames and searching them with neural networks would be both inefficient and difficult to implement. Hence, semantic image retrieval is not implemented purely on the basis of Deep Neural Networks.

The state-of-the-art solutions for Visual Question Answering (VQA) are end-to-end trainable Deep Neural Networks. Although such solutions are relatively successful, they are far from perfect. The Deep Neural Networks do not understand the images or the questions and can be easily fooled.

The giraffe example (mentioned at the beginning of this article) illustrates these limitations. The error is easy to explain: during training, the networks formed an association between the question, images of giraffes, and “zoo” as an answer.

Furthermore, while Deep Neural Networks are good at answering typical questions about salient objects, they have difficulty handling questions that require several steps of inference. This is because Deep Neural Networks are bad at generalization and reasoning. One example of such a question would be: “what is the color of the dress on the girl to the left of the man?”

OpenCog can easily perform several steps of reasoning, and it also allows for generalization. However, there are some obstacles in applying OpenCog to Visual Question Answering.

In OpenCog, although we can naturally map VQA questions to Pattern Matcher queries (which is a remarkable fact), the labels of the Bounding Boxes are not sufficient to produce the answers.

Consider, for example, this simple question: “what is the color of the car?”

If we do not have a link in the knowledge graph that connects some Bounding Box with both the concept node “car” and a concept node corresponding to some color, the Pattern Matcher will not be able to answer the question.

This limitation exists because of several reasons.

For one, object detectors do not provide additional labels such as colors or supposed actions for the detected Bounding Boxes. More importantly, some labels cannot be assigned on the basis of the content of a specific Bounding Box and need to be inferred from the context. An example of such a case would be: “this bear is most likely eating because there is a piece of meat nearby.”
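The bear-and-meat example can be caricatured as a single hand-written rule. The helper names, the distance threshold, and the rule itself are invented purely for illustration; real contextual inference would be done by a reasoning engine, not a hard-coded check:

```python
# Toy context-based label inference: the "eating" label is not in any single
# box but inferred from nearby detections via a hypothetical rule.

def center(bb):
    x1, y1, x2, y2 = bb
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def near(bb_a, bb_b, max_dist=150):
    """True if box centers are within max_dist pixels (arbitrary threshold)."""
    (ax, ay), (bx, by) = center(bb_a), center(bb_b)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist

def infer_action(label, bb, detections):
    """Infer an action label from surrounding detections (one toy rule)."""
    if label == "bear" and any(l == "meat" and near(bb, b)
                               for l, b in detections):
        return "eating"
    return None

detections = [("bear", (100, 100, 260, 300)), ("meat", (270, 250, 330, 310))]
print(infer_action("bear", (100, 100, 260, 300), detections))  # eating
```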

This need for inferring labels from context highlights why we need some kind of reasoning for VQA. So although the VQA questions can be represented as queries to the knowledge base, answering the queries would require diving deeper into sensory data.

Let us now analyze how successful Deep Neural Network based VQA models function. Such models extract Bounding Boxes and use the word embeddings from questions to allocate attention between the Bounding Boxes.

In these models, unlike our preliminary experiment discussed above, Bounding Boxes are not represented by their labels but by their higher-level features. These representations are then concatenated with the word embeddings from questions and passed to a classifier. However, that is not all: top-down attention may further require selecting the relevant Bounding Boxes and detecting missing non-salient objects (e.g., “Do all men have a beard?”).
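The attention step can be sketched in a few lines of plain Python. The dimensions and values are toy ones, and real models use learned projections rather than a raw dot product:

```python
# Bare-bones attention over bounding-box features: a question embedding
# scores each box, and a softmax turns the scores into weights.

import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(question_emb, box_features):
    """Return the attention-weighted sum of box feature vectors."""
    scores = [sum(q * f for q, f in zip(question_emb, feats))
              for feats in box_features]
    weights = softmax(scores)
    dim = len(box_features[0])
    return [sum(w * feats[d] for w, feats in zip(weights, box_features))
            for d in range(dim)]

question = [1.0, 0.0, 0.5]   # toy embedding of a question about a car
boxes = [[0.9, 0.1, 0.8],    # car-like feature vector
         [0.1, 0.9, 0.0]]    # unrelated box
attended = attend(question, boxes)
```

The resulting vector is what gets concatenated with the question representation and passed to the classifier.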

For the semantic gap to be bridged, therefore, a shallow integration is not enough, and cognitive feedback from reasoning to perception is needed. Furthermore, integration should be close enough to allow for end-to-end learning. In fact, it is this absence of a close integration that limits the usability of traditional AI systems. And it is for this reason that unsupervised language learning is needed for OpenCog.

It can also be said that Deep Neural Networks are less brittle than Good Old-Fashioned AI (GOFAI) systems. It may even be argued that they are too robust: they not only answer senseless questions very confidently but also, at times, ignore some words in the question, focus on the wrong Bounding Box, and give pointless answers (e.g., “the rock is drinking water”).

Balancing robustness and strictness will therefore require a deep integration of different kinds of systems.

How Can You Get Involved?

We hope this post clarified the necessity of bridging the deep semantic gap. In our future posts, we will further describe our plans to overcome this challenge. By doing so, we hope to allow for more effective models of higher cognitive functions to be deployed on SingularityNET.

Be sure to visit our Community Forum to chat about the research mentioned in this post. Over the coming weeks, we hope not just to provide you with more insider access to SingularityNET’s groundbreaking AI research but also to share with you the specifics of our development.

For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed about all of our developments.