Enabling Cognitive Visual Question Answering

Exploring a hybrid approach to visual question answering through deeper integration of OpenCog and a Vision Subsystem.

Introduction

Let us imagine a scenario in which Sophia, the social humanoid robot, is asked a simple question by someone:

“Sophia, is it raining?”

If Sophia says “yes” to the question, does she know why she gave that answer? In other words, how does Sophia answer the question?

The ability to answer questions about visual scenes, that is, to perform Visual Question Answering (VQA), comes naturally to humans. However, current state-of-the-art VQA models leave much to be desired.

One of the control systems used to operate Sophia is OpenCog, a cognitive architecture. OpenCog operates over a knowledge base represented as a hypergraph called Atomspace. For Sophia to accurately answer questions about visual scenes, the content of those scenes needs to be made accessible to OpenCog.

In an earlier research article, we discussed that the simplest way to achieve that would be to process images with a Deep Neural Network (DNN) and to insert the descriptions of the images into Atomspace. One example of such a DNN would be YOLO, which describes an image with a set of labeled bounding boxes.

Although such a simple approach can be useful for semantic image retrieval, it will not be sufficient for answering arbitrary visual questions.

For instance, the questions “is the child happy?” and “is the boy jumping?” can refer to the same bounding box. This makes it clear that one label per bounding box is not enough. Furthermore, it would be difficult to assign all possible relevant labels by the image processing system in advance. The insufficiency of such a simple approach becomes even more obvious when the questions require analyzing visual features and not labels. For example, it might be asked “are the people looking in the same direction?” or perhaps “are the chairs similar?”

It is obvious that in order to answer such questions, deeper integration of the cognitive architecture and a vision subsystem is required. The question, therefore, is how do we achieve such a deeper level of integration?

Grounded Predicates

OpenCog supports grounded predicates. Simply put, these are predicates with truth-values that are calculated by some external procedures, in particular by DNNs.

In our previous research article, we showed how OpenCog’s Pattern Matcher can be used to execute queries such as “find a video frame containing a man in a car”. If the corresponding labels are assigned correctly to the bounding boxes, the retrieval will be successful.

The Pattern Matcher can be used in the same way even if we do not assign labels to the bounding boxes and connect them to their corresponding Concept Nodes by Inheritance Links. That is because the Pattern Matcher can use the grounded predicates instead. In such an approach, rather than verifying the presence of the corresponding Inheritance Link, the Pattern Matcher will simply evaluate the corresponding grounded predicates.
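The contrast between a stored label and a grounded predicate can be sketched in a few lines of Python. This is only a toy illustration: the dictionary stands in for Atomspace, classifier_score stands in for a DNN, and every name here (boxes, has_label, recognize, the threshold) is a hypothetical choice, not the OpenCog API.

```python
# Toy stand-in for Atomspace content: one bounding box with a single stored
# label and a set of (pretend) DNN confidences. All names are hypothetical.
boxes = {
    "box1": {
        "label": "child",
        "features": {"child": 0.9, "boy": 0.85, "happy": 0.8, "jumping": 0.1},
    },
}


def classifier_score(features, concept):
    """Stand-in for a DNN: returns a confidence in [0, 1] for the concept."""
    return features.get(concept, 0.0)


def has_label(box_id, concept):
    """Label-based check, analogous to looking up a stored Inheritance Link."""
    return boxes[box_id]["label"] == concept


def recognize(box_id, concept, threshold=0.5):
    """Grounded-predicate check: the truth value is computed on demand."""
    strength = classifier_score(boxes[box_id]["features"], concept)
    return strength, strength >= threshold


# The stored label cannot answer "is the child happy?", the predicate can.
print(has_label("box1", "happy"))
print(recognize("box1", "happy"))
print(recognize("box1", "jumping"))
```

With a single stored label the box can only ever be a child, whereas the predicate can be asked about any concept the underlying model knows.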

To understand how this approach will work, let us first consider a simple “yes” or “no” question:

“Are the zebras fat?”

To parse the question, OpenCog utilizes the Link Grammar parser and the RelEx Dependency Relationship Extractor. The Link Grammar parser will produce something like this:

(S are.v (NP the zebras.n) (ADJP fat.a) ?)

or in the graphical form:

[Figure: Link Grammar parse tree]

The RelEx form will simply be _predadj(zebra, fat).

This can be further converted into a query in Atomese, OpenCog’s internal language for representing knowledge and thoughts. Here, we use its Scheme syntax:

(SatisfactionLink
  (TypedVariable (VariableNode "$X") (Type "ConceptNode"))
  (AndLink
    (InheritanceLink (VariableNode "$X") (ConceptNode "BoundingBox"))
    (EvaluationLink
      (GroundedPredicateNode "py:recognize")
      (ListLink (VariableNode "$X") (ConceptNode "zebra")))
    (EvaluationLink
      (GroundedPredicateNode "py:recognize")
      (ListLink (VariableNode "$X") (ConceptNode "fat")))))

In the query above, we use a single grounded predicate, recognize, implemented in Python (hence the py: prefix). We could have used separate grounded predicates such as zebra and fat directly, but passing Concept Nodes (which can be connected to other pieces of knowledge) to a single predicate is more flexible.
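To make the conversion step more concrete, here is a minimal sketch of how a RelEx relation such as _predadj(zebra, fat) could be turned into the text of such a query. It handles only this one relation type, and the helper names (parse_relex, recognize_clause, predadj_to_query) are our own illustrative assumptions rather than part of RelEx or OpenCog.

```python
import re


def parse_relex(relation):
    """Split a relation like '_predadj(zebra, fat)' into its name and arguments."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", relation)
    if match is None:
        raise ValueError(f"not a RelEx relation: {relation!r}")
    name = match.group(1)
    args = [arg.strip() for arg in match.group(2).split(",")]
    return name, args


def recognize_clause(variable, concept):
    """One EvaluationLink that calls the single grounded predicate."""
    return (
        '(EvaluationLink (GroundedPredicateNode "py:recognize") '
        f'(ListLink (VariableNode "{variable}") (ConceptNode "{concept}")))'
    )


def predadj_to_query(relation):
    """Build the yes/no query text for a _predadj(object, property) relation."""
    name, args = parse_relex(relation)
    if name != "_predadj" or len(args) != 2:
        raise ValueError("this sketch only handles _predadj with two arguments")
    clauses = ['(InheritanceLink (VariableNode "$X") (ConceptNode "BoundingBox"))']
    clauses += [recognize_clause("$X", concept) for concept in args]
    return (
        '(SatisfactionLink (TypedVariable (VariableNode "$X") (Type "ConceptNode")) '
        f'(AndLink {" ".join(clauses)}))'
    )


query = predadj_to_query("_predadj(zebra, fat)")
print(query)
```

Because each concept is passed as an argument to the same predicate, supporting a new word requires no new predicate code, only a new Concept Node.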

Lastly, the Pattern Matcher will automatically find a grounding for the variable $X, that is, a bounding box for which the grounded predicate recognize (implemented as a DNN) yields a high truth value for both zebra and fat.
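The search for a grounding can be imitated with a simple loop. The sketch below is not the Pattern Matcher: it assumes that per-box confidences are already available as dictionaries, that the conjunction is scored with a fuzzy AND (the minimum), and that a fixed threshold of 0.5 separates “yes” from “no”. All of these are illustrative assumptions.

```python
def recognize(features, concept):
    """Stand-in for the DNN-backed grounded predicate."""
    return features.get(concept, 0.0)


def satisfy(boxes, concepts, threshold=0.5):
    """Return the best grounding for $X and the strength of the conjunction.

    Every bounding box is tried as a candidate grounding; the conjunction of
    predicates is scored with min (an assumed fuzzy AND).
    """
    best_id, best_strength = None, 0.0
    for box_id, features in boxes.items():
        strength = min(recognize(features, concept) for concept in concepts)
        if strength > best_strength:
            best_id, best_strength = box_id, strength
    if best_strength < threshold:
        return None, best_strength
    return best_id, best_strength


# Hypothetical detections for one image.
boxes = {
    "b1": {"zebra": 0.9, "fat": 0.2},
    "b2": {"zebra": 0.8, "fat": 0.7},
    "b3": {"giraffe": 0.9},
}

print(satisfy(boxes, ["zebra", "fat"]))
print(satisfy(boxes, ["giraffe", "fat"]))
```

A side effect of this formulation is that a “yes” comes with a witness: the bounding box that satisfied every clause.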

We can also convert other questions into Pattern Matcher queries. For example, the question “What color is the plane?” can be automatically converted to the RelEx form:

_det(color, _$qVar);_obj(be, plane);_subj(be, color)

From this form, the following Pattern Matcher query can be constructed:

(BindLink
  (VariableList
    (TypedVariable (Variable "$B") (Type "ConceptNode"))
    (TypedVariable (Variable "$X") (Type "ConceptNode")))
  (AndLink
    (InheritanceLink (VariableNode "$B") (ConceptNode "BoundingBox"))
    (InheritanceLink (VariableNode "$X") (ConceptNode "color"))
    (EvaluationLink
      (GroundedPredicateNode "py:recognize")
      (ListLink (VariableNode "$B") (ConceptNode "plane")))
    (EvaluationLink
      (GroundedPredicateNode "py:recognize")
      (ListLink (VariableNode "$B") (VariableNode "$X"))))
  (ListLink (VariableNode "$B") (VariableNode "$X")))

Here, two variables need to be grounded. The first, $B, is the bounding box, while the second, $X, is the answer to the question and is passed to recognize. The Pattern Matcher will consider only those groundings of $X that are Concept Nodes inheriting from the concept color. Knowledge about such ontological relations should be present in Atomspace, but it can also be extracted directly from the training set.
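A rough Python analogue of this two-variable query is shown below. It is a sketch under assumptions: the ontology is a plain dictionary mapping a concept to its parent, confidences are precomputed per box, and a fixed threshold decides whether a predicate holds. None of these names come from OpenCog.

```python
def bind(boxes, ontology, target, category, threshold=0.5):
    """Find (box, answer) pairs: the box is a `target`, the answer is a `category`.

    Candidate answers are restricted to concepts whose parent in the toy
    ontology is `category`, mirroring the Inheritance Link constraint on $X.
    """
    candidates = [c for c, parent in ontology.items() if parent == category]
    results = []
    for box_id, features in boxes.items():
        if features.get(target, 0.0) < threshold:
            continue  # this box is not recognized as the target object
        for concept in candidates:
            if features.get(concept, 0.0) >= threshold:
                results.append((box_id, concept))
    return results


# Hypothetical ontology and detections.
ontology = {"red": "color", "white": "color", "plane": "vehicle"}
boxes = {
    "b1": {"plane": 0.9, "white": 0.8, "red": 0.1},
    "b2": {"cloud": 0.9, "white": 0.9},
}

print(bind(boxes, ontology, "plane", "color"))
```

Note that the white cloud in b2 is never offered as an answer, because it fails the clause that ties the answer to the plane’s bounding box.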

Comparison with pure DNN models

One might wonder why we do not use purely DNN-based solutions, which achieve state-of-the-art results on existing VQA benchmarks. The reason is simple: such solutions aim to solve specific tasks and are trained to exploit the biases of specific datasets. Such models, therefore, are not cognitive architectures.

More specifically, VQA models are trained to map pairs (image, question) to one-word answers and cannot perform anything else. In contrast, cognitive architectures answer visual questions not by a stand-alone model but rather by an integrative system that can also perform other actions. Such actions may depend on the answer or even influence it.

Perhaps an analysis of DNN models might help to reveal the possible improvements of the hybrid system that possesses both symbolic and subsymbolic components. Therefore, let us consider one of the recent state-of-the-art VQA models:

[Figure: A state-of-the-art DNN model for VQA]

This model relies on bounding boxes with a pre-trained high-level feature extractor. It uses standard word embeddings pre-trained on large text corpora that were not related to the VQA datasets. Furthermore, the model uses an “attention” mechanism that assigns weights to bounding boxes depending on their relevance to the question, which is represented by a trainable embedding computed by a GRU.

Another VQA model, rather than using the whole question embedding, used individual words to calculate attention weights for the bounding boxes.

In essence, such attention is similar to what the Pattern Matcher does when it executes DNN-based grounded predicates to check whether bounding boxes correspond to certain words. The difference is that the VQA models use word embeddings matched with visual features, instead of separate classifiers. Such an approach can be implemented in OpenCog; we would only need to implement the grounded predicate recognize in the following form:

[Figure: Implementation of the grounded predicate that matches visual features with the word embedding]

One interpretation of this network would be to consider it as a predicate of two arguments, that takes both the word embedding and visual features as input.

What we find interesting is that such a construct can be generalized with the use of HyperNets, which can take the word embedding as input and return a network that can then accept the visual features as the input — giving us the probability with which those features correspond to the given word.
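The HyperNet idea can be sketched structurally in plain Python. The example below is untrained and deliberately tiny: a single random linear map (our assumption, fixed by a seed) turns a word embedding into the weights and bias of a logistic unit, and that unit is returned as a one-argument predicate over visual features. The dimensions and names are hypothetical, and the printed probabilities carry no meaning; only the two-stage structure matters.

```python
import math
import random


def make_hypernet(embed_dim, feature_dim, seed=0):
    """Create a toy hypernetwork: word embedding -> predicate over features."""
    rng = random.Random(seed)
    # One row per generated parameter: feature_dim weights plus one bias.
    matrix = [
        [rng.uniform(-0.5, 0.5) for _ in range(embed_dim)]
        for _ in range(feature_dim + 1)
    ]

    def hypernet(word_embedding):
        # Generate the parameters of the word-specific network.
        params = [sum(w * e for w, e in zip(row, word_embedding)) for row in matrix]
        weights, bias = params[:-1], params[-1]

        def predicate(visual_features):
            # Logistic unit: probability that the features match the word.
            logit = sum(w * v for w, v in zip(weights, visual_features)) + bias
            return 1.0 / (1.0 + math.exp(-logit))

        return predicate

    return hypernet


hypernet = make_hypernet(embed_dim=4, feature_dim=3)
is_zebra = hypernet([0.2, -0.1, 0.7, 0.3])  # hypothetical embedding of "zebra"
print(is_zebra([1.0, 0.5, -0.2]))           # hypothetical box features
```

Seen this way, the two-argument predicate and the HyperNet differ mainly in when the word is consumed: together with the features, or once up front to produce a reusable word-specific recognizer.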

However, despite the similarities, there are some considerable differences between the two approaches. For instance, the attention mechanism in the DNN-based VQA models first assigns weights to the bounding boxes and then sums their weighted features. The summed features are fused with the question embedding and passed to the classifier. Such a classifier might be considered similar to a set of separate grounded predicates, each trained to recognize one concept. However, the classifier is not applied to the visual features of each bounding box. Instead, it is applied to the summed features of all the bounding boxes, filtered with the use of the question embedding.
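The pipeline just described can be condensed into a short sketch. It makes several simplifying assumptions that are ours, not the model’s: box features and the question embedding share one dimension, attention scores are plain dot products passed through a softmax, fusion is an element-wise product, and the classifier is a single linear layer given as a dictionary of weight rows.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_and_classify(box_features, question, classifier):
    """Attention-weighted sum of box features, fused with the question."""
    # 1. One attention weight per bounding box.
    weights = softmax([dot(features, question) for features in box_features])
    # 2. Weighted sum over all boxes: per-box identity is lost here.
    pooled = [
        sum(w * features[i] for w, features in zip(weights, box_features))
        for i in range(len(question))
    ]
    # 3. Fuse the pooled features with the question embedding.
    fused = [p * q for p, q in zip(pooled, question)]
    # 4. A single classifier over the fused vector picks the answer.
    scores = {answer: dot(row, fused) for answer, row in classifier.items()}
    return max(scores, key=scores.get), weights


# Hypothetical two-dimensional toy data.
box_features = [[1.0, 0.0], [0.0, 1.0]]
question = [2.0, 0.0]
classifier = {"red": [1.0, 0.0], "car": [0.0, 1.0]}

answer, weights = attend_and_classify(box_features, question, classifier)
print(answer, weights)
```

Step 2 is where this design departs from per-box grounded predicates: after pooling, the classifier can no longer say which bounding box supported its answer.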

In the construction outlined above, if the question is “What color is the car?”, only those features that are relevant to the question are kept, so that the classifier can answer red instead of car. Such a construction is somewhat strange. To illustrate why, let us compare the questions:

“Is the car red?” and “What color is the car?”

When presented with the first question, the attention mechanism is responsible for selecting the bounding boxes corresponding to the notion of red(car), whereas for the second question, it is the classifier that recognizes the selected bounding boxes as red. In contrast, OpenCog’s Pattern Matcher performs much more similar operations for the two questions.

More importantly, not all questions have such a straightforward formalization. For example, the question “Is it raining?” does not mean that there is a bounding box that can be characterized as both it and raining. Rather, the model should look for umbrellas and clouds.

We could also recognize any bounding box as it and recognize umbrellas as raining. That is actually what DNN-based solutions do as well; they simply associate the positive answer to this question with the presence of umbrellas. However, the DNN-based solutions can, in principle, account for both clouds and umbrellas by summing up the visual features of the corresponding bounding boxes.

Similarly, questions like “Is he eating?” or “Is the room dark?” require the DNN model to attend to several bounding boxes. Although the weighted summation is a very simplified way to do that, it somehow works.

The problem, of course, is that the DNN model does not know why it answers “yes” to the question “Is he eating?”

The model simply matches the summed visual features with the holistic question embedding. As noted above, such holistic matching works reasonably well, but it is far from perfect. If the model encounters an unexpected question, it can give a nonsensical answer.

For example, when shown the following image and asked “What is tree doing?”, the demo DNN model at this site answers eating with 67% certainty and walking with 13% certainty.

[Figure: The image to which the model replies that the “tree” is “eating” and “walking.”]

Therefore, we believe that a hybrid approach that works at both the feature and the symbolic levels can do better, although such an approach is more challenging to implement.

For example, OpenCog contains a module called Pattern Miner, which is intended for finding patterns in Atomspace. In particular, we believe that it might be able to find that the concept of raining is associated with the concepts umbrella and cloud, while the concept of eating is associated with the concepts animal or human and food.
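To give a feel for the kind of association involved, here is a toy co-occurrence analysis in Python. It is not the Pattern Miner algorithm; it merely computes a lift score (how much more often a concept appears in scenes containing the target than in scenes overall) over a handful of hypothetical, hand-written scene descriptions. The data, the function name and the cut-off of 1.5 are all assumptions made for illustration.

```python
from collections import Counter


def associated_concepts(scenes, target, min_lift=1.5):
    """Return concepts that co-occur with `target` more often than chance."""
    with_target = [scene for scene in scenes if target in scene]
    if not with_target:
        return {}
    overall = Counter(concept for scene in scenes for concept in scene)
    joint = Counter(
        concept for scene in with_target for concept in scene if concept != target
    )
    result = {}
    for concept, count in joint.items():
        # lift = P(concept | target) / P(concept)
        lift = (count * len(scenes)) / (len(with_target) * overall[concept])
        if lift >= min_lift:
            result[concept] = lift
    return result


# Hypothetical scene annotations (each scene is a set of concepts).
scenes = [
    {"raining", "umbrella", "cloud", "person"},
    {"raining", "umbrella", "person"},
    {"person", "tree"},
    {"cloud", "person", "raining"},
    {"person", "car"},
    {"tree", "car"},
]

print(sorted(associated_concepts(scenes, "raining")))
```

In this toy data, person appears in every rainy scene but is filtered out, because people are almost as common in scenes without rain.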

If that is the case, then hopefully a future model of Sophia operated through SingularityNET (which will leverage the OpenCog framework) will be able to explain why she thinks it is raining when asked “Is it raining?” Unlike the models that match a mixed visual feature vector with the question embedding, a SingularityNET-operated Sophia would hopefully be able to find that the concept raining is associated with several other concepts such as umbrella, clouds and water drops.

What are the challenges?

Of course, there are still several difficult challenges that need to be overcome. For instance, mapping from natural language questions to the form understandable by OpenCog needs to be learned. We believe that SingularityNET’s Unsupervised Language Learning project can contribute to solving this problem.

We also believe that a deepening of symbolic-subsymbolic integration is necessary. More specifically, training signals should be propagated through symbolic inference traces — not only to reach the DNNs of the vision subsystem (which is trained separately now) but also to modify the truth-values of the nodes and links constituting the knowledge base.
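One possible reading of this idea, given purely as a speculative sketch, treats the truth value of a conjunctive query as the product of the truth values of its clauses, so that the error on the final answer can be differentiated with respect to each clause. Below, each logit stands in for the output of one grounded predicate in the inference trace; the product as a differentiable AND, the log loss and the learning rate are all our assumptions, not a description of how OpenCog is trained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_step(logits, answer_is_yes, learning_rate=0.5):
    """One gradient step through a conjunction of predicate outputs.

    truth = product of p_i, with p_i = sigmoid(logit_i).
    Returns the truth value before the update and the updated logits.
    """
    probs = [sigmoid(z) for z in logits]
    truth = math.prod(probs)
    if answer_is_yes:
        # loss = -log(truth), so d loss / d logit_i = -(1 - p_i)
        grads = [-(1.0 - p) for p in probs]
    else:
        # loss = -log(1 - truth), so d loss / d logit_i = truth * (1 - p_i) / (1 - truth)
        grads = [truth * (1.0 - p) / (1.0 - truth) for p in probs]
    updated = [z - learning_rate * g for z, g in zip(logits, grads)]
    return truth, updated


# Two clauses of a query, e.g. recognize($X, zebra) and recognize($X, fat).
logits = [0.0, 1.0]
before, logits_after = train_step(logits, answer_is_yes=True)
after, _ = train_step(logits_after, answer_is_yes=True)
print(before, after)
```

The same gradients that adjust the predicate outputs could, in principle, also be used to adjust truth values stored in the knowledge base, which is the kind of joint update this paragraph has in mind.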

Lastly, further development of the vision subsystem is also required. For instance, to robustly recognize that “the boy is holding the cup” one might need to perform a semantic segmentation or perhaps even a 3D reconstruction. It would be quite the challenge to make such finer image details accessible from the symbolic level. One solution to overcome that challenge might be to construct a generative model of a scene that spans both the symbolic and subsymbolic levels. Such a generative model might also be necessary to deal with other tasks such as text-to-image transformation.

How can you get involved?

We will report on our progress in overcoming these challenges in future research articles. You can visit our Community Forum to chat about the research mentioned in this article.

Over the coming weeks, we hope to not only provide you with more insider access to SingularityNET’s groundbreaking AI research but also to share with you the specifics of our development.

For any additional information, please refer to our roadmaps and subscribe to our newsletter to stay informed regarding all of our developments.