The filter modules in the example above accept the attention maps produced by previous modules, use them to mask the input image (represented by its high-level ResNet-101 features), and produce new attention maps. At the end of the program, there is typically a query module (which also takes feature and attention maps as input, but produces a new feature map as output), followed by a classifier.
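To make this concrete, here is a minimal PyTorch-style sketch of such a filter module; the architecture and names are our own assumptions for illustration, not the exact modules from the original implementation:

import torch
import torch.nn as nn

class FilterModule(nn.Module):
    """Hypothetical filter module: masks the image features with an incoming
    attention map and predicts a new single-channel attention map."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.conv1 = nn.Conv2d(feat_dim, 128, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(128, 1, kernel_size=1)

    def forward(self, features, attention):
        # features: (B, C, H, W) ResNet-101 features; attention: (B, 1, H, W)
        masked = features * attention              # spatial masking
        hidden = torch.relu(self.conv1(masked))
        return torch.sigmoid(self.conv2(hidden))   # new attention map in [0, 1]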

The programs are run step by step by an execution engine written in Python. In essence, this is an imperative domain-specific language into which a simple model translates questions, without any use of explicit knowledge about visual concepts and their relations (even though mere classifiers can benefit from knowledge graphs). Does this correspond to the “underlying reasoning processes”?

Therefore, it is tempting to replace the imperative domain-specific execution engine with a general-purpose declarative reasoning system over a knowledge base, making the whole pipeline even more transparent and interpretable.

Declarativization of Visual Reasoning

As we discussed in one of our previous posts, cognitive architectures can be used to integrate the DNN image analysis capabilities with general-purpose reasoning engines over knowledge bases to enable cognitive VQA.

For simple questions, the imperative question-answering programs can be easily converted to declarative queries.

Consider, for instance, the question “What color is the cylinder?” The corresponding program will look like:

00 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
01 = {'inputs': [0], 'function': 'filter_shape', 'value_inputs': ['cylinder']}
02 = {'inputs': [1], 'function': 'query_color', 'value_inputs': []}

The query_color module takes as input the image features masked by the attention map produced by the filter_shape[cylinder] module and outputs a new feature map, which is then passed to the final classifier. The multinomial classifier calculates the probabilities of all possible answers.
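To emphasize how imperative this process is, here is a minimal sketch of such an execution engine; the module registry and calling convention are our assumptions:

def run_program(program, image_features, modules):
    # program: a list of steps like those above; modules: a registry mapping
    # (function, value_inputs) to a (neural) module, e.g. a FilterModule.
    outputs = []
    for step in program:
        inputs = [outputs[i] for i in step['inputs']]
        module = modules[step['function'], tuple(step['value_inputs'])]
        outputs.append(module(image_features, *inputs))
    return outputs[-1]  # passed on to the final answer classifier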

However, we can explicitly introduce a query variable $X, replace the query module with the corresponding filter module filter_color[$X], and ask the reasoning engine to find a value of $X for which the final attention map is non-empty.
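A naive version of this search can be sketched in a few lines of Python; the candidate list, the filters registry, and the non-emptiness test are assumptions made for illustration:

COLORS = ['gray', 'red', 'blue', 'green', 'brown', 'purple', 'cyan', 'yellow']

def query_color_declaratively(image_features, attention, filters, threshold=0.5):
    # Try each grounding of $X; keep those with a non-empty final attention map.
    answers = []
    for color in COLORS:
        new_attention = filters['filter_color', color](image_features, attention)
        if new_attention.max() > threshold:
            answers.append(color)
    return answers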

Therefore, the question “What color is the cylinder?” can be represented declaratively using OpenCog’s Atomese language:

(AndLink
  (InheritanceLink
    (VariableNode "$X")
    (ConceptNode "color"))
  (EvaluationLink
    (GroundedPredicateNode "py:filter")
    (ConceptNode "cylinder"))
  (EvaluationLink
    (GroundedPredicateNode "py:filter")
    (VariableNode "$X")))

In the example above, the predicates are grounded in deep neural networks, which can be the same filter modules as in neural module networks, processing high-level image features (omitted here for brevity).
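Such a grounding might be wired up roughly as follows; the Python-side signature is an assumption, not OpenCog's exact binding API:

FILTER_MODULES = {}  # populated elsewhere with a trained filter module per concept

def filter(concept_name, image_features, attention):
    # Hypothetical grounding for "py:filter" (the name mirrors the predicate
    # and shadows Python's builtin): run the neural filter module for the
    # given concept and reduce its attention map to a degree of truth.
    new_attention = FILTER_MODULES[concept_name](image_features, attention)
    return float(new_attention.max())  # truth degree in [0, 1]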

OpenCog’s reasoning system (the Pattern Matcher or Probabilistic Logic Networks, PLN) will try to find a grounding for the variable $X for which the whole conjunction is true. This is why we don’t need both filter-attribute and query-attribute modules: a query-attribute module extended by a classifier is just an imperative way to determine the appropriate answer. The knowledge about ontological relations between concepts (e.g., that “red” is a “color”) is explicitly utilized through the InheritanceLink.

In one of our previous posts, we used models that deal with bounding boxes. OpenCog’s reasoning system had to find bounding boxes and apply grounded predicates to them to satisfy the declarative query. Consequently, grounded predicates produced just one truth value per bounding box, and truth values of logical expressions over these predicates could easily be calculated with PLN.
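For instance, the truth value of “red cube” for a single bounding box reduces to ordinary fuzzy logic over scalars; the min-based conjunction and the predicate names here are illustrative assumptions:

def fuzzy_and(*truth_values):
    # One scalar truth value per bounding box makes conjunction trivial;
    # min() is one common fuzzy-AND choice (PLN uses richer truth values).
    return min(truth_values)

# e.g., with hypothetical DNN-backed predicates over one box:
# truth = fuzzy_and(is_red(box_features), is_cube(box_features))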

However, most models for CLEVR deal with dense attention maps. This leads to some interesting questions: should the declarative reasoning system attend to each element of a dense attention map? Should it necessarily deal with extracted objects? Or should it operate on whole attention maps as tensor truth values?

Each of these solutions is possible, with its own strengths and weaknesses, and possibly they should somehow be combined. In humans, we observe that we can deliberately attend to individual pixels in images, but we don’t usually reason consciously about each pixel.

For the sake of simplicity, however, let’s assume that we choose to deal with attention maps as a whole. In such a scenario, we will be able to see some additional differences between the use of module networks in imperative and declarative fashions. Consider the program for a slightly more complicated question like “What color is the large cylinder?”


The first filter, size[large], will take the image feature maps and the default attention map (covering the whole image) and produce an attention map that highlights large objects.

The next filter, shape[cylinder], will take the image features masked by the attention map produced by the previous filter and output a new attention map, which can then be used to mask the image features once again; these features will be passed to the query_color module, which will produce new features fed to the final classifier. Also, note that the result can differ for different orders of the filters (especially in the case of two attribute filters), which seems somewhat strange.

In contrast, a less imperative form of this program would be the same as above, but with the additional conjunct (EvaluationLink (GroundedPredicateNode "py:filter") (ConceptNode "large")). That is, the attention maps would be produced by all filters independently and then And’ed together (which can be done pixel-wise).
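The contrast between the two styles can be sketched as follows, reusing the hypothetical filters registry and the (features, attention) calling convention assumed above; min is just one possible pixel-wise conjunction:

import torch

# Imperative chaining: the second filter sees the map produced by the first.
att_large = filters['filter_size', 'large'](image_features, default_attention)
att_seq = filters['filter_shape', 'cylinder'](image_features, att_large)

# Declarative conjunction: each filter runs on the default attention map
# independently, and the resulting maps are And'ed pixel-wise.
att_cyl = filters['filter_shape', 'cylinder'](image_features, default_attention)
att_and = torch.minimum(att_large, att_cyl)  # order-independent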

The query attribute modules can appear in the middle of the program rather than at the end. For example, in the case of the question “Is the small gray object made of the same material as the big cube?”, there will be two query_material modules executed after the consecutive filtering of small+gray and large+cube. They will produce two feature maps, which will then be passed to the comparison module:

00 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
01 = {'inputs': [0], 'function': 'filter_size', 'value_inputs': ['small']}
02 = {'inputs': [1], 'function': 'filter_color', 'value_inputs': ['gray']}
03 = {'inputs': [2], 'function': 'query_material', 'value_inputs': []}
04 = {'inputs': [], 'function': 'scene', 'value_inputs': []}
05 = {'inputs': [4], 'function': 'filter_size', 'value_inputs': ['large']}
06 = {'inputs': [5], 'function': 'filter_shape', 'value_inputs': ['cube']}
07 = {'inputs': [6], 'function': 'query_material', 'value_inputs': []}
08 = {'inputs': [3, 7], 'function': 'equal_material', 'value_inputs': []}
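The comparison module itself might look roughly like this; the architecture is our own sketch under the same PyTorch-style assumptions as above:

import torch
import torch.nn as nn

class EqualModule(nn.Module):
    """Hypothetical comparison module: takes the two feature maps produced by
    the query_material steps and classifies the pair as equal or not."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Conv2d(2 * feat_dim, 128, kernel_size=1)
        self.head = nn.Linear(128, 2)  # logits over {yes, no}

    def forward(self, feat_a, feat_b):
        merged = torch.relu(self.conv(torch.cat([feat_a, feat_b], dim=1)))
        pooled = merged.mean(dim=(2, 3))  # global average pooling
        return self.head(pooled)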

This program can be converted into a declarative form. For example:

(AndLink
  (Inheritance (Variable "$X") (Concept "material"))
  (Inheritance (Variable "$Y") (Concept "material"))
  (EqualLink (Variable "$X") (Variable "$Y"))
  (AndLink
    (Evaluation (GroundedPredicate "py:filter") (Concept "small"))
    (Evaluation (GroundedPredicate "py:filter") (Concept "gray"))
    (Evaluation (GroundedPredicate "py:filter") (Variable "$X")))
  (AndLink
    (Evaluation (GroundedPredicate "py:filter") (Concept "large"))
    (Evaluation (GroundedPredicate "py:filter") (Concept "cube"))
    (Evaluation (GroundedPredicate "py:filter") (Variable "$Y"))))

Since there is no longer any need to pass attention maps from filter to filter, this expression could even be simplified to use only one variable.

Here, the inner AndLinks deal with tensor truth values, while the outer AndLink is a traditional PLN conjunction, so EqualLink will compare concepts rather than feature maps (unlike the equal_material module).
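One way to mix the two levels is to reduce each tensor truth value to a scalar before the outer conjunction; the max-based reduction and min-based AND below are our assumptions:

import torch

def tensor_and(*attention_maps):
    # Inner AndLink: pixel-wise fuzzy conjunction of attention maps.
    result = attention_maps[0]
    for att in attention_maps[1:]:
        result = torch.minimum(result, att)
    return result

def truth_of(attention_map):
    # Collapse a tensor truth value into a scalar for the outer PLN conjunction.
    return attention_map.max().item()

# Outer level: scalar conjunction plus a purely symbolic equality test, e.g.
# truth = min(truth_of(map_x), truth_of(map_y)) if x_value == y_value else 0.0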

Thus, this is not only a computationally different representation of the question but also a different formalization of it. A question about the similarity of colors can imply either a positive or a negative answer for two green objects of different shades, and humans can answer such a question by focusing either on visual features or on the names of colors.

Is a declarative formalization of questions more natural?

In terms of what we want to find, this way is definitely better. However, it might not be better at describing how humans actually find the answer.

For instance, consider the question “What color is the tiny matte block left of the blue block?”

In neural module networks, the notion of “left” is represented as a neural network that also accepts the feature maps masked by the attention output of the previous module and produces another attention map: