One of the first ideas many people have once they get acquainted with deep learning and neural networks is, “what if we make a network of neural networks?”. That is a perfectly valid idea, one that is explored in many different ways. The most common place one finds this kind of approach is in automated machine learning (AutoML), where small pieces of mini-networks are recombined to find a complete neural architecture that ideally suits some machine learning problem. What I want to explore in this article is both very similar to AutoML, yet very different. My goal is to convey an idea, rather than present formulas. If you are interested in the details, you can find them in our paper “A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-Learning” (https://arxiv.org/abs/1905.08910).

So, what if we could design a network for computer vision where each node represents some object and the connections indicate what objects are part of another object? What if this network encounters a new object and simply adds it as a node to the network? It might look something like this:

A Network that grows with each new object it sees, by connecting it to other objects that make up its parts.

Sure, in some sense our current convolutional neural networks (CNNs) do something very similar internally. By retraining them, you’d probably be able to learn additional objects as well. But why retrain everything, if most of the network isn’t doing anything new? And at what point do we need to enlarge a neural network to fit more object types? The above image suggests that there might be an alternative approach.

Generative Grammar

Let’s take a step back. A large step back. Let’s look at the reverse scenario. What if we want to generate an image based on a single word, such as [House]? Well, we would take a look at what this [House] is made of. Probably a [Roof] and some [Ground Floor]. And what is the [Ground Floor] made of? Multiple [Wall]s, a [Door] and [Window]s. Everything is made of something. This process can be summarized in a generative grammar and we call the individual things we put in brackets a symbol.
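To make this a bit more concrete, here is a tiny sketch of such a grammar in Python. The symbol names follow the house example; the exact rule set is made up for illustration:

```python
import random

# Production rules of a toy generative grammar: each symbol maps to a
# list of possible right-hand sides (each alternative is one "rule").
GRAMMAR = {
    "House": [["Roof", "Ground Floor"]],
    "Roof": [["Triangle"]],
    "Ground Floor": [["Wall", "Door", "Window"]],
}

# Symbols that are not expanded any further.
TERMINALS = {"Triangle", "Wall", "Door", "Window"}

def expand(symbol):
    """Recursively expand a symbol down to terminal (primitive) symbols."""
    if symbol in TERMINALS:
        return [symbol]
    rule = random.choice(GRAMMAR[symbol])  # pick one production
    out = []
    for child in rule:
        out.extend(expand(child))
    return out
```

Starting from the symbol [House], repeated expansion bottoms out in the graphical primitives we will meet in a moment.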

A partial parse-tree of a generative grammar for a [House].

Well, that looks awfully close to the network we had before, just the other way around… And we’ll get to that. But first, we’ll explore how to get from [House] to a fully drawn picture of an actual house. There is a set of extremely primitive graphical elements from which we can draw anything. The most basic is obviously the pixel, but we won’t go that deep (although you could). Let’s go a bit higher level and assume you can draw the entire world using nothing but [Square]s, [Circle]s and [Triangle]s. What would our house look like then?

A [House] drawn from a starting symbol using triangles and squares.

Cool. Do we have control over this process? Sure! Each element can have attributes that control how it’s drawn. A [House] has a position, size and rotation which govern the rendering process. Think of them like the attributes an object has in a game engine. We might as well include more complex attributes, such as how old or yellow it is, i.e., adjectives.

Further, a [House] can have multiple floors. We can thus decide if a [House] should be drawn as [Roof][Ground Floor] or as [Roof][Floor][Ground Floor]. These different possibilities are called the rules of a grammar.

Primitives

Let’s focus on a [Square] for now. It has attributes of its own, but we still need a way to draw it to the screen. So, we introduce a rendering function that does this. This is quite easy for the primitives we chose. We’ll refer to this rendering function as the decoder, as it basically decodes the attributes into actual pixels.
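As a sketch, a decoder for a [Square] might look like this (a hypothetical, rotation-free toy renderer on a small pixel grid, not the one from the paper):

```python
def render_square(x, y, size, grid=16):
    """Decoder for a [Square]: turn attributes (position, size) into pixels.
    Returns a grid x grid image as a list of rows; 1 = filled, 0 = empty."""
    img = [[0] * grid for _ in range(grid)]
    half = size // 2
    # Fill a size x size block centered on (x, y), clipped to the grid.
    for r in range(max(0, y - half), min(grid, y + half)):
        for c in range(max(0, x - half), min(grid, x + half)):
            img[r][c] = 1
    return img
```

A real decoder would also handle rotation, color and so on, but the principle is the same: attributes in, pixels out.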

Obviously, each symbol higher up in the hierarchy also needs some sort of decoder that converts its own attributes into valid ones for those that follow. For example, the decoder for the rule [House] → [Roof][Ground Floor] would take the house’s position, located at its center, and calculate the center positions for the roof and the ground floor.
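A toy version of such a decoder, with illustrative attribute names (x, y, size) and made-up layout constants:

```python
def decode_house(house):
    """Decoder for the rule [House] -> [Roof][Ground Floor]: derive the
    children's attributes from the parent's. Offsets are illustrative."""
    x, y, size = house["x"], house["y"], house["size"]
    # Roof sits on top (smaller y in image coordinates), floor below.
    roof = {"x": x, "y": y - size / 4, "size": size / 2}
    floor = {"x": x, "y": y + size / 4, "size": size / 2}
    return roof, floor
```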

Very nice. We can draw images. The exact opposite of what we actually wanted to do… So, let’s flip the entire thing around.

“Flipping” the Parse-Tree around.

But we must flip a lot more things than just the connections. First, we must invert the decoder, so that it becomes an encoder.

Primitive Capsules

The encoder of a [Triangle] takes in the pixel values and tries to find the valid attributes for that triangle. Easy enough. Let’s just take some regression model and plug that in. We can even generate all the training data ourselves, as we know how to draw the triangle using the attributes.
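Here is a sketch of that data generation step, using a hypothetical toy triangle renderer (the real renderer and attribute ranges are of course richer):

```python
import random

def render_triangle(x, y, size, grid=16):
    """Toy renderer: an upward-pointing triangle with its tip at (x, y)."""
    img = [[0] * grid for _ in range(grid)]
    for i in range(size):            # i-th row below the tip
        half = i // 2                # rows get wider toward the base
        r = y + i
        for c in range(x - half, x + half + 1):
            if 0 <= r < grid and 0 <= c < grid:
                img[r][c] = 1
    return img

def make_training_set(n, grid=16):
    """Encoder training data comes for free: sample random attributes,
    render them, and learn the inverse mapping image -> attributes."""
    data = []
    for _ in range(n):
        attrs = {"x": random.randint(4, grid - 4),
                 "y": random.randint(0, grid - 8),
                 "size": random.randint(3, 6)}
        data.append((render_triangle(**attrs, grid=grid), attrs))
    return data
```

Any off-the-shelf regression model can then be fit on these (image, attributes) pairs.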

But that’s not good enough. What if it’s shown a picture of a circle instead of a triangle? The regression model doesn’t care; it will just produce weird attributes. We need a way to verify that the image is truly a triangle.

How about we take the attributes the regression model produced, render a triangle with them and check if the two images agree? If it was a triangle in the first place, we should have agreement, otherwise not!

So, we are encoding and then decoding… That’s just an auto-encoder! Well, not entirely, as we hand-crafted the decoder. But still… We then just add a small agreement function to it, which gives us a probability of how likely it is that the image is a triangle.
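A minimal agreement function could simply compare the input image with the re-rendered one pixel by pixel and turn the mismatch into a rough probability (a stand-in for whatever similarity measure is actually used):

```python
def agreement(original, reconstruction):
    """Agreement function: compare the input image with the re-rendered one
    and convert the pixel-wise mismatch into a rough probability."""
    total = len(original) * len(original[0])
    mismatches = sum(o != r
                     for row_o, row_r in zip(original, reconstruction)
                     for o, r in zip(row_o, row_r))
    return 1.0 - mismatches / total
```

A triangle re-rendered from its own attributes scores near 1.0; a circle fed through the triangle encoder reconstructs poorly and scores much lower.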

A look at the internals of a primitive capsule. An example of agreement (a [Triangle] capsule shown a triangle).

An example of disagreement (a [Triangle] capsule shown a circle).

We now have both the attributes and the probability from the agreement. Neat. Let’s call this inverted symbol with all its internals (encoder, decoder, agreement function, …) a capsule. More specifically, a primitive capsule, as it represents the graphical primitives. These primitive capsules follow a simple rule: if we can render it, we can de-render it… at least into something that produces the same image.

Semantic Capsules

For the next layers of capsules, things are a bit more complicated. These take in the primitives from the primitive capsules and check if they are the correct parts for the object.

A Semantic Capsule ([House]) activating based on the output of the Primitive Capsules ([Triangle] and [Square])

We don’t know what a decoder would look like that takes an object’s attributes and produces valid attributes for its parts, so we can’t generate training data and can’t train an encoder the way we did before… What if we pretend we know what the encoder looks like? Seems strange, right? But consider our [House], which is made up of a [Triangle] for the roof and a [Square] representing the ground floor. During a forward pass, we know all attributes of the square and triangle. The size of the entire house can then just be taken as the sum of the sizes of its parts. For the position and rotation of the house, we just take the mean over the square and the triangle. The same goes for all other attributes, such as color, where we can again take the mean. An old ground floor (1.0) with a new roof (0.0) just looks like a middle-aged house (0.5). This is the rough idea: we assume, at least at the beginning, that our encoder is similar to a mean function.
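As a sketch, such a mean-like initial encoder might look like this (the attribute names and the sum-for-size rule are illustrative):

```python
def mean_encoder(parts):
    """Initial semantic encoder: aggregate the parent's attributes from its
    parts -- sizes are summed, everything else is averaged."""
    n = len(parts)
    parent = {}
    for key in parts[0]:
        values = [p[key] for p in parts]
        parent[key] = sum(values) if key == "size" else sum(values) / n
    return parent
```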

Obviously, with such a general mean function, any configuration of [Triangle] and [Square] would make a valid [House]. We don’t want that. Let’s again create an encoder-decoder pair with an agreement function. This time, we need to train the decoder instead of the encoder, and we’ll train it on real houses. Now, each time a configuration of squares and triangles is passed to the [House] capsule, it encodes it and then attempts to recreate the house from those attributes. If the result somewhat agrees with the original, then it’s a house. Otherwise, it’s not.

A look at the internals of a semantic capsule.

Let’s call this type of capsule that checks part configurations, i.e., the agreement of semantics, a semantic capsule.

Routing

Now, I’ve mentioned that the symbols can have multiple different rules on how they are produced (A house with one or two floors, etc.). We need some way to do that with our capsules. What if we allow each capsule to have multiple such encoder-decoder pairs and check which one fits the best for the current image? Yeah, let’s do that! And let’s call each such possible pair a route. We’ll refer to the entire process as routing-by-agreement.
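A sketch of routing-by-agreement: try every route of a capsule and keep the one whose reconstruction agrees best with the input (the route structure here is a simplification of the real thing):

```python
def route_by_agreement(routes, inputs):
    """Routing-by-agreement: each route is an (encoder, decoder, agreement)
    triple; the route with the highest agreement wins."""
    best_route, best_attrs, best_prob = None, None, -1.0
    for name, (encode, decode, agree) in routes.items():
        attrs = encode(inputs)                  # inputs -> attributes
        prob = agree(inputs, decode(attrs))     # reconstruct and compare
        if prob > best_prob:
            best_route, best_attrs, best_prob = name, attrs, prob
    return best_route, best_attrs, best_prob
```

The capsule then reports the winning route’s attributes together with its agreement probability.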

Detailed look at the internals of a Capsule (Primitive and Semantic). Note the different routes highlighted in yellow.

(And here is why we call these things capsules: the idea of matching the inputs and finding the correct route is based on the capsule architecture introduced by Sabour, Frosst and Hinton. The capsules presented here, however, are a bit different, and we refer to them as neural-symbolic capsules to avoid confusion.)

Each of the presented Capsules is essentially a small container for neural networks.

Lifelong Meta-Learning

Next, you might have noticed that each of the capsules presented here is independent of every other capsule. And you are right! Training one capsule has no effect on the others. We can add and move around capsules as we please (checking that the attributes match of course). How about we let the entire network grow with each new object it encounters!

First, we need to find a way to determine when the network actually encounters something it does not understand. As everything is based on a grammar, this is quite easy. Every time the individual capsules activate and detect an object, there is some “root” object, i.e., one object that represents the entire image. In grammar terms, this is called an axiom. In our capsule network, this “root” does not need to be at the highest layer. If the image just contains a [Roof], the [House] capsule won’t activate, but [Roof] will still be the “root” that fully represents the image. We’ll refer to this observed “root” as an axiom as well.

However, multiple such axioms can activate in a single image. Take a [House] and a [Garage], which have both activated but do not share a common parent. We thus seem to have two axioms, which we don’t allow. If this happens, it means we encountered a new object or scene of which all these axioms are mere parts, just as [House] and [Garage] are parts of an [Estate].

Consider a different example and let’s again take our network that has learned how to detect [House]. It knows how to deal with a ground floor. But what if there are more floors? What if there are five additional [Floor] activations? Then the [House] capsule won’t activate. Sure, it can generate some attributes from the ground floor plus one of the five floors, but its agreement function will notice that the roof is way out of place. We must add a new capsule that acts as the new axiom for these activated capsules, such as an [Apartment] capsule.
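Detecting this situation is straightforward if we know each capsule’s parent: an axiom is simply an activated capsule with no activated parent, and more than one of them signals a missing capsule above. A sketch (the parent map is a simplification, since a capsule can in general have several parents):

```python
def find_axioms(activated, parent_of):
    """Return all activated capsules whose parent is not activated.
    More than one axiom means a new parent capsule must be added."""
    return [cap for cap in activated
            if parent_of.get(cap) not in activated]
```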

Slightly different example of capsules activating (blue), but there is a shared axiom missing (left). This is rectified by adding a new capsule that acts as the parent to all previously dangling capsules (right).

Note, however, that we do not halt at distinct objects, but also describe scenes. [Office], [Singapore Orchard Road], [Baseball Game]: all are perfectly valid. Imagine a picture with 10 houses. We would have 10 [House] activations. This, again, violates the “only one axiom” rule we have. So, we create a new capsule called [Town Road Scene].

Of course, this decision requires some creativity. Coming up with names for new capsules or attributes is still a human task. What the capsule network can do, however, is pose a question; the answer then drives the meta-learning. Like the following:

Q: “These floors and roof look like a house, but what is it?” ([House] did not activate even though multiple [Floor]s and [Roof] were present)

Now, this can trigger different responses from the human, each indicating a different meta-learning procedure.

A.1: “It’s a House.” (Meta-learning trains a new route for the existing [House] capsule)

A.2: “It’s an Apartment.” (Meta-learning trains a new [Apartment] capsule)

B.1: “It’s an old House.” (Meta-learning continues training the existing attribute “old” with new data for the [House] capsule)

B.2: “It’s a rich House.” (Meta-learning trains a new attribute “rich” for the existing [House] capsule)

If we collect enough of these decisions by a human, we can train a decision matrix for the capsule network. Then, once it has learned enough responses, the network can make these decisions on its own and refine its original question or even answer it itself!

An excerpt of a decision matrix. Each of those features says something about the current activations in the network, the details of which are unimportant at this point. For all features that evaluate to “True”, the values on the right are added up and the column with the highest value is equal to the decision (A.1 — B.2). Training this matrix simply means adding “1” to the column that was chosen by the human for the rows that evaluate to “True”.
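A toy version of such a decision matrix, with made-up feature and decision names, might look like this:

```python
class DecisionMatrix:
    """Toy decision matrix: rows are boolean features of the current
    activations, columns are the meta-learning decisions (A.1 -- B.2)."""

    def __init__(self, features, decisions):
        self.counts = {f: {d: 0 for d in decisions} for f in features}

    def train(self, true_features, chosen):
        # For every feature that evaluated to True, add 1 to the column
        # the human chose.
        for f in true_features:
            self.counts[f][chosen] += 1

    def decide(self, true_features):
        # Sum the columns over all True features; the highest column wins.
        scores = {}
        for f in true_features:
            for d, v in self.counts[f].items():
                scores[d] = scores.get(d, 0) + v
        return max(scores, key=scores.get)
```

After enough human answers, `decide` reproduces the human’s choices for similar activation patterns.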

But what do we train these newly added capsules, routes or attributes with? With the data that they just received? Sure, but that’s not enough, is it? How about we augment that data! This is actually quite easy in our case, as we have access to all the attributes. Moving the apartment around (adding to the position attributes of the apartment and the produced floor/roof symbols) doesn’t change the fact that it’s an apartment, and we can generate huge amounts of data doing just that. Neither does rotating or resizing change the apartment. We can do this to almost all attributes. If each floor has an old or yellow attribute, we can shift those all at once and augment the training set. For an old apartment we would assume that all floors and the roof are old, but it would still remain an apartment.
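As a sketch, augmenting a single labelled example by shifting “shiftable” attributes (position, rotation, …) of all parts by the same random amount might look like this:

```python
import random

def augment(example, shiftable=("x", "y", "rotation"), n=10):
    """Augment one labelled example (a list of part-attribute dicts).
    The same shift is applied to every part, so the relative layout --
    and hence the object's identity -- is preserved."""
    out = []
    for _ in range(n):
        shift = {k: random.uniform(-2.0, 2.0) for k in shiftable}
        out.append([{k: v + shift.get(k, 0.0) for k, v in part.items()}
                    for part in example])
    return out
```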

Obviously we will make mistakes with this rough augmentation, but it’s a start. Better than nothing, considering we only have one image to go by. Single-shot learning is difficult, even when we are cheating.

Now, next time the network sees an apartment, it will… maybe understand it. But we can always add more training data as the network encounters the same type of image, again using our augmentation strategy. This isn’t too intensive computationally, as we isolated each capsule and don’t need to retrain the rest of the network. And, slowly, it will understand… This entire meta-learning process is very much like teaching a toddler what objects are. And as with toddlers, it is subjective and depends very much on what the parent teaches it and in what order. Take, for example, the following two capsule networks, trained in different ways but reaching the same conclusion:

Two capsule networks meant to train for an Asteroids-like environment, each ending up with a different network configuration.

The difference is subtle, but as time goes on, those two networks will diverge further and further from each other.

We now have an entire network that can detect various objects, learn to detect new ones over time and even gives us attributes for them.

But we can also do some other neat things! We can

generate a semantic network of the scene! I.e., the parse-tree of the underlying grammar.

re-use the capsule network and simply expand on it! No need to retrain the old stuff and allow for some federated learning.

use the capsule network in reverse (as the original grammar) and generate images with it! It’s just an inverse graphics engine under the hood.

use the attributes to segment the image! We know all the sizes and positions.

generate basic descriptions of the image! After all, a capsule’s symbol is just a noun ([House], …), an attribute is just a preposition (position, rotation, …), an adjective (old, yellow, …) or a verb (explored in a later part), and their magnitude can be interpreted as an adverb (0.0 = not, 1.0 = very, …)

do simple style transfers! Using multiple rules/routes, a [Door] can be drawn as a square (abstract) or as an actual door (real) and this automatically transfers to any object that has a [Door] as its part.

do physics! Inverse-simulation anyone?

Conclusion

I’ve left out a lot of the details found in the paper, such as how the network and the grammar deal with occlusion, how exactly the meta-learning agent expands on the network, the “observation table” to avoid using multiple copies of the same capsule and much more. But I hope I was able to convey a general idea.

In the next part, I want to explore how physics works using this approach ( https://arxiv.org/abs/1905.09891) and how the capsule network evolves from an inverse-graphics-engine to an inverse-simulation-engine, hopefully one day ending with an inverse-game-engine.

And, yes. Most of the results so far are far from impressive, and it remains to be shown whether this can even be extended to real images instead of the MSPaint-level inputs it currently works on. Kosiorek et al. developed a different method based on similar principles called “Stacked Capsule Autoencoders” (https://arxiv.org/abs/1906.06818), which is definitely worth checking out if this sort of thing interests you!

My main hope is that this slightly different approach was both entertaining and perhaps gave you some ideas you could explore by incorporating more old-school symbolic methods.