Hetero-Associative Memories for Non Experts: How “Stories” are memorized with Image-associations

Think about walking along a beach. The radio of a small kiosk-bar is turned-on and a local DJ announces an 80’s song. Immediately, the image of a car comes to your mind. It’s your first car, a second-hand blue spider. While listening to the same song, you drove your girlfriend to the beach, about 25 years ago. It was the first time you made love to her. Now imagine a small animal (normally a prey, like a rodent) roaming around the forest and looking for food. A sound is suddenly heard and the rodent rises its head. Is it the pounding of water or a lion’s roar? We can skip the answer right now. Let’s only think about the ways an animal memory can work.

Computer science drove us to think that memories must always be loss-less, efficient and organized like structured repositories. They can be split into standard-size slots and every element can be stored into one or more slots. Once done, it’s enough to save two references: a pointer (a positional number, a couple of coordinates, or any other element) and the number of slots. For example, the book “War and Peace” (let’s suppose its length is 1000 units) can be stored in a memory at the position 684, so, the reference couple is (684, 1000). When necessary, it’s enough to retrieve 1000 units starting from the position 684 and every single, exact word written by Tolstoj will appear in front of your eyes.

A RAM (Random Access Memory) works in this way (as well as Hard Disks and other similar supports). Is this an efficient storage strategy? Of course, it is, and every programmer knows how to work with this kind of memory. A variable has a type (which also determines its size). An array has a type and a dimensional structure. In both cases, using names and indexes, it’s possible to access to every element at a very high speed.

Now, let’s think about the small prey again. Is its memory structured in this way? Let’s suppose it is. The process can be described with the following steps: the sound is produced, the pressure waves propagate at about 330 m/s and arrive at the ears. A complex mechanism transforms the sound into an electric signal which is sent to the brain where a series of transformations should drive the recognition. If the memory were structured like a book-shelf, a scanning process should be performed, comparing each pattern with the new one. The most similar element has to be found as soon as possible and the consequent action has to be chosen. The worst-case algorithm for a memory with n locations has this structure:

For each memory location i: If Memory[i] == Element: return Element Else Continue



The computational cost is O(n), which is not so bad but requires a maximum number of n comparisons. A better solution is based on the concept of hash or signature (for further information see Hash functions). Each element is associated with a (unique) hash, which can be an integer number used as an index for an over-complete array (N >> n). The computational cost for a good hash function is constant O(1) and so the retrieval phase (because the hash is normally almost unique), however, another problem arises. A RAM-like memory needs exact locations or a complete scan, but instead of direct comparisons, it’s possible to use a similarity measure (that introduces some fuzziness, allowing to match noisy patterns). With a hash function, the computational cost is dramatically reduced, but the similarity becomes almost impossible because the algorithms are studied in order to generate complete different hashes even for very small changes in the input.

At the end of the day, this kind of memory has too many problems and a natural question is: how can animals manage them all? Indeed animal brains avoid these problems completely. Their memory isn’t a RAM and all the pieces of information are stored in a completely different way. Without considering all the distinctions introduced by cognitive psychologists (short-term, long-term, work memory, and so on), we can say that an input pattern A, after some processing steps is transformed into another pattern B:

Normally B has the same abilities of an original stimulus. This means that a correctly recognized roar elicits the same response of the sight of an actual lion, allowing a sort of prediction or action anticipation. Moreover, if A is partially corrupted with respect to the original version (here we’re assuming Gaussian noise), the function is able to denoise its output:

This approach is called associative and it has been studied by several researchers (like [1] and [2]) in the fields of computer science and computational neurosciences. Many models (sometimes completely different in their mathematical formulation) have been designed and engineered (like BAMs, SOMs, and Hopfield Networks). However, their inner logic is always the same: a set of similar patterns (in terms of coarse-grained/fine-grained features) must elicit a similar response and the inference time must as shortest as possible. If you want to briefly understand how these some of these models work, you can check these previous articles:

In order to summarize this idea, you can consider the following figure:

The blue line is the representations of a memory-surface. At time t=0, nothing has been stored and the line is straight. After some experiences, two basins appear. With some fantasy, if the image a new cat is sent close to the basin, it will fall down till reaching the minimum point, where the concept of “cat” is stored. The same happens for the category of “trucks” and so forth for any other semantic element associated with a specific perception. Even if this approach is based on the concept of energy and needs a dynamic evolution, it can be elegantly employed to explain the difference between random access and associate one. At the same time, starting from a basin (which is a minimum in the memory-surface), it’s possible to retrieve a family of patterns and their common features. This is what Immanuel Kant defined as figurative synthesis and represents one of the most brilliant results allowed by the neocortex.

In fact, if somebody asks a person to think about a cat (assuming that this concept is not too familiar), no specific images will be retrieved. On the contrary, a generic, feature-based representation is evoked and adapted to any possible instance belonging to the same family. To express this concept in a more concise way, we can say that we can recover whole concepts through the collection of common features and, if necessary, we can match these features with a real instance.

For examples, there are some dogs that are not very dissimilar to some cats and it’s natural asking: it is a dog or a cat? In this case, the set of features has a partial overlap and we need to collect further pieces of information to reach a final decision. At the same time, if a cat is hidden behind a curtain and someone invites a friend to imagine it, all the possible features belonging to the concept “cat” will be recovered to allow guessing color, breed and so on. Try yourself. Maybe the friend knows that fluffy animals are preferred, so he/she is driven to create the model on a Persian. However, after a few seconds, a set of attributes is ready to be communicated.

Surprisingly, when the curtain is opened, a white bunny appears. In this case, the process is a little bit more complex because the person trusted his/her friend and implicitly assigned a very high priority to all pieces of information (also previously collected). In terms of probability, we say that the prior distribution was peaked around the concept of “cat”, avoiding spurious features to corrupt the mental model. (In the previous example, there were probably two smaller peaks around the concept of “cat” and “dog”, so the model could be partially noisy, allowing more freedom of choice).

When the image appears, almost none of the main predicted features matched with the bunny, driving the brain to reset its belief (not immediately because the prior keeps a minimum doubt). Luckily, this person has seen many rabbits before that moment and even, after all the wrong indications, his/her associative memories can rapidly recover a new concept, allowing the final decision that the animal isn’t a cat. A hard-drive had to go back and forth many times, slowing down the process in a dramatic way.

A different approach based on Hetero-Encoders

An hetero-encoder is structurally identical to an auto-encoder (see Lossy Image Autoencoders). The only difference is the association: the latter trains a model in order to obtain:

While an hetero-encoder trains a model that is able to perform the association:

The source code is reported in this GIST and at the end of the article. It’s based on Tensorflow (Python 3.5) and it’s split in an encoding part, which is a small convolutional network followed by a dense layer. Its role is to transform the input (Batch size × 32 × 32 × 3) into a feature vector that can be fed into the decoder. This one processes the feature vector with a couple of dense layers and performs a deconvolution (transpose convolution) to build the output (Batch size × 32 × 32 × 3).

The model is trained using an L2 loss function computed on the difference between expected and predicted output. An extra L1 loss can be added to the feature vector to increase the sparsity. The training process takes a few minutes with GPU support and 500 epochs.

The model itself is not complex, nor based on rocket science, but a few considerations are useful to understand why this approach is really important:

The model is implicitly cumulative. In other words, the function g(•) works with all the input images, transforming them into the corresponding output.

No If statements are present. In the common algorithmic logic, g(•) should check for the input image and select the right transformation. On the contrary, a neural model can make this choice implicitly.

All pattern transformations are stored in the parameter set whose plasticity allows a continuous training

A noisy version of the input elicits a response whose L2 loss with the original one is minimized. Increasing the complexity of both encoder and decoder, it’s possible to further increase the noise robustness. This is a fundamental concept because it’s almost impossible to have two identical perceptions.

In our “experiment”, the destination dataset is a shuffled version of the original one, therefore different periodic sequences are possible. I’m going to show this with some fictional fantasy. Each sequence has the fixed length of 20 pictures (19 associations). The first picture is freely chosen, while all the others are generated with a chain process.

Experience bricks

I’ve randomly selected (with fixed seed) 50 Cifar-10 pictures (through Keras dataset loading function) as building blocks for the hetero-associative memory. Unfortunately, the categories are quite weird (frogs, ostriches, deer, together with cars, planes, trucks, and, obviously, lovely cats), but they allow using some fantasy in recreating the possible outline. The picture collage is shown in the following figure:

The original sequence (X_source) is then shuffled and turned into a destination sequence (X_dest). In this way, each original image will be always associated with another one belonging to the same group and different periodic sequences can be discovered.

A few “stories”

The user can enlarge the dataset and experiment different combinations. In this case, with 50 samples I’ve discovered a few interesting sequences, that I’ve called “stories”.

Period = 2

John is looking at a boat, while a noisy truck behind him drew his attention to a concrete wall that was being built behind him.

Period = 7

While walking in a pet-shop, John saw an exotic frog and suddenly remembered that he needed to buy some food for his cat. The favorite brand is Acme Corp. and he saw several times their truck. Meanwhile, the frog croaks and he turns his head towards another terrarium. The color of the sand and the artificial landscape drive him to think about a ranch where he rode a horse for the first time. While trotting next to a pond, a duck drew his attention and he was about to fall down. In that moment the frog croaked again and he decided to harry up.

Period = 7

John is looking at a small bird, while he remembers his grandpa, who was a passionate bird-watcher. He had a light-blue old car and when John was a child, during a short trip with his grandpa, he saw an ostrich. Their small dog began barking and he asked his grandpa to speed up, but he answered: “Hey, this is not a Ferrari!” (Ok, my fantasy is going too fast…). During the same trip, they saw a deer and John took a photo with his brand new camera.

Period = 20 (two stories partially overlapped)

Story A:

Story B:

These two cases are a little bit more complex and I prefer not fantasize. The concept is always the same: an input drives the network to produce an association B (which can be made up of visual, auditory, olfactory, … elements). B can elicit a new response C and so on.

Two elements are interesting and they worth an investigation:

Using a Recurrent Neural Network (like an LSTM) to process short sequence with different components (like pictures and sounds)

Adding a “subconscious” layer, that can influence the output according to a partially autonomous process.

Source code

View the code on Gist.

References

See also:

Lossy image autoencoders with convolution and deconvolution networks in Tensorflow – Giuseppe Bonaccorso Fork Autoencoders are a very interesting deep learning application because they allow a consistent dimensionality reduction of an entire dataset with a controllable loss level. The Jupyter notebook for this small project is available on the Github repository: https://github.com/giuseppebonaccorso/lossy_image_autoencoder.