I’ve spent the last few days trying to understand what sparse coding is. Here are my results!

We won’t dive deep into theoretical math; this article tries to explore the notion of sparse coding and its impact on neural networks. I prefer to experiment with a notion and implement some concrete hypotheses rather than just playing with the math.

Please, if you find any errors, contact me: I’ll be glad to hear from you :)

Finally, if you are more of a code fan, go straight to our Github version of this blog post:

What is Sparse coding?

The mathematical standpoint

Sparse coding is the study of algorithms which aim to learn a useful sparse representation of any given data. Each datum will then be encoded as a sparse code:

The algorithm only needs input data to learn the sparse representation. This is very useful since you can apply it directly to any kind of data; this is called unsupervised learning.

It will automatically find the representation without losing information (as if one could automatically reveal the intrinsic atoms of one’s data).

To do so, sparse coding algorithms try to satisfy two constraints at the same time:

For a given datum as a vector x, it will try to learn how to build a “useful” sparse representation as a vector h using an encoder matrix E.

For each representation as a vector h, it will try to reconstruct the datum x using a decoder matrix D.

D and E are learnable parameters.

The mathematical representation of the general objective function for this problem can help:

min_D min_{h_1, …, h_N} Σ_{k=1}^{N} ‖x_k − D·h_k‖² + λ·‖h_k‖₁   subject to ‖D_i‖ ≤ C for every row D_i of D

where:

N is the number of data points in the data

x_k is the k-th given vector of the data

h_k is the sparse representation of x_k

D is the decoder matrix (more on that later)

lambda is the coefficient of sparsity

C is a given constant

The general form looks like we are trying to nest two optimization problems (the double “min” expression). I prefer to see this as two different optimization problems competing against each other to find the best possible middle ground:

The first “min” acts only on the left term of the sum, trying to minimize the reconstruction loss of the input by tweaking D.

The second “min” tries to promote sparsity by minimizing the L1-norm of the sparse representation h.

Put simply, we are just trying to solve a problem (in this case, reconstruction) while using the least possible amount of resources to store our data.
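To make this objective concrete, here is a minimal NumPy sketch (the function name and toy values are mine, not from the original code) that evaluates the reconstruction term plus the L1 sparsity term for given codes:

```python
import numpy as np

def sparse_coding_objective(X, H, D, lam):
    """Evaluate sum_k ||x_k - h_k D||^2 + lam * ||h_k||_1.

    X: (N, n) data, H: (N, m) sparse codes, D: (m, n) decoder,
    so each reconstruction is h_k @ D.
    """
    recon = H @ D                          # reconstruct every datum at once
    recon_loss = np.sum((X - recon) ** 2)  # first "min": reconstruction term
    sparsity = lam * np.sum(np.abs(H))     # second "min": L1 sparsity term
    return recon_loss + sparsity

# Tiny example: a perfect reconstruction leaves only the sparsity term.
D = np.eye(2)                    # identity decoder
X = np.array([[1.0, 0.0]])
H = np.array([[1.0, 0.0]])       # one active "atom"
print(sparse_coding_objective(X, H, D, lam=0.5))   # 0.0 + 0.5*1 = 0.5
```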

Note on the constraint on D rows

If you don’t apply any constraint on the rows of D, you can actually make the L1-norm of h as small as you want while making D big enough to compensate. We want to avoid this behaviour, as we are looking for vectors containing actual zero values, with only as few large non-zero values as possible.

The value of C doesn’t really matter; in our case, we will choose C = 1.
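A quick sketch of how that constraint can be enforced in practice (the matrix values are made up; note that some formulations put the constraint on the columns of D instead, depending on convention):

```python
import numpy as np

# Hypothetical decoder with rows of arbitrary scale.
D = np.array([[3.0, 4.0],
              [0.0, 2.0]])

# Project each row back onto the constraint ||D_i|| <= C with C = 1:
# rows that are too long get rescaled to unit length.
C = 1.0
norms = np.linalg.norm(D, axis=1, keepdims=True)
D_constrained = D * np.minimum(1.0, C / norms)

print(np.linalg.norm(D_constrained, axis=1))   # -> [1. 1.]
```

Now D can no longer grow without bound to compensate for a shrinking h.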

The vector space interpretation

If you try to find a vector space to plot a representation of your data, you need to find a basis of vectors.

Given a number of dimensions, sparse coding tries to learn an over-complete basis to represent the data efficiently. For that to happen, you must provide enough dimensions up front to learn this over-complete basis.

In real life, you just give at least as many dimensions as the original data are encoded in. For example, for images of size 28x28 with only 1 channel (grey scale), the space containing our input data has 28x28x1 = 784 dimensions.

Why would you need it to be over-complete?

An over-complete basis means redundancy in your basis, and vectors (while training) will be able to “compete” to represent data more efficiently. Also, you are assured that you won’t need all the dimensions to represent any data point: in an over-complete basis, you can always set some dimensions to 0.

Features will have multiple representations in the obtained basis, but by adding the sparsity constraint you can enforce a unique one, as each feature will get encoded with its most sparse representation.
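A toy illustration of this redundancy (the vectors and numbers here are my own example): in the over-complete set {e1, e2, e1+e2} for the plane, the point (1, 1) has several exact representations, and the L1 penalty prefers the sparsest one.

```python
import numpy as np

# Three atoms spanning the plane: an over-complete set.
atoms = np.array([[1.0, 0.0],    # e1
                  [0.0, 1.0],    # e2
                  [1.0, 1.0]])   # e1 + e2 (redundant atom)

x = np.array([1.0, 1.0])

# Two exact ways to write x as a combination of atoms:
h_dense  = np.array([1.0, 1.0, 0.0])   # e1 + e2       (two non-zeros)
h_sparse = np.array([0.0, 0.0, 1.0])   # 1 * (e1 + e2) (one non-zero)

assert np.allclose(h_dense @ atoms, x)
assert np.allclose(h_sparse @ atoms, x)

# The L1 norm picks the sparser code.
print(np.abs(h_dense).sum(), np.abs(h_sparse).sum())   # 2.0 1.0
```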

The biological standpoint

From the biological standpoint, the notion of sparse code (and sparse coding) comes after the more general notion of neural code (and neural coding).

Wikipedia says:

Neural coding is concerned with characterizing the relationship between the stimulus and the individual or ensemble neuronal responses and the relationship among the electrical activity of the neurons in the ensemble.

Let’s say you have N binary (for simplicity) neurons (which can be biological or virtual). Basically:

You will feed some inputs to your neural network, which will give you an output in return.

To compute this output, the neural network will strongly activate some neurons while some others won’t activate at all.

The observation of those activations for a given input is what we call the neural code, and the proportion of activations over some data is called the average activity ratio.

Training your neurons to reach an acceptable neural code is what we call neural coding.
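These two notions can be sketched in a few lines of NumPy (the activation matrix below is made up for illustration):

```python
import numpy as np

# Binary activations of 4 neurons over 3 stimuli (made-up values).
activations = np.array([[1, 0, 0, 0],
                        [1, 1, 0, 0],
                        [0, 1, 0, 0]])

# The neural code for stimulus 0: which neurons fired.
print(np.flatnonzero(activations[0]))    # [0]

# Average activity ratio: fraction of (stimulus, neuron) pairs that fired.
ratio = activations.mean()
print(ratio)                             # 4/12 ≈ 0.33
```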

Now that we know what a neural code is, we can start to guess what it could look like. There are only 4 big cases:

No neuron activates at all

Only one neuron gets activated

Less than half of the neurons get activated

Half of the neurons get activated

Let’s put aside the case where no neuron activates at all, as it is pretty dull.

Local code

At one extreme, we have a low average activity ratio corresponding to local codes: Only one neuron gets activated per feature of the stimulus. No overlap between neurons is possible: one neuron can’t react to different features.

Note that a complex stimulus might contain multiple features (multiple neurons can fire at the same time for a complex stimulus)

Dense code

The other extreme corresponds to dense code: half of the neurons get activated per feature.

Why half (average activity ratio of 0.5)?

Remember that we are looking at neurons in the ensemble: for any given distribution of activated neurons with an activity ratio > 0.5, you could swap the activated neurons with the inactivated ones without losing information in the ensemble. Remember that all neurons are used to encode information; unlike in sparse code, one neuron doesn’t hold any specific information by itself.

Sparse code

Everything in between those two extremes is what we call sparse code. We take into account only a subset of neurons to encode information.

A sparse code can be seen as a good compromise, balancing all the different characteristics of neural codes: computation power, fault tolerance, interference, complexity, etc. This is why, from an evolutionary perspective, it makes sense that the brain works only with sparse code.

More on that in the why section.

Linking the two fields

How does neural code relate to the mathematical definition of sparse coding?

These two fields can be linked thanks to neural networks: you can interpret a neural network in terms of projections of information into vector spaces. Each layer of neurons receives a representation of the data (in a vector space) from the previous layer and projects it into another vector space.

If you interpret a layer as a data projection, what can be interpreted as the basis of the vector space?

It’s the neurons of each layer: each neuron represents a vector. If a neuron is activated in a layer, it means that you need this dimension to encode your data. The fewer neurons are activated per projection, the closer you are to a sparse representation of your data, and to a sparse code in the biological sense.

Choosing the number of neurons per layer is then equivalent to choosing the number of dimensions of the over-complete basis which will represent your data. It is the same idea as in word embeddings, when you choose the number of dimensions of the word vector space used to encode your words.

Remember the mathematical equation

You can notice the presence of 2 unknown variables in the equation: D and h. In neural networks:

h = E · x (I avoid writing the bias terms for the sake of simplicity), with E a weight matrix.

You can interpret the neural network weights E as the weights of an encoder (trying to create the most sparse representation of the data it can) and the weights D as the weights of a decoder (trying to reproduce the original content as well as it can). This happens to be what we call an auto-encoder.
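This encoder/decoder view fits in a few lines of NumPy (the dimensions and random weights below are stand-ins of mine; a real auto-encoder would learn E and D):

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 4, 8                      # input dims, hidden (over-complete) dims
E = rng.normal(size=(m, n))      # encoder weights
D = rng.normal(size=(n, m))      # decoder weights

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, lam=0.1):
    h = relu(E @ x)                                   # sparse-ish code
    x_hat = D @ h                                     # reconstruction
    loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(h))
    return h, x_hat, loss

x = rng.normal(size=n)
h, x_hat, loss = forward(x)
print(h.shape, x_hat.shape)      # (8,) (4,)
```

Note that the ReLU already produces exact zeros in h; the L1 term in the loss pushes training to produce even more of them.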

Remark: “How many neurons?”, you might ask. It seems impossible to tell without experimenting; this is why we use cross-validation to optimize hyper-parameters. If the number of neurons (and so, of dimensions) is greater than the dimension of the intrinsic vector space of the data, then you can infer an over-complete basis, and so you can achieve sparse coding in your neural network. If you don’t have enough neurons, you will compress your data with some loss, getting worse results when you try to reproduce the data.

Why not use as many neurons as we can? Because this is just too expensive and terribly inefficient.

Why can a sparse representation of data be useful?

Intuitions

A good sparse coding algorithm would be able to find a very good over-complete basis of vectors representing the inherent structure of the data. By “very good”, I mean “meaningful” for a human:

It could be compared to:

Decomposing all the matter in the universe into a composition of atoms

Decomposing music into a composition of just a few notes

Decomposing a whole language into just a few letters

And yet, from those basic components, you can generate back all the music, all the matter, or all the words of a language.

Having an algorithm which could infer those basic components just by looking at the data would be very useful, because it would imply that we could pre-train a neural network to learn features that remain useful for any future task you would need to solve.

Sparse coding brings loads of interesting characteristics to the brain, which can be applied to computers too and are often overlooked.

Interference

This scheme of neural code avoids the risk of unwanted interference between neurons: One neuron can’t encode two features at the same time.

In other schemes (see dense/sparse code), one neuron might react to multiple features. If the 2 features appear at the same time in the stimulus, one could imagine the emergence of a new signal from a neuron reacting to both pieces of information, which could be misleading.

Parallelism

Since not all neurons are used at the same time and interference is unlikely in sparse code, neurons can work asynchronously and parallelise the computation of information.

It is important to note that parallelism is closely linked to interference: in a dense code, since all neurons are used to encode any piece of information, trying to parallelise computation leads to a great risk of interference.

Fault tolerance

Having multiple neurons firing for the same stimulus can help recover from an error coming from a single neuron. Local code is especially sensitive to errors, while dense code can always recover.

Memory capacity

In local code, since one neuron represents one feature, you can only encode as many features as you have neurons.

If you think of neurons as binary entities (1 or 0), we have two states per neuron. It means that for dense code we actually have 2^N different combinations of neurons to choose from for each feature of the input data, which is A LOT!

In sparse code, if we set d to be the average number of neurons activated per stimulus, we then have “N choose d” (the binomial coefficient) different combinations this time. It’s interesting because, since d can be a lot smaller than N, this is far fewer than 2^N, leaving room for the other characteristics.
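As a quick sanity check on these counts (with exactly d of N neurons active, the number of distinct patterns is the binomial coefficient C(N, d); the values of N and d below are mine):

```python
from math import comb

N = 100     # total neurons
d = 5       # active neurons per stimulus

# Dense code: every binary pattern over N neurons is in play.
dense_capacity = 2 ** N

# Sparse code with exactly d of N neurons active:
sparse_capacity = comb(N, d)

print(sparse_capacity)      # 75287520: still huge, but far smaller than 2^100
```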

Complexity And Decoding

Obviously, the more neurons you need to observe to detect what feature has been encoded, the more complex it will be to decode information.

Sum up

N: total numbers of neurons

d: average density of activations, we simplify the behaviour of neurons as activated or not

How to use it in neural networks?

The simplest known usage of combining neural networks and sparse coding is the sparse auto-encoder: a neural network that tries to mimic the identity function while under some constraint of sparsity in the hidden layers or the objective function.

After learning the concept, I wanted to try some experiments that I’ve never seen around. My idea was to use the ReLU activation combined with an L1 norm in the objective function to promote sparse neuron activations. Let’s see (with code and TensorFlow) how far we can go with this idea.

Some experiments with TensorFlow

We will now try to validate our previous statement with some very simple neural networks. Multiple questions here:

Can we reach (at all) local code on the MNIST dataset?

Can we reach sparse code while keeping a good accuracy?

Do all neurons get activated through the complete dataset?

Experiments:

A simple fully connected neural network used as a digit classifier on MNIST; I add the sparsity constraint with an L1-norm directly on the hidden layer

I will test with different coefficients of sparsity, especially 0, giving us a baseline without sparsity

Let’s move forward with a CNN, doing the same experiment as before (L1-norm on the activations of the feature maps this time). Since we are trying to classify images, we should see an improvement.

Finally, I will add a reconstruction layer and loss to our CNN architecture and train them jointly to see if any improvement happens

Fully connected neural network

I will do this experiment with only one hidden layer of 784 neurons because our input data is a 28x28 image so I’m sure I can re-encode the data in as many dimensions as the original space. Let’s see how sparse we can get!

All code files can be found in the repo

First, we load the MNIST dataset and prepare our placeholders for training/testing:

We add our neural layer and calculate the average density of activations:
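As a framework-agnostic sketch of what this step computes (the shapes, random weights, and variable names here are my own; the repo has the actual TensorFlow code):

```python
import numpy as np

rng = np.random.default_rng(42)

batch, n_in, n_hidden = 32, 784, 784
x = rng.random((batch, n_in))                      # stand-in for MNIST pixels
W = rng.normal(scale=0.1, size=(n_in, n_hidden))   # layer weights
b = np.zeros(n_hidden)

h = np.maximum(x @ W + b, 0.0)    # ReLU hidden layer
density = (h > 0).mean()          # average density of activations
print(0.0 <= density <= 1.0)      # True
```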

We add the SoftMax layer:
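For reference, SoftMax itself boils down to this (a minimal NumPy version of the function TensorFlow provides; the logits are made up):

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

probs = softmax(np.array([[2.0, 1.0, 0.1]]))
print(probs.sum())    # sums to 1 (up to float error)
```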

We add the Loss layer: note that we add our sparsity constraint on activations
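In plain NumPy, the loss we are after is the usual cross-entropy plus the L1 term on the hidden activations (the function name and toy values are mine):

```python
import numpy as np

def loss_with_sparsity(probs, labels, h, lam):
    """Cross-entropy plus the L1 sparsity penalty on hidden activations."""
    eps = 1e-12                                    # avoid log(0)
    cross_entropy = -np.mean(np.sum(labels * np.log(probs + eps), axis=1))
    return cross_entropy + lam * np.mean(np.abs(h))

probs  = np.array([[0.7, 0.2, 0.1]])
labels = np.array([[1.0, 0.0, 0.0]])
h      = np.array([[0.0, 2.0, 0.0]])

print(loss_with_sparsity(probs, labels, h, lam=0.0))   # plain cross-entropy
```

With lam = 0 this reduces to the baseline without sparsity, which is exactly the comparison the experiment makes.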

Now we are going to merge all the defined summaries and only then we will calculate accuracy and create its summary. We do that in this order to avoid graphing the accuracy every iteration
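The accuracy part is simple enough to sketch outside TensorFlow (the probabilities and labels below are made up):

```python
import numpy as np

# Predicted class = argmax of the softmax output; accuracy is the
# fraction of predictions matching the (one-hot) labels.
probs  = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.3, 0.4, 0.3]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

correct = probs.argmax(axis=1) == labels.argmax(axis=1)
accuracy = correct.mean()
print(accuracy)    # 2/3
```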

Now we finish with the training/testing part
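As a toy stand-in for the idea of such a loop (the data, shapes, and learning rate are mine, and only the decoder part is fitted here; the real experiment lets TensorFlow optimize everything):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fit a decoder D to reconstruct fixed data X from fixed codes H
# by plain gradient descent on the mean squared error.
X = rng.normal(size=(16, 4))
H = np.maximum(rng.normal(size=(16, 8)), 0.0)   # non-negative codes
D = np.zeros((8, 4))

lr = 0.01
for step in range(500):
    recon = H @ D
    grad = -2 * H.T @ (X - recon) / len(X)      # d/dD of the mean squared error
    D -= lr * grad

final_loss = np.mean((X - H @ D) ** 2)
print(final_loss)
```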

Since we are talking about images, I’ve done another experiment using a convolutional neural network and 200 feature maps (remember that the weights are shared within a feature map in a convolutional neural net: we are just looking for a local feature everywhere in the image)

Et voilà! Let’s launch our little experiment (do it yourself to see the graphs with TensorBoard). Here is the summary (for 20,000 iterations):

Fully connected neural net (N = 784 neurons):