Capsule Networks for Food Classification

A Bitter Pill to Swallow

Overview

Most state-of-the-art computer vision systems are based on Convolutional Neural Networks (CNNs). However, one of the pioneers of Deep Learning, Geoffrey Hinton, has been researching a different architecture called a Capsule Network (CapsNet) for several years now. The first practical paper finally came out in October 2017, authored by Sara Sabour, Nicholas Frosst and Geoffrey Hinton.

Though the research in this area is still in its early stage and the published results are restricted to relatively simple datasets, Capsule Networks have shown a lot of promise.

In this blog post, we will embark on a journey of applying Capsule Networks to a more challenging domain of food classification, a task that pops up quite often for us at Cookpad. After a brief overview of Capsule Networks, we will focus on solving practical issues, analysing the results and comparing them to the good old Neural Networks that never fail to impress.

Food Classification Based on Images

To enhance our users’ experience, the Cookpad ML team is developing multiple computer vision models which can detect the presence, location and type of food in an image or a video. For example, in the case of ‘dish classification’ the model should say that this photo, taken a week ago in our Bristol office while celebrating Pancake Day, is indeed a pancake.

Of course, we are not the only ones in the world developing such models — food classification is a relatively well-researched topic. However, the models/data do not usually transfer between the subdomains nicely, and everyone has a unique use case for this technology.

For this blog post, we are going to use one of the public datasets available for research purposes, named Food-101:

Example image from Food-101 public dataset, belonging to the class ‘Apple Pie’

Neural Networks from a Prediction Viewpoint

Note (too many words alert): in the next two sections (in grey) I am going to briefly discuss Capsule Networks and how they differ from regular neural networks. If you are familiar with Capsule Networks already, you can safely skip to ‘Adapting Capsule Networks for Food Classification’ section and drink a cup of tea instead of reading these sections.

Another note: if you are completely new to Neural Networks, what follows does not serve as an introduction to them. I would recommend first reading through one of the intro tutorials or videos available online, or better yet, if you have time, going through Andrew Ng’s introductory Coursera course.

Let us spend a little bit of time recalling the basics of the neural network equations. We will put a slight twist on the well-known ideas which might help in understanding Capsule Networks later. The most basic component of a neural network is a neuron (I know that’s not true anymore, you’ve got all these batch normalisations etc., but bear with me). Here are the equations to compute the value of the neuron 𝒂 based on the input x:

z = Wx + b    (1)
𝒂 = f(z)    (2)
f(z) = 1 / (1 + e^(−z))    (3)

Pay attention to the dimensions: 𝒂 is simply a float number, z is also a number, b is also a number, x is a vector of the n neurons from the previous layer in the network, hence its size is n⨉1, W is a matrix of size 1⨉n, and f is some nonlinear scalar function, e.g. the sigmoid function.

Now let’s look at this simple equation from another perspective. Imagine that the input neurons are not just numbers, but very smart creatures instead. Imagine further that they are predicting what the value of z will be with the help of the weight matrix W. Assume b = 0. You can observe that every neuron x_i could be seen as predicting the value n·W_0i·x_i, and then Eq. (1) represents averaging those predictions to compute z. Indeed,

z = (1/n) · Σ_i (n·W_0i·x_i) = Σ_i W_0i·x_i = Wx

Yes, I know, there is this annoying b hanging around which does not always want to be zero, but you can always see it as yet another neuron x_n = 1 with the weight W_0n = b. Sounds like an overly complicated way to explain what a simple linear function does, doesn’t it? But this idea of predictions features heavily in Capsule Networks, and it will become useful when we upgrade our neuron to a capsule.
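To make the prediction-averaging view concrete, here is a minimal NumPy sketch (the variable names and sizes are just for illustration, not from the original code):

```python
import numpy as np

# A single neuron: z = W·x + b can equally be seen as averaging the n
# per-input "predictions" n·W_0i·x_i (with b folded away as an extra input).
n = 4
rng = np.random.default_rng(0)
x = rng.normal(size=n)        # input neurons from the previous layer
W = rng.normal(size=(1, n))   # 1xn weight matrix
b = 0.0                       # assume zero bias, as in the text

z_direct = (W @ x + b).item()            # Eq. (1): the usual linear combination
predictions = n * W[0] * x               # each input neuron's prediction of z
z_from_predictions = predictions.mean()  # averaging the predictions gives the same z

assert np.isclose(z_direct, z_from_predictions)
a = 1.0 / (1.0 + np.exp(-z_direct))      # Eqs. (2)-(3): sigmoid nonlinearity
```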

Capsule Networks

In this section I am going to briefly explain the idea of a Capsule Network as described in the Sabour et al. (2017) paper Dynamic Routing Between Capsules. If you would like a more thorough explanation, I encourage you to have a look at the original paper and/or blogs explaining the concepts in detail, e.g. Max Pechyonkin’s post.

In Capsule Networks the basic computational unit is not a neuron anymore, but a capsule — essentially a multidimensional vector. Now, instead of a number 𝒂, you have a vector c of some small dimension K, for example K = 16. If one thinks of this capsule c as representing some entity, say an apple or a pear, then each dimension c_j, j = 1,…,K would represent different qualities of an apple. Maybe the first couple of dimensions could be the orientation of the apple stalk, another dimension could show how big the apple is, etc. In reality your network would rarely learn such nice features even for simple objects, but it is a nice ideal to be striving for. In the original paper, Sabour et al. (2017) suggested using the size/length of the capsule to represent the probability of an entity (our apple) being present, but it is not the only choice available — you can model such probability in different ways, as the authors pointed out.

So how would we transform the ‘neuron’ equations (1)–(3) to the capsule case? Let’s remember our witty prediction model. Taking it to the capsule level, every input capsule x_i generates a prediction pred_i for the output capsule:

pred_i = W_i · x_i    (4)

Here, x_i is a K-dimensional capsule vector, W_i is a K⨉K matrix of learnable weights specific to input capsule x_i, and pred_i is a K-dimensional prediction of what the output capsule value might look like. We now have one K-dimensional prediction vector for each input capsule. In the neuron case, the predictions were just averaged (Eq. 1), and the sigmoid nonlinearity was applied afterwards (Eq. 2). Here, we do not restrict ourselves to averaging, and the output capsule will have some magical nonlinearity applied to the collection of the predictions to compute the value of the output capsule (we’ll discuss it below in more detail):

c = nonlinearity(pred_1, …, pred_n)    (5)

Above, c is the K-dimensional value of the output capsule. And then, of course, there could be many output capsules — the nonlinearity can also couple the output capsules with each other.

A reasonable question can be asked by a skeptical reader at this point — why bother? Aren’t Neural Networks already highly successful at function approximation? Why make it more complicated with all these extra dimensions? Below we outline some reasons why it might be a worthy idea after all.

The first motivation comes from neuroscience. As it turns out, our brain cortex is not just a big blob of grey matter consisting of billions of crazily interconnected neurons. These neurons are, in fact, organised into structures called cortical columns (read: capsules). As an aside, there are interesting theories about what cortical columns are actually computing. You can read about one such theory in the book On Intelligence, written by Jeff Hawkins, an individual with quite a unique professional career — he also happens to be the founder of Palm Computing and one of the inventors of the Palm Pilot. Incidentally, his theory is all about cortical columns predicting what other columns are doing, albeit in a slightly different sense than what we discuss here.

The second motivation comes from Dr. Hinton’s grand plan to build a framework for inverse graphics (see his rather technical talk). What does that mean? For example, in computer graphics you can estimate the pose (read: capsule) of a hand w.r.t. the pose (capsule) of a body by a viewpoint-invariant matrix multiplication. Now what Dr. Hinton wants to achieve is to estimate poses the other way around — given the poses of the parts, estimate the pose of the whole object. Capsules then present a natural way to organise the computation.

Finally, there are purely mathematical benefits brought by using capsules as opposed to neurons. More specifically, the ‘magical nonlinearities’ in the capsule case can be much more interesting and elaborate than nonlinearities in the one-dimensional neuron case. For example, here is one interesting nonlinearity. Instead of averaging and applying a per-component sigmoid function as in equations (1)–(2), one can do the following.

An example of magical nonlinearity working for a 2-dimensional capsule with dimensions d_1 and d_2. The data points on the plot are the predictions from a number of input capsules. The nonlinearity identifies the cluster in red colour and computes the value of the output capsule as the centre of that cluster. Inspired by Geoffrey Hinton’s talk.

This is really powerful. You can have thousands of input capsules, but in multiple dimensions the likelihood that they predict the same answer just by chance is quite low. So if the output capsule sees a cluster, it could think ‘wow! that must be it, those couple of smart input capsules figured it out!’ and discard everything else. Amazing things you can do in multiple dimensions.

Such a nonlinearity is the basis for dynamic routing between capsules, and it can be implemented in multiple ways. In the Sabour et al. (2017) paper it is done by repeatedly computing dot products between the weighted average of predictions and the predictions themselves to update the weights, while restricting the output capsule length between 0 and 1 with a nonlinear scaling factor (with an additional constraint that the prediction weights for one input capsule should sum up to 1 across all output capsules, thus coupling the output capsules together). But the implementation details are less important here than the idea — the capsule nonlinearities identify clusters of predictions from the input layer at runtime.
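To make Eq. (4) and the routing idea more tangible, here is a simplified NumPy sketch of routing by agreement for a single example (my own illustration of the Sabour et al. (2017) scheme, not the code we used; the shapes and names are made up):

```python
import numpy as np

def squash(s, eps=1e-8):
    # Nonlinear scaling from Sabour et al. (2017): keeps the direction of s,
    # squashes its length into [0, 1).
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def route(x, W, n_iters=3):
    # x: (n_in, K) input capsules; W: (n_in, n_out, K, K) weight matrices.
    # Eq. (4): each input capsule predicts every output capsule.
    pred = np.einsum('iokl,il->iok', W, x)                   # (n_in, n_out, K)

    logits = np.zeros(pred.shape[:2])                        # routing logits
    for _ in range(n_iters):
        # Coupling coefficients: softmax over output capsules per input capsule.
        coup = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        s = np.einsum('io,iok->ok', coup, pred)              # weighted sum of predictions
        v = squash(s)                                        # output capsules
        logits = logits + np.einsum('iok,ok->io', pred, v)   # agreement = dot product
    return v

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 16))                        # 32 input capsules of dimension 16
W = rng.normal(size=(32, 101, 16, 16), scale=0.1)    # one 16x16 matrix per (input, output) pair
v = route(x, W)
print(v.shape)                                       # (101, 16): one capsule per output class
```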

Adapting Capsule Networks for Food Classification

Enough talking, let’s do something!

We will apply Capsule Networks to the Food-101 image classification task, stripping out all the extra regularizers to reduce the number of moving parts. We will compare Capsule Networks with a roughly equivalent CNN applied to the same task.

Disclaimers: In this blog post we are not trying to ‘improve’ the Capsule Networks or ‘scale’ them. The goal is much more modest — adapt some of the ideas to the image classification task in a food domain. In doing so, we inevitably changed Capsule Networks in a way inconsistent with G. Hinton’s and his collaborators’ direction.

In addition, while we tried to be fair at comparisons, it is definitely not rigorous enough for a scientific study, so please take the results and conclusions with a pinch of salt.

Dataset

The Food-101 dataset was released by Bossard et al. in 2014. It consists of 101,000 images split into 101 different classes of food, 1,000 images each (750 images for training, 250 for testing). Some examples of food classes are apple pie, eggs benedict and risotto.

While this dataset cannot help us build models at Cookpad, it is publicly available for research purposes, and thus it will be possible for others to replicate and improve the results we report here.

Experiments

Fortunately, we live in an amazing time for machine learning research. Consider this: the Dynamic Routing Between Capsules paper came out in October 2017. Within a couple of days, blogs and videos explaining what capsule networks are started popping up. Within a couple of weeks, a paper applying Capsule Networks to another dataset (CIFAR-10) was uploaded to arXiv. By February 2018 (which is when I am writing these lines) there exist multiple implementations of this work on GitHub in Keras, TensorFlow, PyTorch, Torch, Chainer, Matlab, MXNet and NumPy. All of this work was done mostly by researchers not connected with the authors of the original paper. That’s not all. The next paper on the topic, Matrix Capsules with EM Routing, is not even published yet, but it is in open double-blind review, and as everyone is pretty sure this paper is by G. Hinton’s group, there are already blogs and even implementations of that paper created by other researchers.

So what’s left for us? As we started, it looked like not much. As Keras is our weapon of choice, we chose an excellent implementation of Capsule Networks by Xifeng Guo to start with, but we also looked at other versions such as this one, also derived from the same codebase. Xifeng Guo’s implementation already applies the network to 28⨉28 grayscale images of handwritten digits with 10 classes.

We decided to minimize the margin loss described by Sabour et al. (2017) rather than the more standard cross-entropy loss, to give an edge to capsule networks, which were evaluated (and hence somewhat tuned) on this loss already — you need to give a head start to new approaches. So all that remains for us to do is to get rid of the regularization played by the reconstruction loss, resize our images to 128⨉128⨉3 ignoring the aspect ratios (who cares about aspect ratios), change the output layer to contain 101 capsules, one for each output class, instead of 10, and voila!
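For reference, here is the margin loss from Sabour et al. (2017) written out as a small NumPy function (the Keras implementation we started from contains an equivalent expressed with backend ops):

```python
import numpy as np

def margin_loss(lengths, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    # lengths: (batch, n_classes) lengths of the output capsules, i.e. class "probabilities".
    # labels:  (batch, n_classes) one-hot ground truth.
    # Present classes are pushed above m_plus, absent ones below m_minus,
    # with the absent term down-weighted by lam.
    present = labels * np.maximum(0.0, m_plus - lengths) ** 2
    absent = lam * (1.0 - labels) * np.maximum(0.0, lengths - m_minus) ** 2
    return np.mean(np.sum(present + absent, axis=1))
```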

Sadly, my beefy AWS box choked and refused to proceed. Let’s have a look at what happened.

In this configuration, the capsule size K equals a sensible 16. Since we have 101 output classes, we need 101 output capsules. On the layer before that, we have a grid of size 128⨉128. For each point on the grid, we have 32 types of capsules (an equivalent of channels for CNNs) — maybe not enough given that we have 101 output classes, but oh well. Now, each channel is one capsule, and each of those needs to be connected to each of the 101 output capsules with a matrix of K² = 16² = 256 entries (see Eq. 4). In total, we have 128⨉128⨉32⨉101⨉256 = 13,555,990,528 parameters between the two capsule layers. For comparison, ResNet-152, one of the most advanced massive CNNs used for fitting the huge ImageNet dataset, has only about 60 million parameters, which is roughly 225 times smaller. Note: the actual calculation is slightly more elaborate, so to avoid confusion we used these numbers; the point still stands.
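The back-of-the-envelope count, spelled out (the layer sizes are the ones quoted above):

```python
# Fully connecting two capsule layers: every capsule on the 128x128 grid, in each
# of its 32 channels, gets its own 16x16 weight matrix to every one of the 101
# output capsules.
grid = 128 * 128       # spatial positions
channels = 32          # capsule types per position
n_classes = 101        # output capsules
K = 16                 # capsule dimension, so each weight matrix has K * K entries

params = grid * channels * n_classes * K * K
print(f"{params:,}")   # 13,555,990,528 -- about 225x the ~60M parameters of ResNet-152
```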

How can we reduce this number of parameters? There are many obvious things we can do to address the issue.

Firstly, in the unpublished follow-up paper Matrix Capsules with EM Routing, one of the new ideas was to represent the capsule as a matrix rather than a vector. What does that mean? Say we’ve got a 16-dimensional vector as a capsule. If we instead treat it as a 4⨉4 matrix, then to make a prediction of size 4⨉4 for an input capsule of size 4⨉4 we need to multiply the input capsule matrix by a weight matrix also of size 4⨉4!

So we need only K parameters instead of K² if we represent capsules as matrices. Certainly the smaller matrices will be less expressive, but I think we can all live with that. We have implemented this change, tested it on MNIST and verified that it still works reasonably well. Now we are of course mixing up different approaches — we treat capsules as regular vectors when we compute the magical nonlinearity, but treat them as matrices when computing the predictions. Sigh!
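Here is a tiny NumPy sketch of that trick (illustrative shapes only): a 16-dimensional capsule is reshaped into a 4⨉4 matrix, so one prediction costs 16 weights instead of 256.

```python
import numpy as np

K = 16
side = int(np.sqrt(K))                 # 4
rng = np.random.default_rng(0)

x_vec = rng.normal(size=K)             # the capsule as a 16-dimensional vector
x_mat = x_vec.reshape(side, side)      # the same capsule viewed as a 4x4 matrix
W = rng.normal(size=(side, side))      # only K = 16 learnable weights per prediction

pred_mat = x_mat @ W                   # 4x4 prediction for the output capsule
pred_vec = pred_mat.reshape(K)         # flattened back when the routing nonlinearity
                                       # treats capsules as plain vectors again
```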

All right, now we are down to 13,555,990,528 / 16 = 847,249,408 parameters. Better, but still too many. If you have a closer look, the worst factor is clearly the 128⨉128 = 16,384 cells of our image grid — without these cells, we would be back to sensible numbers. In fact, if you run CapsNet on the CIFAR-100 dataset (which is like CIFAR-10, but with 100 output classes instead of 10), it already starts working because the input image is much smaller — 32⨉32. But 128⨉128 is over the top.

Many of you would say at this point — how about convolutions, as proposed by the authors themselves? Indeed, instead of routing all the image capsules to all the output capsules, we can put a couple of ‘capsule convolution’ layers in between, similar to the ideas expressed in the Matrix Capsules with EM Routing paper. The convolutions in the capsule case are going to be slow though, because you need to compute the magical nonlinearity for each patch on the image.

There is a faster alternative widely used in CNNs to solve exactly this problem — average pooling. In this format, we simply average all the capsules across the grid per channel. Thus, we lose all the positional information from the capsules, but save on the number of parameters — it is now only 32⨉16⨉101 = 51,712. This is rather disturbing from a theory point of view — global average pooling totally diverges from G. Hinton’s idea of how the capsules should interact, as the positional information is critical. But applying average pooling on MNIST showed that at the very least it still fits the data well, and the magical nonlinearity still makes a massive difference. Indeed, we were so afraid of losing the magical nonlinearity that with every change we kept turning it off and on again and checking that the performance dropped. And it did drop.
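A minimal sketch of what this pooling step looks like (array shapes are illustrative):

```python
import numpy as np

# Capsule grid coming out of the layers below: one 16-dimensional capsule per
# spatial position and per channel.
H, W, channels, K = 128, 128, 32, 16
capsule_grid = np.random.default_rng(0).normal(size=(H, W, channels, K))

# Global average pooling: average over the spatial grid, keeping one capsule per
# channel and throwing away all positional information.
pooled = capsule_grid.mean(axis=(0, 1))           # shape (32, 16)

n_classes = 101
params_to_output = channels * K * n_classes       # 32*16*101 = 51,712 routing weights
```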

After applying average pooling, we are back to a reasonable number of parameters and our nice little network starts learning! Baby steps, but very exciting to see. After a couple of iterations we reached an accuracy of 16% on the test set. While not great, it is still much better than the 1% you would get from randomly guessing one of 101 classes.

So the network is learning something. However, even in 2014 (ancient times) Bossard et al. reported 56% accuracy on the test set, and a few results published since then beat that baseline by far. To keep the analysis simple, we intentionally did not apply lots of techniques that could help us, e.g. data augmentation, but we want to at least get into that ballpark to feel a bit better about ourselves.

Well, the first thing we could do is wait longer and see how the network converges, then tune and modify it iteratively. Unfortunately, this is rather slow! The main bottleneck here is actually not the magical nonlinearity anymore, but the layers before it. Even after using separable convolutions it still took 20 minutes per epoch to run on a Tesla K80. Not a big deal of course, and one could still go ahead, but can we somehow get there faster?

If we were training a regular CNN, the very first thing to try would be transfer learning with a large network such as ResNet pre-trained on the ImageNet dataset. Why not follow the same strategy here?

In transfer learning, we would take a large pre-trained network and stick a smaller network on top of its layers to fit our own (smaller) dataset. In this way, the larger network’s layers play the role of feature extractors (which could also be fine-tuned, but we are not going down that road). Unfortunately, using good feature extractors from a pre-trained CNN goes against G. Hinton’s ideas — he suggested that capsules should start operating on fairly low-level representations. But we stubbornly go ahead anyway, throwing all the theory out of the window. Sigh again.
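A rough sketch of the resulting architecture (the layer sizes are illustrative, and CapsuleLayer stands in for a routing layer like the one in the CapsNet-Keras repository linked below):

```python
from keras import layers, models
from keras.applications import MobileNet
# CapsuleLayer is assumed to come from the CapsNet-Keras implementation
# (capsulelayers.py in Xifeng Guo's repository, linked in the references).
from capsulelayers import CapsuleLayer

# Frozen MobileNet backbone acting as the feature extractor.
base = MobileNet(include_top=False, weights='imagenet', input_shape=(128, 128, 3))
for layer in base.layers:
    layer.trainable = False

# Project the feature map into 32 capsule channels of dimension 16,
# then flatten the spatial grid into a list of primary capsules.
x = layers.Conv2D(32 * 16, kernel_size=1, activation='relu')(base.output)
x = layers.Reshape((-1, 16))(x)

# Route the primary capsules to 101 output capsules, one per food class.
food_caps = CapsuleLayer(num_capsule=101, dim_capsule=16, routings=3)(x)

model = models.Model(inputs=base.input, outputs=food_caps)
```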

After we stack our CapsNet on top of a pre-trained MobileNet (we love MobileNet because it is awesome), it easily reaches 53% accuracy and 0.36 loss on the test set before starting to overfit. Does the magical nonlinearity still help? Yes, turning it off makes everything converge quite slowly and the performance on the test set is worse too (loss 0.40 and 51% accuracy). Phew.

This is a good time to bring CNNs back into the game. A standard MobileNet-based CNN of the same structure and approximately the same number of parameters fitted the training data faster, but generalised worse, reaching a loss of 0.38 and 52% accuracy. This suggests, quite reasonably, that in this context the Capsule Network structure works somewhat as a regularizer.

Now we are in a position to run the experiments reasonably quickly and check a few interesting points. For example, recall that in the Sabour et al. (2017) version of the magical nonlinearity, the prediction weights of each input capsule should sum to 1 across the output capsules. This essentially represents the constraint that an object can belong to one of the bigger objects in the scene, but not to several of them at the same time. We modified the algorithm by removing this constraint and observed that the test set performance was marginally worse (0.36 loss and 52% accuracy). This suggests that the constraint also helps to regularize the network, although it does not seem to be critical for the network’s performance.

We have also experimented with adding convolutional capsule layers before the final layer, using a naive implementation. Interestingly, we faced some issues with the nonlinearity from Sabour et al. (2017), as the gradients duly exploded when we stacked things up. This could be mitigated by using gradient clipping or normalization, but the performance suffers. We tried various versions of the magical nonlinearity, and those that worked well with convolutions also turned out to behave more like CNNs, with quicker convergence and worse generalisation. However, the problems could be due to my faulty implementation rather than some more fundamental issue, as we haven’t yet had a chance to understand thoroughly what happened. In any case, in the Matrix Capsules with EM Routing paper the authors used convolutions successfully, albeit with a different routing mechanism.

Results and Conclusions

Here is a summary of the experiments that we ran, aiming to achieve the best generalisation performance on the margin loss.

Pooling means using global average pooling. -constraint means the constraint that couples output capsules is removed. -magical_nonlinearity means Sabour et. al. (2017) nonlinearity was turned off. Convolutions* represents our best effort at stacking the capsule layers in convolutions.

To our surprise, Capsule Networks were able to perform competitively and even outperform comparable CNNs in our setting. Of course, the setting is rather restrictive, and many improvements that could be made to CNNs might not, and likely will not, work equally well for CapsNets. Indeed, the performance we were analysing is far below what the combination of modern techniques can do on this dataset, e.g. Aguilar et al. (2017) report >80% accuracy.

Yet, these results confirm that the ideas behind CapsNets are widely applicable even when their application deviates significantly from the way they have been designed to work. You can compose capsule networks easily with pre-trained CNNs and train reasonable classifiers. In all cases, the dynamic routing mechanism (i.e. magical nonlinearity) seems to work well and improves the generalisation performance of the network.

The main question that remains is whether you can make CapsNets beat the best of the best of what a pure CNN world can offer in a fair fight, preferably on a massive task like ImageNet Challenge. And can you make them run more efficiently? We will probably learn the answers to these questions very soon as some very smart people are working to figure them out.

References

Non-exhaustive list of resources used.

Papers/Books

[1] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. Advances in Neural Information Processing Systems 30, pages 3856–3866, 2017.

[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 — mining discriminative components with random forests. In European Conference on Computer Vision, 2014.

[3] Anonymous. Matrix capsules with EM routing. In review, 2018.

[4] Edgar Xi, Selina Bing, and Yang Jin. Capsule network performance on complex data. CoRR, abs/1709.04864, 2017.

[5] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

[6] Eduardo Aguilar, Marc Bolaños, and Petia Radeva. Food recognition using fusion of classifiers based on CNNs. Image Analysis and Processing — ICIAP 2017, pages 213–224, 2017.

[7] Niki Martinel, Gian Luca Foresti, and Christian Micheloni. Wide-slice residual networks for food recognition. CoRR, abs/1612.06543, 2016.

[8] Geoffrey E. Hinton, Alex Krizhevsky, and Sida D. Wang. Transforming auto-encoders. In Proceedings of the 21th International Conference on Artificial Neural Networks — Volume Part I, ICANN’11, pages 44–51, Berlin, Heidelberg, 2011. Springer-Verlag.

[9] Jeff Hawkins and Sandra Blakeslee. On Intelligence. Times Books, 2004.

Blogs/Videos

Max Pechyonkin’s post explaining Capsule Networks

Adrian Colyer’s blog about Matrix Capsules with EM Routing

Aurélien Géron’s Capsule Networks video tutorial

Geoffrey Hinton’s talk “What is wrong with convolutional neural nets?”

Implementations

https://github.com/XifengGuo/CapsNet-Keras

https://github.com/theblackcat102/dynamic-routing-capsule-cifar

Also, check out this awesome page that aggregates the information about Capsule Networks.