MIT Technology Review recently ran an article titled “Why even a moth’s brain is smarter than an AI”. Before I go into details about the “moth brain”, let me start with an unsurprising spoiler: the article is as wrong as its title is obnoxiously provocative. Even if the moth brain, or some mathematical model associated with it, were “smarter” (i.e. more sample-efficient) than current deep learning approaches, no clear evidence for that is presented anywhere: neither in the cited paper, nor in the follow-up paper “A moth brain learns to read MNIST”. (If you only care about the analysis of their MNIST results, jump to the last few paragraphs of this article.)

Besides the sensationalistic title of the MIT Technology Review article, the seemingly interesting fact about the research reported there is that Charles Delahunt et al. claim to have created a mathematical model of a tiny section of the moth brain that can learn from significantly fewer training instances than current machine learning approaches. This is a big claim. It is also a topic that a lot of researchers (myself included) are naturally interested in: it is folklore in some circles that deep learning is notoriously inefficient at learning from limited amounts of training data. So, whenever this proverbial fact comes up, researchers nod wisely, pretend it is a big problem, and go back happily to their huge data sets. It is like the weather: everybody complains about it, but nobody does anything about it. Or rather: a lot of people claim to work on it, yet it never seems to improve. It gives me a feeling of déjà vu: until 2014, everybody complained about vanishing gradients, until suddenly they did not. It turned out one just needed to actually try training extremely deep neural networks; instead of doing so, most people simply kept complaining about vanishing and exploding gradients. With ReLU activations and proper initialization, the problem proved much tamer than previously perceived.

I am deeply suspicious of folklore knowledge, and even more suspicious of people coming from biology who claim to suddenly outperform state-of-the-art machine learning approaches by modelling some biological system. My suspicion about the claimed poor sample efficiency of deep learning stems from personal experience. Even in the early days of deep-learning-based computer vision, in 2013, we successfully trained object detection systems for the VOC benchmark using only a couple of thousand training images, without pretraining on ImageNet, while matching or exceeding the best (non-deep-learning) techniques of the time. Of course, training from so little data required a lot of augmentation and regularization, but it was clearly doable with enough patience and skill. I would rather say: deep learning simply scales better with the size of the data set than almost all known competing approaches. In fact, I think understanding the limits of deep learning for large data is more fruitful than complaining about its poor generalization from too little.

But let us get back to the paper with the striking title “Biological Mechanisms for Learning: A Computational Model of Olfactory Learning in the Manduca sexta Moth, with Applications to Neural Nets” by Charles Delahunt et al. First, it is my personal opinion that this line of research is exciting: it is extremely relevant (for AI as well) and has great potential. I really like several aspects of the paper, and if I were reviewing it for a conference or journal, I would want it to appear, though with more careful claims. I express my criticism here only because I think the work is a great step towards much more research in this domain and is worth a close look. The paper does have a few flaws, but on the bright side it is written very well, with a lot of care and thorough reporting, and it contains valuable experimental results. Also, the authors plan to publish their computational model soon, which will foster future experiments in this domain.

Here, I just want to point out that their computational model is verified only by analyzing statistical features of the model and by observing that certain high-level neural firing statistics agree with those observed in moths. They also run experiments to check that the mathematical model is capable of some learning behaviour. This is clearly useful as a general high-level test, but it is still relatively weak evidence for the claim that the model faithfully emulates all important characteristics of that part of the moth brain. For example, given that their model is a slight modification of the standard spiking neural network abstraction (with some simple connection and firing statistics modelled after the neural connections found in the moth brain), everything boils down to the detailed quality of the model: consistent reactions to the same stimuli and the quality of the learning performance, that is, speed, sample efficiency, and retention of old knowledge. Unless these properties are measured in a comparable manner and verified in more detail, it is very hard to argue for the faithfulness of the model. Still, as mentioned above, the paper presents a well-founded initial working hypothesis that will hopefully trigger new research in large-scale mathematical modelling of biological neural systems.

Given that I am not an expert in biology, I don’t want to go into detail about the above paper, but I do want to express some skepticism about the aptly named follow-up extended abstract “A moth brain learns to read MNIST”. This is more in line with my background, so I can ask better-founded questions and offer constructive criticism. Again, I am thankful that the authors chose to publish this work early on, since it helps to evaluate their approach much more substantially: not just based on observations in a domain that has not been studied by AI researchers (smells perceived by moths), but on the most extensively studied machine learning data set in existence: MNIST.

Delahunt et al. claim that they outperform most baselines on MNIST. For example, they claim to reach around 75% accuracy while using ten examples per class (that is, a total of 100 supervised samples).
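To make the comparison concrete: the ten-examples-per-class setting simply means drawing a tiny, balanced subset of the MNIST training set and then measuring accuracy on the full 10,000-image test set. A minimal sketch of how such a subset can be drawn (the function name and the use of NumPy are my own choices, not the authors’):

```python
import numpy as np

def sample_per_class(images, labels, k, seed=0):
    """Draw k examples of each class to form a small, balanced
    supervised training set (e.g. k=10 gives 100 MNIST samples)."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=k, replace=False)
        for c in np.unique(labels)
    ])
    rng.shuffle(idx)  # avoid class-ordered batches during training
    return images[idx], labels[idx]
```

Everything else in the training set stays untouched (and, in the MothNet setting, unused); only the label budget is restricted.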

While there are quite a few bad things about MNIST (for example, experimental conclusions on MNIST rarely transfer to other data sets), the good thing about it is that there are a lot of baseline results for it. It is known that the ladder networks of Harri Valpola’s group do a great job at semi-supervised learning on MNIST: their paper reports about 99% accuracy using only 100 labelled training samples. This is vastly better than the ~75% reported for the moth-brain-inspired models, not to mention their best “baseline” result of 60% with an SVM and even worse results with convolutional networks. The ladder network accuracy comes with the caveat that it uses the rest of the MNIST training set (tens of thousands of digits) in unlabelled form, while MothNet does not. This is not a deal breaker in real-life applications, since unlabelled training data is very cheap to come by. Arguably, humans and animals also learn from a huge amount of unlabelled data, so it seems a good idea to leverage it.

I got curious: without using any unlabelled data, what baseline could I get on MNIST with so few training instances and minimum effort? For lack of time, I constrained myself to training a single linear layer with a softmax classifier, straight over the image pixels, and tuned only three basic hyper-parameters. Knowing that increasing the batch size tends to hurt generalization, I resorted to training with batch size 1. Using 90% dropout on the input pixels, weight decay 0.00001 and the Adam optimizer with learning rate 0.1, I could easily get to an accuracy of 78%. I don’t claim that this experiment is scientific or the best possible result, but it was my third try. This is already marginally better than the claimed result for MothNet and significantly better than their baseline, and it took about 10 minutes of experimentation. With similar settings, 60% accuracy could be reached with 3 samples per class and 45% accuracy with one sample per class. These numbers are worse than what they claim for “Moth Fast” (though still better than their “Moth Natural” numbers and much better than their reported baseline). Admittedly, I could not easily beat the 70% accuracy for the 1 sample/class case, but I have not even started to try any deep network at all.
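For the curious, the whole baseline fits in a few dozen lines. Below is a minimal pure-NumPy sketch of the setup described above: a single linear layer with softmax output, trained with Adam at batch size 1, inverted dropout on the input pixels and a small weight decay. The function names and this particular implementation are mine, not a reproduction of my original script; on real MNIST you would pass in the flattened 784-pixel images scaled to [0, 1]:

```python
import numpy as np

def train_softmax(x, y, n_classes=10, epochs=200, lr=0.1,
                  drop=0.9, wd=1e-5, seed=0):
    """Single linear layer + softmax, trained sample-by-sample (batch
    size 1) with Adam, input dropout and weight decay."""
    rng = np.random.default_rng(seed)
    n_features = x.shape[1]
    w = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    # Adam moment estimates for weights and biases
    mw, vw = np.zeros_like(w), np.zeros_like(w)
    mb, vb = np.zeros_like(b), np.zeros_like(b)
    beta1, beta2, eps, t = 0.9, 0.999, 1e-8, 0
    for _ in range(epochs):
        for i in rng.permutation(len(x)):
            # inverted dropout: keep each pixel with prob (1 - drop)
            xi = x[i] * (rng.random(n_features) >= drop) / (1.0 - drop)
            z = xi @ w + b
            z -= z.max()                      # numerical stability
            p = np.exp(z); p /= p.sum()
            p[y[i]] -= 1.0                    # grad of cross-entropy wrt z
            gw = np.outer(xi, p) + wd * w
            gb = p
            t += 1
            for g, m, v, param in ((gw, mw, vw, w), (gb, mb, vb, b)):
                m *= beta1; m += (1 - beta1) * g
                v *= beta2; v += (1 - beta2) * g * g
                m_hat = m / (1 - beta1 ** t)
                v_hat = v / (1 - beta2 ** t)
                param -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, b

def predict(w, b, x):
    """No dropout at test time; inverted dropout keeps scales matched."""
    return (x @ w + b).argmax(axis=1)
```

Nothing here is specific to MNIST; the same few lines give a quick sanity baseline for any small-sample classification claim.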

Given that their results (and especially their baselines) are so easy to beat by a large margin, even with a single linear layer (although they report SVM results that should be quite comparable to my experiments), I became wary of their deeper (CNN) baselines as well. I guess they have not tried dropout at all, which is a bit unfair, since dropout is a well-established technique and MothNet uses noise during training, too.

However, the higher-level moral of the story is this: if the machine learning community wants to get more serious about sample efficiency (which it claims it does), we should stop complaining about the weather and instead establish some standard baseline results for training with few samples that can be tracked and referred to. Otherwise we might just be walking around with umbrellas while the weather is sunny and no clouds are in sight.