Dr. Jürgen Schmidhuber is Director of the Swiss Artificial Intelligence Lab, IDSIA. His research team’s artificial neural networks (NNs) have won many international awards, and recently were the first to achieve human-competitive performance on various benchmark data sets. I asked him about their secrets of success.

AA: In several contests and machine-learning benchmarks, your team’s NNs are now outperforming all other known methods. As The New York Times noted on Friday, last year a program your team created won a pattern recognition contest, outperforming both competing software systems and a human expert at identifying images in a database of German traffic signs.

“The winning program accurately identified 99.46 percent of the images in a set of 50,000; the top score in a group of 32 human participants was 99.22 percent, and the average for the humans was 98.84 percent,” the Times pointed out. Impressive. What is the importance of traffic sign recognition in this field?

JS: That was from the IJCNN 2011 Traffic Sign Recognition Competition. This is highly relevant for self-driving cars as well as modern driver-assistance systems.

BTW, if you don’t obey a traffic sign in Switzerland, you go to jail. However, across the border there is Italy. There you’ll also find traffic signs in the street, but only for decoration. :)

AA: What’s your team’s secret?

JS: Remarkably, we do not need the traditional sophisticated computer vision techniques developed over the past six decades or so. Instead, our deep, rather biologically plausible artificial neural networks are inspired by human brains, and they learn to recognize objects from numerous training examples.

I discuss this in detail in a talk at AGI-2011, “Fast Deep/Recurrent Nets for AGI Vision” (only voice and slides though):

We often use supervised feedforward or recurrent (deep by nature) NNs with many nonlinear processing stages. When we started this type of research over 20 years ago, it quickly became clear that such deep NNs are hard to train. This is due to the so-called “vanishing gradient problem” identified in the 1991 thesis of my former student Sepp Hochreiter, who is now a professor in Linz. But over time we found several ways around this problem. Committees of NNs improve the results even further.
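[Ed. note: the vanishing gradient problem can be shown in a few lines. The sketch below is an illustration of the general phenomenon, not Hochreiter’s original analysis: it backpropagates an error signal through a deep stack of sigmoid layers and watches its norm collapse, because each layer multiplies the signal by the sigmoid’s derivative, which is at most 0.25.]

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Forward pass through a deep stack of sigmoid layers, storing activations.
depth, width = 30, 10
weights = [rng.normal(0, 1.0 / np.sqrt(width), (width, width)) for _ in range(depth)]

h = rng.normal(size=width)
activations = []
for W in weights:
    h = sigmoid(W @ h)
    activations.append(h)

# Backpropagate a unit error signal and record the gradient norm per layer.
grad = np.ones(width)
norms = []
for W, a in zip(reversed(weights), reversed(activations)):
    grad = W.T @ (grad * a * (1.0 - a))  # sigmoid'(z) = a*(1-a) <= 0.25
    norms.append(np.linalg.norm(grad))

print(f"gradient norm after  1 layer : {norms[0]:.3e}")
print(f"gradient norm after {depth} layers: {norms[-1]:.3e}")
```

By the bottom of the stack the gradient is many orders of magnitude smaller than at the top, so the lowest layers barely learn.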

In addition, we use GPUs (graphics cards), which are essential because they accelerate learning by a factor of roughly 50. This is sufficient to clearly outperform many previous, more complex machine-learning methods.

One of the reviewers called this a “wake-up call to the machine learning community.”

For sequential data, such as videos or connected handwriting, feedforward NNs do not suffice. Here, we use our bidirectional or multi-dimensional Long Short-Term Memory recurrent NNs, which learn to maximize the probabilities of label sequences, given raw training sequences.
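[Ed. note: a minimal sketch of the LSTM idea, using the standard textbook gating equations rather than IDSIA’s actual implementation. The key design choice is the additive cell update `c = f*c + i*g`: error signals can flow back through `c` largely unchanged, which is what lets LSTM sidestep the vanishing gradient problem of plain recurrent nets.]

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, params):
    """One forward step of a standard LSTM cell."""
    W, b = params                       # W: (4*hidden, input+hidden)
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input/forget/output gates
    g = np.tanh(g)                                # candidate cell values
    c = f * c + i * g                   # additive update: the "memory" path
    h = o * np.tanh(c)
    return h, c

n_in, n_hid = 3, 5
params = (rng.normal(0, 0.1, (4 * n_hid, n_in + n_hid)), np.zeros(4 * n_hid))

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(7, n_in)):    # a toy sequence of 7 input vectors
    h, c = lstm_step(x, h, c, params)
print("final hidden state:", h)
```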

AA: You said that the field is currently experiencing a Neural Network “ReNNaissance.” What are the key awards your team has won?

JS: In the past three years, in addition to the IJCNN 2011 Traffic Sign Recognition Competition mentioned above, our NNs won seven other highly competitive international visual pattern recognition contests:

ICPR 2012 Contest on “Mitosis Detection in Breast Cancer Histological Images.” This is important for breast cancer prognosis. Humans tend to find it very difficult to distinguish mitotic cells from other tissue. 129 companies, research institutes, and universities in 40 countries registered; 14 sent their results. Our NN won by a comfortable margin.

ISBI 2012 challenge on segmentation of neuronal structures. Given electron microscopy images of stacks of thin slices of animal brains, the goal is to build a detailed 3D model of the brain’s neurons and dendrites. But human experts need many hours to annotate the images: Which parts depict neuronal membranes? Which parts are irrelevant background? Our NNs learn to solve this task through experience with millions of training images. In March 2012, they won the contest on all three evaluation metrics by a large margin, with superhuman performance in terms of pixel error. (Ranks 2–6 went to researchers at ETHZ, MIT, CMU, and Harvard.) A NIPS 2012 paper on this is coming up.

ICDAR 2011 Offline Chinese Handwriting Competition. Our team won the competition although none of its members speaks a word of Chinese. In the not-so-distant future you should be able to point your cell phone camera at text in a foreign language, and get a translation. That’s why we also developed low-power implementations of our NNs for cell phone chips.

Online German Traffic Sign Recognition Contest (2011, first and second rank). Until the last day of the competition, we thought we had a comfortable lead, but then our toughest competitor from NYU surged ahead, and our team (with Dan Ciresan, Ueli Meier, Jonathan Masci) had to work late into the night to re-establish the correct order. :)

ICDAR 2009 Arabic Connected Handwriting Competition (although none of us speaks a word of Arabic).

ICDAR 2009 Handwritten Farsi/Arabic Character Recognition Competition (idem).

ICDAR 2009 French Connected Handwriting Competition. Our French also isn’t that good. :)

AA: Is that a record for international contests?

JS: Yes. Our NNs also set records in important machine-learning benchmarks:

The NORB Object Recognition Benchmark.

The CIFAR Image Classification Benchmark.

The MNIST Handwritten Digits Benchmark, perhaps the most famous benchmark. Our team achieved the first human-competitive result.

AA: Were the algorithms of your group the first deep learning methods to win such international contests?

JS: I think so.

AA: Did any other group win that many?

JS: No. All of this would not have been possible without the hard work of team members including Alex Graves, Dan Ciresan, Ueli Meier, Jonathan Masci, Alessandro Giusti, and others. And of course, our work builds on earlier work by great pioneers including Fukushima, Amari, Werbos, von der Malsburg, LeCun, Poggio, Hinton, Williams, Rumelhart, and many others.

AA: What are some of the practical applications of these techniques?

JS: These NNs are of great practical relevance, because computer vision and pattern recognition are becoming essential for thousands of commercial applications. For example, the future of search engines lies in image and video recognition, as opposed to traditional text search. The most important applications may be in medical imaging, e.g., for automated melanoma detection, cancer prognosis, plaque detection in CT heart scans (to prevent strokes), and hundreds of other health-related areas.

Autonomous robots depend on vision, too — see the AGI 2011 keynote (at Google HQ) by the robot car pioneer Ernst Dickmanns:

Our successes have also attracted the interest of major industrial companies. So we started several industry collaborations. Among other things, we developed:

State-of-the-art handwriting recognition for a software company.

State-of-the-art steel defect detection for the world’s largest steel maker.

State-of-the-art low-power, low-cost pattern recognition for a leading automotive supplier.

Efficient variants of our neural net pattern recognizers for apps running on cell phone chips.

More information on this can be found here and here.

Generally speaking, there is no end in sight for applications of these new-millennium neural networks.

AA: How do your team’s techniques differ from the approaches of Google, Microsoft, and Nuance to automated speech and image recognition?

JS: Of course I cannot officially speak for any particular firm. Let me just say that in recent years leading IT companies (whose names are known by everybody) have shown a lot of interest in our work. Many companies (and academic researchers) are now using the deep learning neural networks we developed and published over the years.

AA: What are your team’s future research plans?

JS: While the methods above work fine in many applications, they are passive learners — they do not learn to actively search for the most informative image parts. Humans, in contrast, use sequential gaze shifts for pattern recognition. This can be much more efficient than the fully parallel one-shot approach.

That’s why we want to combine the algorithms above with variants of our old methods from the early 1990s. Back then, we built what was, to our knowledge, the first artificial fovea sequentially steered by a learning neural controller.
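[Ed. note: a toy illustration of the fovea idea, not the 1990s system itself. A learned neural controller would decide where to look next; here a crude saliency heuristic (pick the highest-variance patch on a coarse grid, then suppress it) stands in for that controller, to show the sequential glimpse-and-shift loop.]

```python
import numpy as np

rng = np.random.default_rng(2)

def glimpse(image, center, size):
    """Extract a square foveal patch around `center`, clipped to the image."""
    r, c = center
    half = size // 2
    r0, c0 = max(r - half, 0), max(c - half, 0)
    return image[r0:r0 + size, c0:c0 + size]

def next_fixation(image, size):
    """Stand-in for a learned controller: fixate the most 'interesting'
    (highest-variance) patch among candidate centers on a coarse grid."""
    best, best_score = (0, 0), -1.0
    for r in range(size // 2, image.shape[0], size):
        for c in range(size // 2, image.shape[1], size):
            score = glimpse(image, (r, c), size).var()
            if score > best_score:
                best, best_score = (r, c), score
    return best

image = rng.random((32, 32))
image[10:16, 10:16] += 5.0     # a high-contrast region the fovea should find

fixations = []
for _ in range(3):             # a short sequence of gaze shifts
    center = next_fixation(image, size=8)
    fixations.append(center)
    # "Inhibition of return": flatten the attended patch so the next
    # fixation moves elsewhere.
    r, c = center
    image[r - 4:r + 4, c - 4:c + 4] = image.mean()
print("fixation sequence:", fixations)
```

The first fixation lands on the high-contrast region; only a few small patches are ever processed, instead of the whole image in one fully parallel shot.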

We also intend to combine this with our Formal Theory of Fun (FTF) and Curiosity & Creativity (see here and here).

An artificial explorer driven by the FTF interacts with its environment and is rewarded not only for solving external, user-defined problems, but also for inventing its own novel problems (e.g., better prediction of aspects of the environment, or speeding up or simplifying previous solutions), thus becoming a more and more general problem solver over time.
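[Ed. note: a toy rendering of the intrinsic-reward idea. The FTF rewards *compression progress*; the crude proxy below rewards the reduction in prediction error achieved by one learning step of the agent’s world model (here a linear predictor of a linear toy world). Once the world is fully predictable, the curiosity reward dies away, nudging an explorer toward still-unlearned regularities.]

```python
import numpy as np

rng = np.random.default_rng(3)

# A toy world: observations follow a simple linear rule the agent can learn.
true_w = np.array([0.5, -0.3])

def observe(x):
    return float(true_w @ x)

# The agent's world model: a linear predictor trained online.
w = np.zeros(2)

def intrinsic_reward(x, y, lr=0.1):
    """Curiosity-style reward: the *improvement* in prediction error from
    one learning step (a crude proxy for compression progress), not the
    error itself -- mere unpredictability earns nothing."""
    global w
    err_before = (y - w @ x) ** 2
    w += lr * (y - w @ x) * x          # one gradient step on squared error
    err_after = (y - w @ x) ** 2
    return err_before - err_after

rewards = []
for _ in range(200):
    x = rng.normal(size=2)
    rewards.append(intrinsic_reward(x, observe(x)))

# Early on, learning progress (and thus curiosity reward) is high;
# once the world is predictable, the intrinsic reward vanishes.
print(f"mean reward, first 20 steps: {np.mean(rewards[:20]):.4f}")
print(f"mean reward, last 20 steps:  {np.mean(rewards[-20:]):.6f}")
```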

Here are some of our video lectures on artificial creativity: