We must stop crediting the wrong people for inventions made by others. Instead let's heed the recent call in the journal Nature: "Let 2020 be the year in which we value those who ensure that science is self-correcting." [SV20]

As those who know me can testify, finding and citing original sources of scientific and technological innovations is important to me, whether they are mine or other people's [DL1] [DL2] [NASC1-9]. The present page is offered as a resource for members of the machine learning community who share this inclination. I am also inviting others to contribute additional relevant references. By grounding research in its true intellectual foundations, I do not mean to diminish important contributions made by others. My goal is to encourage the entire community to be more scholarly in its efforts and to recognize the foundational work that sometimes gets lost in the frenzy of modern AI and machine learning.

Here I will focus on six false and/or misleading attributions of credit to Dr. Hinton in the press release of the 2019 Honda Prize [HON]. For each claim there is a paragraph (I, II, III, IV, V, VI) labeled by "Honda," followed by a critical comment labeled "Critique." Reusing material and references from recent blog posts [MIR] [DEC], I'll point out that Hinton's most visible publications failed to mention essential relevant prior work - this may explain some of Honda's misattributions.

Executive Summary. Hinton has made significant contributions to artificial neural networks (NNs) and deep learning, but Honda credits him for fundamental inventions of others whom he did not cite. Science must not allow corporate PR to distort the academic record. Sec. I: Modern backpropagation was created by Linnainmaa (1970), not by Rumelhart & Hinton & Williams (1985). Ivakhnenko's deep feedforward nets (since 1965) learned internal representations long before Hinton's shallower ones (1980s). Sec. II: Hinton's unsupervised pre-training for deep NNs in the 2000s was conceptually a rehash of my unsupervised pre-training for deep NNs in 1991. And it was irrelevant for the deep learning revolution of the early 2010s which was mostly based on supervised learning - twice my lab spearheaded the shift from unsupervised pre-training to pure supervised learning (1991-95 and 2006-11). Sec. III: The first superior end-to-end neural speech recognition was based on two methods from my lab: LSTM (1990s-2005) and CTC (2006). Hinton et al. (2012) still used an old hybrid approach of the 1980s and 90s, and did not compare it to the revolutionary CTC-LSTM (which was soon on most smartphones). Sec. IV: Our group at IDSIA had superior award-winning computer vision through deep learning (2011) before Hinton's (2012). Sec. V: Hanson (1990) had a variant of "dropout" long before Hinton (2012). Sec. VI: In the 2010s, most major AI-based services across the world (speech recognition, language translation, etc.) on billions of devices were mostly based on our deep learning techniques, not on Hinton's. Repeatedly, Hinton omitted references to fundamental prior art (Sec. I & II & III & V) [DL1] [DL2] [DLC] [MIR] [R4-R8]. However, as Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."

I. Honda: "Dr. Hinton has created a number of technologies that have enabled the broader application of AI, including the backpropagation algorithm that forms the basis of the deep learning approach to AI."

Critique: Hinton and his co-workers have made certain significant contributions to deep learning, e.g., [BM] [CDI] [RMSP] [TSNE] [CAPS]. However, the claim above is plain wrong. He was 2nd of 3 authors of an article on backpropagation [RUM] (1985) which failed to mention that 3 years earlier, Paul Werbos proposed to train neural networks (NNs) with this method (1982) [BP2]. And the article [RUM] even failed to mention Seppo Linnainmaa, the inventor of this famous algorithm for credit assignment in networks [BP1] (1970), also known as "reverse mode of automatic differentiation." (In 1960, Kelley already had a precursor thereof in the field of control theory [BPA]; compare [BPB] [BPC].) See also [R7].
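
For readers unfamiliar with the term, the following is a minimal sketch of reverse-mode automatic differentiation on scalars. It is an illustration only and does not reproduce the formulation of [BP1], [BP2] or [RUM]; the names Var, tanh and backward, as well as the tiny two-weight "network" at the end, are hypothetical and chosen for this example.

```python
import math


class Var:
    """Scalar node in a computation graph (hypothetical helper for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value
        # Each entry is (input_node, d(self)/d(input_node)).
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def tanh(x):
    t = math.tanh(x.value)
    return Var(t, ((x, 1.0 - t * t),))


def backward(output):
    """Propagate d(output)/d(node) from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)  # inputs are appended before the nodes that use them

    visit(output)
    output.grad = 1.0
    # Walk from the output towards the inputs, applying the chain rule once per edge.
    for node in reversed(order):
        for parent, local_derivative in node.parents:
            parent.grad += node.grad * local_derivative


# Tiny example (assumed for illustration): y = w2 * tanh(w1 * x)
x, w1, w2 = Var(0.5), Var(2.0), Var(-3.0)
y = w2 * tanh(w1 * x)
backward(y)
print(w1.grad, w2.grad)
```

A single backward sweep yields the derivative of the output with respect to every weight, at a cost comparable to that of the forward pass. This is what makes the method practical for credit assignment in networks with many parameters.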

By 1985, compute had become about 1,000 times cheaper than in 1970, and desktop computers had become accessible in some academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs [RUM]. But this was essentially just an experimental analysis of a known method [BP1][BP2]. And the authors [RUM] did not cite the prior art [DLC]. (BTW, Honda [HON] claims over 60,000 academic references to [RUM] which seems exaggerated [R5].) More on the history of backpropagation can be found at Scholarpedia [DL2] and in my award-winning survey [DL1].

The first successful method for learning useful internal representations in hidden layers of deep nets was published two decades before [RUM]. In 1965, Ivakhnenko & Lapa had the first general, working learning algorithm for deep multilayer perceptrons with arbitrarily many layers (also with multiplicative gates which have become popular) [DEEP1-2] [DL1] [DL2]. Ivakhnenko's paper of 1971 [DEEP2] already described a deep learning feedforward net with 8 layers, much deeper than those of 1985 [RUM], trained by a highly cited method which was still popular in the new millennium [DL2], especially in Eastern Europe, where much of Machine Learning was born. (Ivakhnenko did not call it an NN, but that's what it was.) Hinton has never cited this, not even in his recent survey [DLC]. Compare [MIR] (Sec. 1) [R8].

Note that there is a misleading "history of deep learning" propagated by Hinton and co-authors, e.g., Sejnowski [S20]. It goes more or less like this: In 1958, there was "shallow learning" in NNs without hidden layers [R58]. In 1969, Minsky & Papert [M69] showed that such NNs are very limited "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s" [S20]. However, "shallow learning" (through linear regression and the method of least squares) has actually existed since about 1800 (Gauss & Legendre [DL1] [DL2]). Ideas from the early 1960s on deeper adaptive NNs [R61] [R62] did not get very far, but by 1965, deep learning worked [DEEP1-2] [DL2] [R8]. So the 1969 book [M69] addressed a "problem" that had already been solved for 4 years. (Maybe Minsky really did not know; he should have known though.)

II. Honda: "In 2002, he introduced a fast learning algorithm for restricted Boltzmann machines (RBM) that allowed them to learn a single layer of distributed representation without requiring any labeled data. These methods allowed deep learning to work better and they led to the current deep learning revolution."

Critique: No, Hinton's interesting unsupervised [CDI] pre-training for deep NNs (e.g., [UN4]) was irrelevant for the current deep learning revolution. In 2010, our team showed that deep feedforward NNs (FNNs) can be trained by plain backpropagation and do not at all require unsupervised pre-training for important applications [MLP1] - see Sec. 2 of [DEC]. This was achieved by greatly accelerating traditional FNNs on highly parallel graphics processing units called GPUs. Subsequently, in the early 2010s, this type of unsupervised pre-training was largely abandoned in commercial applications - see [MIR], Sec. 19.

Apart from this, Hinton's unsupervised pre-training for deep FNNs (2000s, e.g., [UN4]) was conceptually a rehash of my unsupervised pre-training for deep recurrent NNs (RNNs) (1991) [UN0-UN3] which he did not cite. Hinton's 2006 justification was essentially the one I used for my stack of RNNs called the neural history compressor [UN1-2]: each higher level in the NN hierarchy tries to reduce the description length (or negative log probability) of the data representation in the level below. (BTW, [UN1-2] also introduced the concept of "compressing" or "collapsing" or "distilling" one NN into another, another technique later reused by Hinton without citing it - see Sec. 2 of [MIR] and [R4].) By 1993, my method was able to solve previously unsolvable "Very Deep Learning" tasks of depth > 1000 [UN2] [DL1]. See [MIR], Sec. 1: First Very Deep NNs, Based on Unsupervised Pre-Training (1991). (See also our 1996 work on unsupervised neural probabilistic models of text [SNT] and on unsupervised pre-training of FNNs through adversarial NNs [PM2].) Then, however, we replaced the history compressor by the even better, purely supervised LSTM - see Sec. III. That is, twice my lab spearheaded a shift from unsupervised to supervised learning (which dominated the deep learning revolution of the early 2010s [DEC]). See [MIR], Sec. 19: From Unsupervised Pre-Training to Pure Supervised Learning (1991-95 & 2006-11).
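
The equivalence between "description length" and "negative log probability" mentioned above can be written out as follows. This is a standard coding-theory identity stated under assumed notation (the symbols x_t, p and L are introduced here and are not taken from [UN1-2]); it ignores the rounding to whole bits that a real code requires.

```latex
% Ideal code length, in bits, of a sequence x_1, ..., x_T under a predictive model p:
L(x_{1:T}) \;=\; -\log_2 p(x_{1:T}) \;=\; \sum_{t=1}^{T} -\log_2 p\bigl(x_t \mid x_{1:t-1}\bigr)
```

In this reading, a level that predicts its input better assigns it higher probability and hence a shorter code, so reducing description length and reducing negative log probability are the same objective.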

III. Honda: "In 2009, Dr. Hinton and two of his students used multilayer neural nets to make a major breakthrough in speech recognition that led directly to greatly improved speech recognition."

Critique: This is very misleading. See Sec. 1 of [DEC]: The first superior end-to-end neural speech recogniser that outperformed the state of the art was based on two methods from my lab: (1) Long Short-Term Memory (LSTM, 1990s-2005) [LSTM0-6] (overcoming the famous vanishing gradient problem first analysed by my student Sepp Hochreiter in 1991 [VAN1]); (2) Connectionist Temporal Classification [CTC] (my student Alex Graves et al., 2006). Our team successfully applied CTC-trained LSTM to speech in 2007 [LSTM4] (also with hierarchical LSTM stacks [LSTM14]). This was very different from previous hybrid methods since the late 1980s which combined NNs and traditional approaches such as Hidden Markov Models (HMMs), e.g., [BW] [BRI] [BOU]. Hinton et al. (2009-2012) still used the old hybrid approach [HYB12]. They did not compare their hybrid to CTC-LSTM. Alex later reused our superior end-to-end neural approach [LSTM4] [LSTM14] as a postdoc in Hinton's lab [LSTM8]. By 2015, when compute had become cheap enough, CTC-LSTM dramatically improved Google's speech recognition [GSR] [GSR15] [DL4]. This was soon on almost every smartphone. Google's on-device speech recognition of 2019 (no longer on the server) was still based on LSTM. See [MIR], Sec. 4.

IV. Honda: "In 2012, Dr. Hinton and two more students revolutionized computer vision by showing that deep learning worked far better than the existing state-of-the-art for recognizing objects in images."

Critique: See Sec. 2 of [DEC] (relevant parts repeated here for convenience): The basic ingredients of the computer vision revolution through convolutional NNs (CNNs) were developed by Fukushima (1979), Waibel (1987), LeCun (1989), Weng (1993) and others since the 1970s [CNN1-4]. A success of Hinton's team (ImageNet, Dec 2012) [GPUCNN4] was mostly due to GPUs used to speed up CNNs (they also used Malsburg's ReLUs [CMB] and a variant of Hanson's rule [Drop1] without citation; see Sec. V). However, the first superior award-winning GPU-based CNN was created earlier, in 2011, by our team in Switzerland (my postdoc Dan Ciresan et al.) [GPUCNN1,3,5] [R6]. Our deep and fast CNN, sometimes called "DanNet," was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated CNNs [GPUCNN]. Already in 2011, it showed "that deep learning worked far better than the existing state-of-the-art for recognizing objects in images." In fact, it won 4 important computer vision competitions in a row between May 15, 2011, and September 10, 2012 [GPUCNN5], before the similar GPU-accelerated CNN of Hinton's student Krizhevsky won the ImageNet 2012 contest [GPUCNN4-5] [R6].

At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest (where a team of Hinton's frequent co-author LeCun took second place). Even the NY Times mentioned this. DanNet was also the first deep CNN to win a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), a contest on object detection in large images (ICPR, 10 Sept 2012), and, at the same time, a medical imaging contest on cancer detection. All before ImageNet 2012 [GPUCNN4-5] [R6]. Our CNN image scanners were 1000 times faster than previous methods [SCAN]. The tremendous importance for health care etc. is obvious. Today IBM, Siemens, Google and many startups are pursuing this approach. Much of modern computer vision is extending the work of 2011, e.g., [MIR], Sec. 19.

V. Honda: "To achieve their dramatic results, Dr. Hinton also invented a widely used new method called "dropout" which reduces overfitting in neural networks by preventing complex co-adaptations of feature detectors."

Critique: However, "dropout" is actually a variant of Hanson's much earlier stochastic delta rule (1990) [Drop1]. Hinton's 2012 paper [GPUCNN4] did not cite this.
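
For orientation, here is a minimal sketch of what "dropout" does to a layer's activations, assuming the commonly described formulation: units are zeroed at random during training, and all units are used at test time with their outputs scaled by the keep probability. The function name dropout and its parameters are hypothetical, and the sketch does not reproduce the code of [GPUCNN4] or Hanson's stochastic delta rule [Drop1].

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Apply dropout to a list of unit activations (illustrative sketch)."""
    keep_prob = 1.0 - drop_prob
    if not training:
        # Test time: keep every unit, scaled by the probability that it
        # was kept during training, so expected magnitudes match.
        return [a * keep_prob for a in activations]
    # Training time: each unit is independently zeroed with probability drop_prob.
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


acts = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(acts, drop_prob=0.5, training=True, rng=random.Random(0))
test_out = dropout(acts, drop_prob=0.5, training=False)
print(train_out, test_out)
```

In both dropout and the earlier rule, randomness is injected into the network during training; the sketch above only shows the unit-masking variant.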

Apart from this, already in 2011 we showed that dropout is not necessary to win computer vision competitions and achieve superhuman results - see Sec. IV above. Back then, the only really important task was to make CNNs deep and fast on GPUs [GPUCNN1,3,5] [R6]. (Today, dropout is rarely used for CNNs.)

VI. Honda: "Of the countless AI-based technological services across the world, it is no exaggeration to say that few would have been possible without the results Dr. Hinton created."

Critique: Name one that would NOT have been possible! Most famous AI applications are based on results created by others. Here is a representative list of our contributions, taken from Sec. 1 and Sec. 2 of [DEC]:

1. Computer vision. See Sec. IV, V above, and Sec. 2 of [DEC].

2. Speech recognition. See Sec. III above, and Sec. 1 of [DEC].

3. Language processing. The first superior end-to-end neural machine translation was also based on our LSTM. In 1995, we already had excellent neural probabilistic models of text [SNT]. In 2001, we showed that our LSTM can learn languages unlearnable by traditional models such as HMMs [LSTM13]. That is, a neural "subsymbolic" model suddenly excelled at learning "symbolic" tasks. Compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate [GT16] [WU] (which mentions LSTM over 50 times) and Facebook Translate [FB17] were based on two connected LSTMs [S2S], one for incoming texts, one for outgoing translations - much better than what existed before [DL4]. By 2017, Facebook's users made 30 billion LSTM-based translations per week [FB17] [DL4]. Compare: the most popular YouTube video needed 2 years to achieve only 6 billion clicks.

4. Connected handwriting recognition. Already in 2009, through the efforts of Alex, CTC-LSTM [CTC] [LSTM1-6] became the first recurrent NN (RNN) to win international competitions, namely, three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).

5. Robotics. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics, e.g., [LSTM-RL] [RPG]. In the 2010s, combinations of RL and LSTM have become standard. For example, in 2018, an RL LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher [OAI1] [OAI1a].

6. Video Games. In 2019, DeepMind famously beat a pro player in the game of StarCraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar whose brain has a deep LSTM core trained by RL [DM3]. An RL LSTM (with 84% of the model's total parameter count) also was the core of the famous OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018) [OAI2] [OAI2a]. See [MIR], Sec. 4.

In the recent decade of deep learning, all of 2-6 above depended on our LSTM. See [MIR], Sec. 4. And there are innumerable additional LSTM applications ranging from healthcare & chemistry & molecule design to stock market prediction and self-driving cars [DEC]. By 2016, more than a quarter of the power of all those Tensor Processing Units in Google's datacenters was used for LSTM (only 5% for CNNs) [JOU17]. Apparently [LSTM1] has become the most cited AI and NN research paper of the 20th century [R5]. By 2019, it got more citations per year than any other computer science paper of the 20th century [DEC]. The current record holder of the 21st century [HW2][R5] is also related to LSTM, since ResNet [HW2] (Dec 2015) is a special case of our Highway Net (May 2015) [HW1], the feedforward net version of vanilla LSTM [LSTM2] and the first working, really deep feedforward NN with over 100 layers. (Admittedly, however, citations are a highly questionable measure of true impact [NAT1].)

7. Medical imaging etc. Some of the most important NN applications are in healthcare. In 2012, our Deep Learner was the first to win a medical imaging contest (on cancer detection), before ImageNet 2012 [GPUCNN5] [R6]. Similar for materials science and quality control: Already in 2010, we introduced our deep and fast GPU-based NNs to Arcelor Mittal, the world's largest steel maker, and were able to greatly improve steel defect detection [ST]. This may have been the first deep learning breakthrough in heavy industry. There are many other early applications of our deep learning methods which were frequently used by Hinton.
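
The earlier remark that ResNet [HW2] is a special case of the Highway Net [HW1] can be made concrete with the layer equations as they are usually written. The notation below (H for the layer's nonlinear transform, T for the transform gate, C for the carry gate) is assumed for this illustration and should be checked against [HW1] and [HW2].

```latex
% Highway layer: gated mix of the transformed input and the input itself
y = H(x)\,\odot\,T(x) \;+\; x\,\odot\,C(x)

% Setting both gates to be always open, T(x) = C(x) = 1, gives the residual layer
y = H(x) + x
```

Under this reading, a residual layer is a highway layer whose gates are fixed to 1 rather than learned.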

Our additional priority disputes with Hinton included: compressing / distilling one NN into another [MIR] (Sec. 2), learning sequential attention with NNs [MIR] (Sec. 9), fast weights through outer products [MIR] (Sec. 8), unsupervised pre-training for deep NNs [MIR] (Sec. 1), and other topics. Compare [R4].

Concluding Remarks

Dr. Hinton and co-workers have made certain significant contributions to NNs and deep learning, e.g., [BM] [CDI] [RMSP] [TSNE] [CAPS]. But his most visible work (lauded by Honda) popularized methods created by other researchers whom he did not cite. As emphasized earlier [DLC]: "The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)."

It is a sign of our field's immaturity that popularizers are sometimes still credited for inventions of others. Honda should correct this. Else others will. Science must not allow corporate PR to distort the academic record. Similar for certain scientific journals, which "need to make clearer and firmer commitments to self-correction" [SV20].

Unfortunately, Hinton's frequent failures to credit essential prior work by others cannot serve as a role model for PhD students who are told by their advisors to perform meticulous research on prior art, and to avoid at all costs the slightest hint of plagiarism.

Yes, this critique is also an implicit critique of certain other awards to Dr. Hinton. It is also related to some of the most popular posts and comments of 2019 at reddit/ml, the largest machine learning forum with over 800k subscribers. See, e.g., posts [R4-R8] influenced by [MIR] (although my name is frequently misspelled).

Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas, as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation [NASC1-2], the telephone [NASC3], the computer [NASC4-7], resilient robots [NASC8], and scientists of the 19th century [NASC9].

At least in science, by definition, the facts will always win in the end. As long as the facts have not yet won it's not yet the end. (No fancy award can ever change that.)

As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."