GH: There were always a bunch of people who kept believing in it, particularly in psychology. But among computer scientists, I guess in the ’90s, what happened was data sets were quite small and computers weren't that fast. And on small data sets, other methods, like things called support vector machines, worked a little bit better. They didn't get confused by noise so much. So that was very depressing, because in the ’80s we developed back propagation. We thought it would solve everything. And we were a bit puzzled about why it didn't solve everything. And it was just a question of scale, but we didn't really know that then.

NT: And so why did you think it was not working?

GH: We thought it was not working because we didn't have quite the right algorithms, we didn’t have quite the right objective functions. I thought for a long time it was because we were trying to do supervised learning, where you have to label data, and we should have been doing unsupervised learning, where you just learned from the data with no labels. It turned out it was mainly a question of scale.

NT: That's interesting. So the problem was, you didn't have enough data. You thought you had the right amount of data, but you hadn't labeled it correctly. So you just misidentified the problem?

GH: I thought just using labels at all was a mistake. You do most of your learning without making any use of labels, just by trying to model the structure in the data. I actually still believe that. I think as computers get faster, for any given size data set, if you make computers fast enough, you're better off doing unsupervised learning. And once you've done the unsupervised learning, you'll be able to learn from fewer labels.

NT: So in the 1990s, you're continuing with your research, you’re in academia, you are still publishing, but you aren't solving big problems. Was there ever a moment where you said, you know what, enough of this. I'm going to go try something else? Or did you just say, we're going to keep doing deep learning?

GH: Yes. Something like this has to work. I mean, the connections in the brain are learning somehow, and we just have to figure it out. And probably there's a bunch of different ways of learning connection strengths; the brain’s using one of them. There may be other ways of doing it. But certainly you have to have something that can learn these connection strengths. I never doubted that.

NT: So you never doubted it. When does it first start to seem like it's working?

"I'm not trying to make a model of how the brain works. I'm looking at the brain and saying, 'This thing works, and if we want to make something else that works, we should sort of look to it for inspiration.'" Geoffrey Hinton

GH: One of the big disappointments in the ’80s was, if you made networks with lots of hidden layers, you couldn't train them. That's not quite true, because you could train for fairly simple tasks like recognizing handwriting. But most of the deep neural nets, we didn't know how to train them. And in about 2005, I came up with a way of doing unsupervised training of deep nets. So you take your input, say your pixels, and you'd learn a bunch of feature detectors that were just good at explaining why the pixels were even like that. And then you treat those feature detectors as the data, and you learn another bunch of feature detectors, so you could explain why those feature detectors have those correlations. And you keep learning layers and layers. But what was interesting was, you could do some math and prove that each time you learned another layer, you didn't necessarily have a better model of the data, but you had a bound on how good your model was. And you could get a better bound each time you added another layer.
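The layer-by-layer procedure Hinton describes can be sketched with a toy stack of autoencoders: train one layer of feature detectors on the pixels, then treat those detectors' activations as the data for the next layer. This is only an illustration under simplified assumptions (Hinton's actual method used restricted Boltzmann machines trained with contrastive divergence, not gradient-trained autoencoders; all sizes and names below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(data, n_hidden, epochs=200, lr=0.1):
    """Train a one-layer sigmoid autoencoder with tied weights by
    gradient descent on squared reconstruction error. Returns (W, b_h)
    so the learned feature detectors can be applied to new data."""
    n_samples, n_visible = data.shape
    W = rng.normal(0, 0.1, (n_visible, n_hidden))
    b_h = np.zeros(n_hidden)   # feature-detector biases
    b_v = np.zeros(n_visible)  # reconstruction biases
    for _ in range(epochs):
        h = sigmoid(data @ W + b_h)        # feature-detector activations
        recon = sigmoid(h @ W.T + b_v)     # reconstruction of the input
        err = recon - data
        # Backprop through the decoder and encoder (tied weights).
        d_recon = err * recon * (1 - recon)
        d_h = (d_recon @ W) * h * (1 - h)
        grad_W = data.T @ d_h + (h.T @ d_recon).T
        W -= lr * grad_W / n_samples
        b_h -= lr * d_h.mean(axis=0)
        b_v -= lr * d_recon.mean(axis=0)
    return W, b_h

def greedy_layerwise(data, layer_sizes):
    """Stack layers greedily: each layer's feature detectors
    become the 'data' that the next layer learns to explain."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_autoencoder(x, n_hidden)
        layers.append((W, b))
        x = sigmoid(x @ W + b)  # treat the feature detectors as the new data
    return layers

# Toy "pixel" data: 100 binary vectors of length 16.
pixels = (rng.random((100, 16)) > 0.5).astype(float)
stack = greedy_layerwise(pixels, [8, 4])
```

Each call to `train_autoencoder` only ever sees one layer's input, which is the point of the greedy scheme: no gradients have to flow through many layers at once, which is what made deep nets so hard to train in the ’80s.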

NT: What do you mean, you had a bound on how good your model was?

GH: Once you've got a model, you can say, “How surprising does the model find this data?” You show it some data and you say, “Is that the kind of thing you believe in, or is that surprising?” And you can sort of measure something that says that. And what you'd like is a good model, one that looks at the data and says, “Yeah, yeah, I knew that. It's unsurprising.” It's often very hard to compute exactly how surprising the model finds the data. But you can compute a bound on that. You can say that this model finds the data less surprising than that one. And you could show that as you add extra layers of feature detectors, you get a model, and each time you add a layer, the bound on how surprising it finds the data gets better.
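The "bound" here is the variational lower bound on the log-probability the model assigns to the data (how "unsurprising" it finds it). In rough notation, with $x$ the data, $h$ a layer of feature detectors, and $q(h \mid x)$ the distribution over features that the learned detectors define:

```latex
\log p(x) \;\ge\; \sum_{h} q(h \mid x) \log p(x, h) \;-\; \sum_{h} q(h \mid x) \log q(h \mid x)
```

The gap between the two sides is the KL divergence between $q(h \mid x)$ and the true posterior $p(h \mid x)$, so the bound is exact when the detectors capture the posterior perfectly. The result Hinton alludes to, from the 2006 deep belief net work (Hinton, Osindero & Teh), is roughly that if each new layer is initialized from the one below and then trained, this bound cannot get worse, even though the exact value of $\log p(x)$ remains intractable to compute.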