Interview by MIT Assistant Professor Song Han at NeurIPS 2019









Also available on "Leader in AI" podcast: Apple Podcast | Spotify | Google Play Music





Prof. Yoshua Bengio is one of the founding fathers of Deep Learning and winner of the 2018 Turing Award jointly with Geoffrey Hinton and Yann LeCun. He was interviewed by Song Han, MIT assistant professor and Robin.ly Fellow Member, at NeurIPS 2019 to share in-depth insights on deep learning research, specifically the trend from unconscious to conscious deep learning. He also discussed how to mitigate the carbon footprint of AI research through higher computing efficiency, and urged young researchers to focus on ambitious, long-term goals and to keep an open engagement with their peers.


Full Interview Transcripts

Yoshua Bengio @ NeurIPS 2019: From Deep Learning to Consciousness





Journey in Deep Learning





Song Han:

You’ve been working in the deep learning area for many decades. Could you share with us your journey, your mission, and how it evolved?





Yoshua Bengio:

Right, so, you know, the relationship between scientists, researchers, and ideas can be a very emotional one. So I've always been very passionate about the research I do. But really, I fell in love with what I call “the amazing hypothesis”: that there might be a few simple principles that can explain our intelligence. That was when I started reading neural network papers, around 1985. It was a long time ago.





Maybe the papers that impressed me the most were coming out of Geoffrey Hinton's group. And I kind of knew this was what I wanted to do, and that has continued since then. Initially, when I started in the late 80s, very few people were doing it, but it was becoming hot and, you know, many people were entering the field. And then I finished my Ph.D. in '91. But in the 90s, the interest in neural networks gradually decreased as other machine learning approaches took over. So there was a long period where what really kept me working on this was precisely this emotional aspect: you know, I really felt strongly about these ideas.





I also looked around and tried to understand some of the limitations of neural networks and other methods like kernel methods, which helped me understand more mathematically why my intuitions were right. And then of course, in the last decade, things have really exploded in successful applications and benchmarks, and the whole field of machine learning, thanks to deep learning, has become something that's not just done in universities. It has become a social thing: it's a huge business, and it's changing society, potentially sometimes in bad ways. And so we also have a responsibility for it.









From Attention to Consciousness





Song Han:

Yesterday, you gave a great presentation on going “from system 1 deep learning to system 2 deep learning”, and I think the consciousness/attention model is the core part of it. Could you share more about your thoughts and your findings?





Yoshua Bengio:

Yes, so it's interesting. The C-word, consciousness, has been a bit of a taboo in many scientific communities. But in the last couple of decades, neuroscientists and cognitive scientists have made quite a bit of progress in starting to pin down what consciousness is about. And of course, there are different aspects to it. There are several interesting theories, like the global workspace theory. And now I think we are at a stage where machine learning, especially deep learning, can start looking into neural net architectures, objective functions, and frameworks that can achieve some of these functionalities. What's most exciting for me is that these functionalities may provide evolutionary advantages to humans, and thus, if we understand them, they would also be helpful for AI.





Song Han:

On the relationship between consciousness and attention: is it fair to say that attention finds a mapping from the high-dimensional unconscious set to the low-dimensional conscious set to help with generalization?





Yoshua Bengio:

Exactly. And the interesting thing is that this mechanism of selecting just a few variables at a time corresponds, according to my theory, to something you can think of as a regularizer: an a priori assumption about the world, which humans use in order to form the kind of high-level concepts that we manipulate with language. So, if I say, “if I drop the ball, it is going to fall on the ground”, this sentence involves very few concepts at a time. Attention has selected just the right words, a few concepts; and together, they have a strong dependency. So, you know, I can predict the effect of some action, for example, and that's what the sentence claims. And it gives a very high probability to that event. And in a way, that's kind of outstanding.





It's kind of extraordinary that we are able to make such predictions about the future using so few pieces of information, so few variables. And this attention mechanism can thus correspond to an assumption about how we organize our knowledge about the world, so it's about knowledge representation and it's about language. So the kinds of concepts that we manipulate with language would correspond to the kinds of concepts we have at the highest level of representation in our minds.
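The selection Bengio describes, mapping many candidate variables down to a few that enter the bottleneck, can be sketched in a few lines. This is a toy illustration with made-up dimensions, not code from any of the papers discussed; the low softmax temperature is what makes the selection nearly discrete:

```python
import numpy as np

def soft_select(query, keys, values, temperature=0.1):
    """Soft attention: map a large set of candidate variables (values)
    down to one small weighted combination, driven by query-key similarity.
    A low temperature makes the selection nearly discrete, so only a few
    variables effectively pass through the 'conscious' bottleneck."""
    scores = keys @ query / temperature           # (n,) similarity per candidate
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over candidates
    return weights @ values, weights

rng = np.random.default_rng(0)
n, d = 100, 32                    # 100 "unconscious" variables, 32-dim each
keys = rng.normal(size=(n, d))
values = rng.normal(size=(n, d))
query = keys[3] + 0.01 * rng.normal(size=d)   # the query resembles item 3

selected, w = soft_select(query, keys, values)
print(int(w.argmax()))            # attention concentrates on item 3
```

With a higher temperature the same code blends many items instead, which is the opposite of the few-variables-at-a-time assumption described above.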





Song Han:

So not only language but also reinforcement learning: as you showed in the recently published RIM (Recurrent Independent Mechanisms) paper, the Atari games show great generalization ability compared with a conventional RNN.





Yoshua Bengio:

Yes. So this notion of consciousness is, I think, particularly important for learning agents. An agent is an entity that acts in some environment, like us and animals, and the machines we might build, and robots. And agents have this problem that the world is changing around them; they don't always see the same distribution. So they need to be able to adapt to those changes, to understand those changes very quickly. And what I'm proposing is that the mechanisms of consciousness help them do that by organizing their knowledge into smaller pieces that can be recombined dynamically, like in the RIM paper. Then we can be more robust to those changes in the world. And we found in experiments, indeed, that these kinds of architectures allow the model to generalize better, for example, to longer sequences than what has been seen during training.
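A minimal sketch of the idea Bengio describes, independent modules of which only the most relevant few update at each step, might look like the following. This is a simplified toy version; the shapes and the crude relevance score are illustrative assumptions, not the actual RIM architecture:

```python
import numpy as np

def rim_step(states, x, W_in, W_rec, k=2):
    """One step of a RIM-style update (simplified): each module scores
    its relevance to the current input; only the top-k modules update
    their hidden state, the rest keep theirs (default dynamics).
    states: (n_modules, h), x: (d,), W_in: (n_modules, h, d), W_rec: (n_modules, h, h)."""
    inputs = W_in @ x                        # (n_modules, h): each module's read of x
    relevance = np.abs(inputs).sum(axis=1)   # crude per-module attention score
    active = np.argsort(relevance)[-k:]      # the k most relevant modules
    new_states = states.copy()
    for m in active:                         # inactive modules are not recomputed
        new_states[m] = np.tanh(W_rec[m] @ states[m] + inputs[m])
    return new_states, active

rng = np.random.default_rng(1)
n, h, d = 4, 3, 2                 # 4 small modules, toy sizes
states = rng.normal(size=(n, h))
W_in = rng.normal(size=(n, h, d))
W_rec = rng.normal(size=(n, h, h))
new_states, active = rim_step(states, rng.normal(size=d), W_in, W_rec, k=2)
```

Because the inactive modules are left untouched, the pieces of knowledge stay separable and can be recombined differently when the input distribution changes.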





Song Han:

Therefore, we no longer need to shuffle the data, but can make the model generalize by attending only to the data it should...





Yoshua Bengio:

Yeah, we don't want to shuffle the data. When we shuffle the data, we are destroying information, right? There was a structure, and after we shuffled, we lost that structure. That structure may, you know, come from the time at which things were collected. Maybe initially we were in some regime of the data, and then something changed and the data became a bit different. When we shuffle, we lose that information. And of course, it makes it easier to generalize, but it's cheating, because in the real world the data is not shuffled. What's going to happen tomorrow is not going to be quite the same as what happened yesterday. So instead of shuffling your data, what we have to do is build systems that are going to be robust to those changes. And that's also where meta learning becomes important.

Learning to Learn





Song Han:

Yeah, talking about meta learning: you had a paper back in the 1990s about meta learning and learning to learn, and recently it's getting very hot again for neural architecture search. Could you share some of your thoughts and the advances in “learning to learn”?





Yoshua Bengio:

Yeah. So when I started thinking about “learning to learn”, in those days we didn't call it meta learning; it was just learning to learn. I was inspired by the relationship between learning in the individual person or animal and evolution. You can think of evolution as somewhat like an optimization, not exactly, in the sense that different species get better and better at what they're doing through evolution. And that outer loop is like, you know, a slow timescale: there's this process that evolves better and better solutions. But then, within the lifetime of an individual, there are also improvements due to learning. So it's like learning inside learning.





And what we showed in this paper is that you can use the same tools we already had, which were just back-propagation, to optimize the two things together. More recently, in the last few years, people have applied these ideas to optimize how the learner is going to not just do better at the task, but generalize. So, learn how to better generalize; and in fact, you can learn how to better generalize even if the world changes. You can learn to be robust to changes in distribution, which is not possible if you have the normal static framework, where you assume one distribution and you do normal training. But meta learning, in theory at least, allows end-to-end learning of how to generalize to changes in distribution and be robust to them. And that's kind of significant conceptually.
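The two nested loops can be sketched with a toy example. This uses a Reptile-style outer update as a simple stand-in (not the method from the paper discussed): the inner loop adapts a parameter to one sampled task, and the outer loop slowly moves the shared initialization so that future inner loops adapt faster:

```python
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(w0, a, lr=0.1, steps=20):
    """Inner loop: fit y = w * x to one task y = a * x by gradient
    descent on squared error, starting from the shared initialization w0."""
    w = w0
    x = rng.normal(size=32)                      # this task's training data
    for _ in range(steps):
        grad = np.mean(2 * (w * x - a * x) * x)  # d/dw of mean squared error
        w -= lr * grad
    return w

w0 = 0.0            # the outer-loop parameter: a meta-learned initialization
meta_lr = 0.1
for _ in range(200):                       # outer loop over sampled tasks
    a = rng.normal(loc=3.0, scale=0.5)     # a new task: slope drawn near 3
    w_adapted = inner_loop(w0, a)
    w0 += meta_lr * (w_adapted - w0)       # nudge the initialization toward solutions

# w0 ends up near 3.0, the center of the task distribution, so each new
# inner loop starts close to its solution and adapts in fewer steps.
```

The same structure also hints at the robustness point: because the outer loop only ever sees adapted solutions across many tasks, it learns something that transfers when the next task is different from the last.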





Song Han:

Totally agree. And also, since we are adding a “for loop” outside the learning “for loop”, the computational complexity gets pretty heavy.





Yoshua Bengio:

That's why for many years this area was not very popular. But now we have a lot more compute power than in the early 90s, and we are starting to see how things like learning from few examples can be done with meta learning, thanks to the extra computing capabilities of GPUs and TPUs.









Carbon Footprint & Computing Efficiency





Song Han:

It has also been noticed that the carbon footprint caused by such training is huge. You have a website for that, calculating the CO2 emissions, the CO2 cost. What are your thoughts about it, thinking environmentally?





Yoshua Bengio:

Right. So nothing is simple in life, and there are lots of important subtleties. Machine learning can be used to tackle climate change. We wrote a very long paper explaining many applications: in climate science, in designing better materials, in being more efficient in the use of electricity, or in taking better advantage of renewable energy. So machine learning can be used to help us in this big challenge for humanity, which is climate change.





But at the same time, all this computing power is potentially drawing more electricity that comes from non-renewables and creates a large carbon footprint. So it depends where you're running your experiments. If your GPU is drawing electricity in Quebec, which is where I live, it's a hundred percent renewable hydroelectricity, and so there's essentially no carbon footprint. But if you're doing it in the US, it depends where, or in China for example, where there's a lot of coal, then it's a different story, and your experiments, if they are big experiments, can really draw a lot of power. And what's maybe more concerning is that researchers, in industry especially, are gradually building bigger and bigger models. And it's growing very fast, with a doubling period of about every three months.
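A back-of-envelope version of the kind of emissions estimate being discussed is simply energy drawn times the grid's carbon intensity. The power, overhead, and intensity figures below are illustrative assumptions, not authoritative numbers:

```python
def training_co2_kg(gpu_watts, n_gpus, hours, pue=1.5, grid_kg_per_kwh=0.4):
    """kg of CO2 for one training run: energy drawn times the grid's
    carbon intensity. pue accounts for datacenter overhead (cooling etc.)."""
    kwh = gpu_watts * n_gpus * hours / 1000 * pue
    return kwh * grid_kg_per_kwh

# The same job on two grids: near-zero-carbon hydro vs. a coal-heavy mix.
job = dict(gpu_watts=300, n_gpus=8, hours=100)    # 360 kWh with pue=1.5
hydro = training_co2_kg(**job, grid_kg_per_kwh=0.002)
coal = training_co2_kg(**job, grid_kg_per_kwh=0.8)
print(round(hydro, 2), round(coal, 1))   # roughly 0.72 vs 288.0 kg
```

The point of the comparison is Bengio's: for an identical job, the location of the datacenter can change the footprint by orders of magnitude.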





Song Han:

Faster than Moore's Law.





Yoshua Bengio:

Faster than Moore's Law, exactly. So, you know, we can't sustain that expansion; eventually it's going to take all the electricity to run these AI systems. That's not good. So we need people like you to help us design systems that are going to be more efficient in terms of energy. So tell me, you know, how do you think we should address this?





Song Han:

Yeah, thanks for the question. I think we need algorithm and hardware co-design for such challenging tasks. Conventionally, we relied on Moore's Law to give us the free lunch of performance improvements, expecting the computer to be faster every year. As Moore's Law slows down, we want to look at both the algorithm and the hardware: how shall we reduce the memory footprint? I think it is the memory footprint that drives the energy cost: computation is cheap, memory is expensive. We've had several successes, like Deep Compression, where we can reduce the model size by an order of magnitude to reduce memory. The Efficient Inference Engine saves computation by skipping the zeros (zero multiplied with anything is zero). Recently we've been working on reducing the cost of neural architecture search for transformers, which previously took the carbon footprint of five cars over their lifetimes.
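The two ideas mentioned here, magnitude pruning in the style of Deep Compression and skipping multiplications by zero as in the Efficient Inference Engine, can be sketched roughly as follows. This is a toy illustration of the principle, not the actual systems:

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Deep-Compression-style pruning: zero out the smallest-magnitude
    weights, keeping only the largest (1 - sparsity) fraction."""
    threshold = np.quantile(np.abs(W), sparsity)
    return W * (np.abs(W) > threshold)

def sparse_matvec(W, x):
    """Multiply using only the nonzero entries of W: multiplications by
    zero are skipped entirely, which is where the energy saving comes from."""
    y = np.zeros(W.shape[0])
    for r, c in zip(*np.nonzero(W)):
        y[r] += W[r, c] * x[c]          # only the surviving ~10% of MACs run
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
x = rng.normal(size=64)
Wp = prune_by_magnitude(W, 0.9)
assert np.allclose(sparse_matvec(Wp, x), Wp @ x)   # same answer, far fewer MACs
```

In real hardware the saving comes from a compressed sparse storage format rather than a Python loop, but the arithmetic being skipped is the same.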





Yoshua Bengio:

That's another subtle thing: those numbers reported in the press, the ones with huge footprints, are mostly due to hyperparameter optimization, searching the space of architectures and hyperparameters. And that is like 1,000 times more expensive than training a single network. So if you're in academia, like me, and you don't have access to large computing power, you rely on human brains to do the search, and that is much more efficient. You don't have as much computing power, but the students doing the experiments have done many experiments in the past, and they know how to explore, so they find good solutions. Whereas the methods we currently use for exploring the space of architectures are more like brute force, and that's super expensive.





Song Han:

Yeah, totally agree. When I joined MIT last year, we had only eight GPU cards; there was no way my students could do neural architecture search by brute force. So they had to combine human intelligence with machine intelligence (to prune the space) for architecture search. And as a result, we can do the search in a more cost-efficient way.





Yoshua Bengio:

That's great.





Advice for Young Researchers





Song Han:

Thank you. Alright, so lastly: as an AI pioneer of many decades, what is your advice for the younger generation on future directions?





Yoshua Bengio:

So, one thing I find sad in the current culture of students and researchers in machine learning and AI is that they're very stressed, very anxious. There's a lot of competition. But the best science is not done in those conditions. The best science is done when you're thinking long term, when you have time to really ponder and brainstorm and try things and let the ideas evolve. Instead, we are currently in a sort of rush, preparing something for the next deadline and the next deadline; every two or three months, we have another deadline. I don't think that's good for the field. And it's not even good psychologically, because you're always stressed.





So, my suggestion is to step back, to think about more ambitious goals and hard questions, rather than what you can do in the next few weeks or for the next deadline. And listen to your intuitions. And also, be open with your ideas: share your ideas, talk about them, even if they're not published yet. Don't be afraid of having other people steal your idea. It's much more profitable to engage positively with others, for psychological reasons but also in terms of scientific productivity, than to keep things to ourselves and be secretive. It just doesn't work.





Song Han:

Totally agree. All right, thanks so much for sharing both the technical side and the advice to the younger generation. I really appreciate your thoughts.





Yoshua Bengio:

Thank you for the questions.