What if every time you learned something new, you forgot a little of what you knew before? That sort of overwriting doesn’t happen in the human brain, but it does in artificial neural networks. It’s appropriately called catastrophic forgetting. So why are neural networks so successful despite this? How does this affect the future of things like self-driving cars? Just what limit does this put on what neural networks will be able to do, and what’s being done about it?

Numerical weights in an artificial neural network Neurons in the brain

The way a neural network stores knowledge is by setting the values of weights (the lines in between the neurons in the diagram). That’s what those lines literally are, just numbers assigned to pairs of neurons. They’re analogous to the axons in our brain, the long tendrils that reach out from one neuron to the dendrites of another neuron, where they meet at microscopic gaps called synapses. The value of the weight between two artificial neurons is roughly like the number of axons between biological neurons in the brain.

To understand the problem, and the solutions below, you need to know a little more detail.

To train a neural network to recognize objects in images, for example, you find a dataset containing thousands of images. One-by-one you show each image to the input neurons at one end of the network, and make small adjustments to all the weights such that an output neuron begins to represent an object in the image.

That’s then repeated for all of the thousands of images in the dataset. And then the whole dataset is run through, again, and again, thousands of times until individual outputs strongly represent specific objects in the images, i.e. the network has learned to recognize the particular objects in those images. All of that can take hours or weeks to do, depending on the speed of the hardware and the size of the network.

But what happens if you want to train it on a new set of images? The instant you start going through that process with new images, you start overwriting those weights with new values that no longer represent the values you had for the previous dataset of images. The network starts forgetting.

This doesn’t happen in the brain and no one’s certain why not.

Minimizing The Problem

Some networks minimize this problem. The diagram shows a simplified version of Google’s Inception neural network, for example. This neural network is trained for recognizing objects in images. In the diagram, all of the layers except for the final one, the one on the right, have been trained to understand features that make up images. Layers more to the left, nearer the input, have learned about simple features such as lines and curves. Layers deeper in have built on that to learn shapes made up of those lines and curves. Layers still deeper have learned about eyes, wheels and animal legs. It’s only the final layer that builds on that to learn about specific objects.

And so when retraining with new images and new objects, only the final layer needs to be retrained. It’ll still forget the objects it knew before, but at least we don’t have to retrain the entire network. Google actually lets you do this with their Inception neural network using a tutorial on their TensorFlow website.

Unfortunately, for most neural networks, you do have to retrain the entire network.

Does It Matter?

If networks forget so easily, why hasn’t this been a problem? There are a few reasons.

Take self-driving cars, for example. Neural networks in self-driving cars can recognize traffic signs. But what if a new type of traffic sign is introduced? Well, the training of these networks isn’t done in the car. Instead the training is done at some facility with fast computers with multiple GPUs. (We talked about GPUs for neural networks in this article.)

Since such fast hardware is available, the new traffic sign can be added to the complete dataset and the network can be retrained from scratch. The network is then transmitted to the cars over the internet as an update. Making use of a trained network requires nowhere near the computational speed of training a network. To recognize an object involves just a single pass through the network. Compare that to the training we described above with the thousands of iterations through a dataset.

What about a more immediate problem, such as a new type of vehicle on the road? In that case, the car already has sensors for detecting objects and avoiding them. It either doesn’t need to recognize the new type of vehicle or can wait for an update.

A lot of neural networks are not even located at the place where their knowledge is used. We’re talking about appliances like Alexa. When you ask it a question, the audio for that question can be transmitted to a location where a neural network does the speech recognition. If retraining is needed, it can be done without the consumer’s device being involved at all.

And many neural networks simply never need to be retrained. Like most tools or appliances, once built, they simply continue performing their function.

What Has Been Done To Eliminate Forgetting?

Luckily, most companies are in business to make a buck in the short to medium term. That usually means neural networks with narrow purposes. Where it causes problems is when a neural network needs to constantly be learning to solve novel problems. That’s the case with Artificial General Intelligence (AGI).

Very few companies are tackling AGI. Back in February of 2016, researchers at Facebook AI Research released a paper wherein they gave a Roadmap towards Machine Intelligence, but it detailed only an environment for training an AGI, not how the AGI would be implemented.

Google’s DeepMind has repeatedly stated that their goal is to produce an AGI. In December 2016 they uploaded a paper called “Overcoming catastrophic forgetting in neural networks”. After citing previous research, they then cite research in mouse brains that shows that when learning a new skill, the volume of dendritic spines increases. Basically that means the old skills may be protected by the synapses becoming less plastic, less changeable.

They then go on to detail their analogous approach to this synaptic activity which they call Elastic Weight Consolidation (EWC). In a nutshell, they slow down the modification of weights that are important to already learned things. That way, weights that aren’t as important to anything that’s already been learned are favored for new things.

They test their algorithm on handwriting recognition and more interestingly, on a neural network you may have heard of. It was the network that was in the news back in 2015 that learned how to play different Atari games, some at a superhuman level. A neural network that can skillfully play Breakout, Pong, Space Invaders and others sounds like a general purpose AI already. However, what was missing from the news was that it could be trained to play only one a time. If it was trained to play Breakout, to then play Pong it had to be retrained, forgetting how to play Breakout in the meantime.

But with the new EWC algorithm, it was simultaneously trained on ten games at a time, randomly chosen from a pool of nineteen possible games. Well, not completely simultaneously. It learned one for a while, then switched to another, and so on, just as a human would do. But in the end, the neural network was trained on all ten games. The games were then played to see how well it could play them. This training and then testing of ten random games at a time was repeated such that all nineteen possible games had a chance to be trained.

A sample of the resulting charts taken from their paper is shown here. Click on the charts to see the full nineteen games. The Y-axis shows the game scores as the games are played. Nine of the nineteen games that were learned using the EWC algorithm could play them as well as when only a single game is trained. As a control, the simultaneous training was also done using a normal training algorithm that was subject to catastrophic forgetting (Stochastic Gradient Descent, SGD). The remaining ten games did slightly better or as poorly as the SGD algorithm.

But for a problem that’s been tackled very little over the years, it’s a good start. Given DeepMind’s record, they’re very likely to make big improvements with it. And that of course will spur others on to solving this mostly neglected problem.

So be happy you’re still a biological human who can remember the resistor color codes, and don’t be in a big rush to jump into a silicon brain just yet.