Computer vision is now a part of everyday life. Facebook recognizes faces in the photos you post to the popular social network. The Google Photos app can find images buried in your collection, identifying everything from dogs to birthday parties to gravestones. Twitter can pinpoint pornographic images without help from human curators.

All of this "seeing" stems from a remarkably effective breed of artificial intelligence called deep learning. But as far as this much-hyped technology has come in recent years, a new experiment from Microsoft Research shows it's only getting started. Deep learning can go so much deeper.

This revolution in computer vision was a long time coming. A key turning point came in 2012, when artificial intelligence researchers from the University of Toronto won a competition called ImageNet. ImageNet pits machines against each other in an image recognition contest—which computer can identify cats or cars or clouds more accurately?—and that year, the Toronto team, including researcher Alex Krizhevsky and professor Geoff Hinton, topped the contest using deep neural nets, a technology that learns to identify images by examining enormous numbers of them, rather than identifying images according to rules diligently hand-coded by humans.

Toronto's win provided a roadmap for the future of deep learning. In the years since, the biggest names on the 'net—including Facebook, Google, Twitter, and Microsoft—have used similar tech to build computer vision systems that can match and even surpass humans. "We can't claim that our system 'sees' like a person does," says Peter Lee, the head of research at Microsoft. "But what we can say is that for very specific, narrowly defined tasks, we can learn to be as good as humans."

Roughly speaking, neural nets use hardware and software to approximate the web of neurons in the human brain. This idea dates to the 1980s, but in 2012, Krizhevsky and Hinton advanced the technology by running their neural nets atop graphics processing units, or GPUs. These specialized chips were originally designed to render images for games and other highly graphical software, but as it turns out, they're also suited to the kind of math that drives neural nets. Google, Facebook, Twitter, Microsoft, and so many others now use GPU-powered AI to handle image recognition and so many other tasks, from Internet search to security. Krizhevsky and Hinton joined the staff at Google.

Now, the latest ImageNet winner is pointing to what could be another step in the evolution of computer vision—and the wider field of artificial intelligence. Last month, a team of Microsoft researchers took the ImageNet crown using a new approach they call a deep residual network. The name doesn't quite describe it. They've designed a neural net that's significantly more complex than typical designs—one that spans 152 layers of mathematical operations, compared to the typical six or seven. It shows that, in the years to come, companies like Microsoft will be able to use vast clusters of GPUs and other specialized chips to significantly improve not only image recognition but other AI services, including systems that recognize speech and even understand language as we humans naturally speak it.

In other words, deep learning is nowhere close to reaching its potential. "We're staring at a huge design space," Lee says, "trying to figure out where to go next."

Layers of Neurons

Deep neural networks are arranged in layers. Each layer is a different set of mathematical operations—aka algorithms. The output of one layer becomes the input of the next. Loosely speaking, if a neural network is designed for image recognition, one layer will look for a particular set of features in an image—edges or angles or shapes or textures or the like—and the next will look for another set. These layers are what make these neural networks deep. "Generally speaking, if you make these networks deeper, it becomes easier for them to learn," says Alex Berg, a researcher at the University of North Carolina who helps oversee the ImageNet competition.
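
The layered arrangement described above can be sketched in a few lines of Python. This is a toy illustration only, not the Microsoft or Toronto system: the names `make_layer` and `forward`, the weights, and the choice of a ReLU step are all assumptions made for the example. It shows nothing more than the output of one layer becoming the input of the next.

```python
from typing import Callable, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(weights: List[Vector], biases: Vector) -> Layer:
    """Build one hypothetical layer: a weighted sum per output unit, then ReLU."""
    def layer(inputs: Vector) -> Vector:
        outputs = []
        for row, bias in zip(weights, biases):
            total = sum(w * x for w, x in zip(row, inputs)) + bias
            outputs.append(max(0.0, total))  # ReLU keeps only positive responses
        return outputs
    return layer


def forward(layers: List[Layer], inputs: Vector) -> Vector:
    """Feed the input through every layer in order."""
    activations = inputs
    for layer in layers:
        # The output of one layer becomes the input of the next.
        activations = layer(activations)
    return activations


# Illustrative weights, chosen by hand rather than learned from images.
layer_one = make_layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0])
layer_two = make_layer([[1.0, 1.0]], [-1.0])
network = [layer_one, layer_two]

result = forward(network, [3.0, 1.0])
print(result)
```

A "deeper" network in this sketch is simply a longer `network` list. In a real system the weights are learned from training images rather than written by hand.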

Today, a typical neural network includes six or seven layers. Some might extend to 20 or even 30. But the Microsoft team, led by researcher Jian Sun, just expanded that to 152. In essence, this neural net is better at recognizing images because it can examine more features. "There is a lot more subtlety that can be learned," Lee says.

In the past, according to Lee and researchers outside of Microsoft, this sort of very deep neural net wasn't feasible. Part of the problem was that as your mathematical signal moved from layer to layer, it became diluted and tended to fade. As Lee explains, Microsoft solved this problem by building a neural net that skips certain layers when it doesn't need them, but uses them when it does. "When you do this kind of skipping, you're able to preserve the strength of the signal much further," Lee says, "and this is turning out to have a tremendous, beneficial impact on accuracy."
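
A rough way to see why skipping helps is to compare a plain stack of layers with one where each layer's input is added back to its output, which is how residual connections are commonly described. The sketch below is a deliberately simplified scalar toy, not Microsoft's network: `weak_layer`, the weight of 0.01, and the depth of 20 are assumptions chosen only to make the fading visible.

```python
def weak_layer(x: float, weight: float) -> float:
    """Hypothetical layer that contributes only a small transformation."""
    return weight * x


def plain_stack(x: float, depth: int, weight: float) -> float:
    """Each layer replaces the signal with its own output."""
    for _ in range(depth):
        x = weak_layer(x, weight)
    return x


def residual_stack(x: float, depth: int, weight: float) -> float:
    """Each layer adds its output to a copy of its input (the skip path)."""
    for _ in range(depth):
        x = x + weak_layer(x, weight)
    return x


signal = 1.0
plain_out = plain_stack(signal, depth=20, weight=0.01)
residual_out = residual_stack(signal, depth=20, weight=0.01)
# When a layer has nothing to add (weight 0), the skip path passes the input through intact.
untouched = residual_stack(signal, depth=20, weight=0.0)

print(plain_out, residual_out, untouched)
```

In the plain stack the signal is multiplied by 0.01 at every layer and is effectively gone after 20 of them. With the skip path, the original signal survives and each layer contributes only a small adjustment on top of it.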

Berg says that this is a notable departure from previous systems, and he believes that other companies and researchers will follow suit.

Deep Difficulty

The other issue is that constructing this kind of mega-neural net is tremendously difficult. Landing on a particular set of algorithms—determining how each layer should operate and how it should talk to the next layer—is an almost epic task. But Microsoft has a trick here, too. It has designed a computing system that can help build these networks.

As Jian Sun explains it, researchers can identify a promising arrangement for massive neural networks, and then the system can cycle through a range of similar possibilities until it settles on the best one. "In most cases, after a number of tries, the researchers learn [something], reflect, and make a new decision on the next try," he says. "You can view this as 'human-assisted search.'"

According to Adam Gibson, the chief researcher at deep learning startup Skymind, this kind of thing is getting more common. It's called "hyperparameter optimization." "People can just spin up a cluster [of machines], run 10 models at once, find out which one works best and use that," Gibson says. "They can input some baseline parameter, based on intuition, and the machines kind of home in on what the best solution is." As Gibson notes, last year Twitter acquired a company, Whetlab, that offers similar ways of "optimizing" neural networks.
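
The workflow Gibson describes (start from a baseline, try about 10 variations at once, keep the best) resembles a simple random search. The sketch below is one possible reading of it, not Skymind's, Whetlab's, or Microsoft's tooling. `train_and_score` is a made-up stand-in for actually training a model, and its scoring formula, the parameter names, and the sampling ranges are all assumptions for illustration.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_and_score(config: dict) -> float:
    """Hypothetical stand-in for training a model and measuring its accuracy.

    The formula is invented: it simply rewards settings near
    learning_rate=0.01 and depth=30 so the search has something to find.
    """
    lr_penalty = abs(config["learning_rate"] - 0.01) * 10
    depth_penalty = abs(config["depth"] - 30) / 100
    return 1.0 - lr_penalty - depth_penalty


def sample_near(baseline: dict, rng: random.Random) -> dict:
    """Draw a variation of the researcher's intuition-based baseline."""
    return {
        "learning_rate": baseline["learning_rate"] * rng.uniform(0.5, 2.0),
        "depth": max(1, baseline["depth"] + rng.randint(-10, 10)),
    }


def search(baseline: dict, n_models: int = 10, seed: int = 0):
    """Score the baseline plus random variations in parallel; return the best."""
    rng = random.Random(seed)
    candidates = [baseline] + [sample_near(baseline, rng) for _ in range(n_models - 1)]
    # Threads stand in for the cluster of machines in the quote.
    with ThreadPoolExecutor(max_workers=n_models) as pool:
        scores = list(pool.map(train_and_score, candidates))
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores[best_index], scores


baseline = {"learning_rate": 0.02, "depth": 20}
best_config, best_score, all_scores = search(baseline)
print(best_config, best_score)
```

Because the baseline itself is one of the candidates, the result can never be worse than the starting guess. More sophisticated tools pick each new candidate based on earlier results instead of sampling blindly.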

'A Hardware Problem'

As Peter Lee and Jian Sun describe it, such an approach isn't exactly "brute forcing" the problem. "With very very large amounts of compute resources, one could fantasize about a gigantic 'natural selection' setup where evolutionary forces help direct a brute-force search through a huge space of possibilities," Lee says. "The world doesn't have those computing resources available for such a thing...For now, we will still depend on really smart researchers like Jian."

But Lee does say that, thanks to new techniques and computer data centers filled with GPU machines, the realm of possibilities for deep learning is enormous. A big part of the company's task is just finding the time and the computing power needed to explore these possibilities. "This work has dramatically exploded the design space. The amount of ground to cover, in terms of scientific investigation, has become exponentially larger," Lee says. And this extends well beyond image recognition, into speech recognition, natural language understanding, and other tasks.

As Lee explains, that's one reason Microsoft is not only pushing to improve the power of its GPU clusters, but exploring the use of other specialized processors, including FPGAs, chips that can be programmed for particular tasks, such as deep learning. "There has also been an explosion in demand for much more experimental hardware platforms from our researchers," he says. And this work is sending ripples across the wider world of tech and artificial intelligence. This past summer, in its largest ever acquisition deal, Intel agreed to buy Altera, which specializes in FPGAs.

Indeed, Gibson says that deep learning has become more of "a hardware problem." Yes, we still need top researchers to guide the creation of neural networks, but more and more, finding new paths is a matter of brute-forcing new algorithms across ever more powerful collections of hardware. As Gibson points out, though these deep neural nets work extremely well, we don't quite know why they work. The trick lies in finding the complex combination of algorithms that works best. More and better hardware can shorten the path.

The end result is that the companies that can build the most powerful networks of hardware are the companies that will come out ahead. That would be Google and Facebook and Microsoft. Those that are good at deep learning today will only get better.