These days, most announcements by tech companies are pretty meh. Details either leak months ahead of time or reveal themselves to be unimpressive. But lately, we've had some real surprises. Months ahead of releasing the Switch this spring, Nintendo decided the future of consoles was its past with the NES Classic (any pixel power watchers be damned). And when Google's AI-powered AlphaGo defeated Lee Se-dol in a best-of-five Go competition, that victory ran counter to experts who believed such results were at least a decade away.

Amazon's December 2016 announcement of Amazon Go—a retail store where you could simply walk in, grab items, and walk out—was another shocker in that AlphaGo vein. Grab-and-go has been the "future of retail" and "just a few years away" for a while. I worked in robotics research for over a decade at Caltech, Stanford, and Berkeley, and now I run a startup making outdoor home security cameras. Computer vision has made up a lot of my work in recent years. Yet just a few months before the Amazon announcement, I confidently told someone that it would take a few more years to bring a grab-and-go retail experience to consumers. And I wasn't alone in thinking this way; Planet Money had an entire episode on self-checkout just two months earlier.

So when Amazon surprised us all by actually building the thing, the first question was obvious: how will it work? The launch video drops buzzwords like computer vision, deep learning, and sensor fusion. But what did all that mean, and how would you actually put these things together?

I’ll start by killing the suspense: I don’t actually know. I'm not directly involved in these Amazon efforts, and the company hasn’t publicly stated how its technology works. But given my work and research in the area of computer vision, it's possible to make some very educated guesses. And at its core, Amazon Go clearly seems like a product of the same fundamental advances in artificial intelligence, computer vision, and automated decision-making that are behind AlphaGo and the sudden explosion of self-driving cars. Advances in statistics and parallel computing these last five years in particular have created an inflection point in machine intelligence.

Those advances are the big reasons why these surprise autonomous breakthroughs have been coming in waves—and why you may be letting a car drive you to grab a carton of milk, sans human interaction, much earlier than anyone thought.

The cart

To understand what's behind an ecosystem like Amazon Go, the best place to start is to specify the problem. For a grocery store, there is one primary question Amazon needs to answer: when a customer leaves the store, what are they taking with them? To put it another way, what's in the shopping cart?

When you boil it down, there are only two ways to answer. Amazon needs to either look in the cart when a customer leaves or keep track of what's going into the cart. The first approach is what we call a checkout line, and it's what most retail stores do today (examine the customer's items as they leave, all at once). The other approach I'll call the bar tab. Like a bartender who keeps a tab of every drink a customer orders, a business can figure out what's in a customer's cart by keeping track of what goes in or comes out. If done perfectly, you'll know exactly what's in a customer's cart without ever forcing the customer to show their items.
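The bar-tab idea is simple enough to sketch in a few lines of code. This is purely illustrative—a toy ledger I've invented for this article, not anything Amazon has described. It just shows the bookkeeping: every pick-up and put-back updates a per-shopper tally, so the "checkout" at the exit is a no-op lookup rather than an inspection.

```python
from collections import Counter

class BarTab:
    """Toy 'bar tab' ledger: track each shopper's cart by events,
    never by inspecting the cart at the exit. (Illustrative sketch
    only; nothing here reflects Amazon's actual system.)"""

    def __init__(self):
        self.carts = {}  # shopper id -> Counter of items

    def pick_up(self, shopper, item):
        self.carts.setdefault(shopper, Counter())[item] += 1

    def put_back(self, shopper, item):
        cart = self.carts.get(shopper, Counter())
        if cart[item] > 0:
            cart[item] -= 1

    def walk_out(self, shopper):
        # The "checkout": the ledger already knows the answer.
        # Unary + on a Counter drops zero counts.
        return +self.carts.pop(shopper, Counter())
```

If a shopper grabs milk and a soda, then puts the soda back, `walk_out` charges them for just the milk. The hard part, of course, isn't this bookkeeping—it's generating those pick-up and put-back events reliably from camera data.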

Of course, Amazon Go is no ordinary grocery store. The system aims not only to know what's in any given cart but also whom to charge. In a world without cashiers or checkout lines, you must track customer identity, too.

So how will Amazon do all this? How will the company keep track of people in the store, and what they're picking up and putting down, without making mistakes? The answer obviously starts with cameras. They're unobtrusive and cheap, and you can put them everywhere. Amazon told us as much by mentioning computer vision in its video. But how do you make sense of what the cameras see and use that to track shoppers and their actions? That's where the second set of buzzwords comes in: deep learning.

Neurons to nibbles

The idea of using cameras in the checkout process has been around for a long time, but until now it was mostly just that—an idea.

Previously, vision algorithms generally worked by finding salient features in an image and gathering those features into collections that represented objects. An image could be processed to extract lines, corners, and edges. If you see four lines and four corners in a particular pattern, you've found a square (or a rectangle). The same basic principles could be used to identify and track much more complicated objects using more complex features and collections. The sophistication of the vision algorithms depended on the sophistication of the features and on the techniques used to recognize particular collections of features as objects.

So for a long time, the most exciting progress in computer vision and machine learning came by way of researchers inventing better and more complex features. Lines and corners gave way to wavelets and Gaussian blurs, yielding features with esoteric names like SIFT and SURF. For a while, the best feature for identifying humans in images was named HOG. But it quickly became obvious that meticulously crafting features by hand was hitting limits.
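To give a flavor of what such a hand-crafted feature looks like, here is a stripped-down, HOG-flavored sketch: compute image gradients with finite differences and bin their orientations into a histogram, weighted by gradient strength. Real HOG adds cells, blocks, and normalization; this toy version only conveys the core idea.

```python
import math

def orientation_histogram(img, bins=8):
    """Toy HOG-style feature: a histogram of gradient orientations,
    weighted by gradient magnitude. `img` is a 2D list of grayscale
    values. (A sketch of the idea only; real HOG descriptors add
    spatial cells, overlapping blocks, and contrast normalization.)"""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]   # horizontal gradient
            gy = img[y + 1][x] - img[y - 1][x]   # vertical gradient
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Map the angle [0, 2*pi) into one of `bins` buckets.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            hist[min(int(angle / (2 * math.pi) * bins), bins - 1)] += mag
    return hist
```

Feed it an image containing a vertical edge and all the energy lands in the horizontal-gradient bucket—exactly the kind of regularity a feature designer would then hand-wire a detector around.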

These feature-based algorithms were fantastic at recognizing a specific thing they'd seen before. Show an algorithm images of a pack of Coca-Cola, and it becomes the world's leading expert on recognizing six-packs of Coke. But these algorithms were still terrible at generalizing; it was much harder to have one recognize sodas overall, let alone the larger world of drinks.

Even worse, systems built like this were brittle and hard to improve. Correcting mistakes often required careful, painstaking tweaking of the logic by a team of PhDs who were the only ones who understood the algorithm. For a grocery store, you might not care if your cameras were smart enough to recognize that a bottle of Coke and a bottle of Pepsi were both sodas, but you'd certainly care if they mistook bottles of $20 wine for $2 sodas.

Today's deep learning abilities are an explicit rejection of this kind of hand-tuned feature finding. Instead of trying to come up with features by hand, you use massive amounts of data to train a neural network. Given examples of what it's supposed to recognize, the network comes up with its own features. Low-level neurons learn to recognize simple features like lines, and their outputs feed upward into neurons that combine them into more complex features like shapes, in a stacked, hierarchical architecture.

There's no need to specify which features the neurons recognize; they simply emerge as part of the training process. The neurons figure out the best patterns to be sensitive to. If you were trying to create a system to recognize sodas, you'd show it tens of thousands of pictures of sodas, and it would figure out for itself how to go from lines and curves to shapes and then to boxes and bottles and so on.

It is conceptually the same way that our brains work, and correcting errors also happens in a more human fashion: by example. If your neural network were having trouble with wine and soda, you could correct it by finding a few thousand more examples of each in those circumstances and training the network on them. It would figure out for itself how to tell the difference.
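The learning-by-example loop can be shown at its smallest possible scale: a single artificial neuron trained by gradient descent. Everything here is a toy of my own construction—the two input "features" (think of them as stand-ins for measured properties like apparent price and bottle shape) are hypothetical, and a real deep network stacks millions of such units. But the workflow is the point: no rules are written; the weights are adjusted from labeled examples until the mistakes go away.

```python
import math

def train_neuron(examples, epochs=200, lr=0.1):
    """Train a single logistic neuron purely from labeled examples --
    no hand-written rules. Each example is (features, label), with
    label 1 or 0. A toy illustration of 'correcting by example',
    nothing like a production deep network."""
    n = len(examples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1 / (1 + math.exp(-z))  # sigmoid activation
            err = p - y                 # how wrong were we?
            # Nudge each weight to shrink the error (gradient descent).
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Classify a feature vector with the trained neuron."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if 1 / (1 + math.exp(-z)) > 0.5 else 0
```

If the neuron misclassifies a category, the fix is exactly what the text describes: add more labeled examples of that category to `examples` and retrain—no PhD-level surgery on the logic required.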

Using software to simulate neurons has been around for decades, but applying it to vision was always more of a theoretical curiosity than a practical approach. To mimic anything like an animal's vision, you'd need anywhere from dozens to hundreds of layers of neurons, with tens of thousands of neurons in each layer. In a fully connected network, each pair of adjacent layers needs as many connections as the product of their sizes, so the totals balloon quickly. A lot of computing power would be needed to run these networks, and even more data would be needed to train them.

Making a simulated network that can run in a reasonable amount of time requires fine-tuning the structure of the network to minimize the number of interconnections. But even then, a lot of horsepower is needed.
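The arithmetic behind that horsepower problem is worth making concrete. For a fully connected network, adjacent layers of sizes n1 and n2 are joined by n1 × n2 connections, so even a modest network by the standards described above carries billions of weights. The layer sizes below are illustrative numbers chosen to match the article's "tens of thousands of neurons, dozens of layers" ballpark, not any real architecture.

```python
def fully_connected_weights(layer_sizes):
    """Count the weights in a fully connected network: each adjacent
    pair of layers with n1 and n2 neurons contributes n1 * n2
    connections (bias terms ignored for simplicity)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# A modest network by the article's standards:
# 20 layers of 10,000 neurons each.
layers = [10_000] * 20
print(fully_connected_weights(layers))  # 1,900,000,000 connections
```

Nineteen layer pairs times a hundred million connections each is 1.9 billion weights—every one of which must be evaluated on each image and adjusted during training. That multiplication is why structures like convolutional layers, which share and prune connections, became essential to making vision networks practical.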