It’s hard to measure water from a fire hose while it’s hitting you in the face. In a sense, that’s the challenge of analyzing streaming data, which comes at us in a torrent and never lets up. If you’re on Twitter watching tweets go by, you might like to declare a brief pause, so you can figure out what’s trending. That’s not feasible, though, so instead you need to find a way to tally hashtags on the fly.

Computer programs that perform these kinds of on-the-go calculations are called streaming algorithms. Because data comes at them continuously, and in such volume, they try to record the essence of what they’ve seen while strategically forgetting the rest. For more than 30 years computer scientists have worked to build a better streaming algorithm. Last fall a team of researchers invented one that is just about perfect.

“We developed a new algorithm that is simultaneously the best” on every performance dimension, said Jelani Nelson, a computer scientist at Harvard University and a co-author of the work with Kasper Green Larsen of Aarhus University in Denmark, Huy Nguyen of Northeastern University and Mikkel Thorup of the University of Copenhagen.

This best-in-class streaming algorithm works by remembering just enough of what it’s seen to tell you what it’s seen most frequently. It suggests that compromises that seemed intrinsic to the analysis of streaming data are not actually necessary. It also points the way forward to a new era of strategic forgetting.

Trend Spotting

Streaming algorithms are helpful in any situation where you’re monitoring a database that’s being updated continuously. This could be AT&T keeping tabs on data packets or Google charting the never-ending flow of search queries. In these situations it’s useful, even necessary, to have a method for answering real-time questions about the data without re-examining or even remembering every piece of data you’ve ever seen.

Here’s a simple example. Imagine you have a continuous stream of numbers and you want to know the sum of all the numbers you’ve seen so far. In this case it’s obvious that instead of remembering every number, you can get by with remembering just one: the running sum.
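The running-sum idea can be sketched in a few lines of Python. This is an illustration of the general point, not code from the paper: the entire "memory" of the stream is a single number.

```python
def stream_sum(stream):
    """Consume a stream of numbers while storing only one value:
    the running total. Memory use is constant no matter how long
    the stream runs."""
    total = 0
    for x in stream:
        total += x  # fold each arriving number into the summary
    return total

print(stream_sum([3, 1, 4, 1, 5]))  # prints 14
```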

The challenge gets harder, though, when the questions you want to ask about your data get more complicated. Imagine that instead of calculating the sum, you want to be able to answer the following question: Which numbers have appeared most frequently? It’s less obvious what kind of shortcut you could use to keep an answer at the ready.

This particular puzzle is known as the “frequent items” or “heavy hitters” problem. The first algorithm to solve it was developed in the early 1980s by David Gries of Cornell University and Jayadev Misra of the University of Texas, Austin. Their program was effective in a number of ways, but it couldn’t handle what’s called “change detection.” It could tell you the most frequently searched terms, but not which terms are trending. In Google’s case, it could identify “Wikipedia” as an ever-popular search term, but it couldn’t find the spike in searches that accompany a major event such as Hurricane Irma.
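The core idea behind the Misra-Gries approach can be sketched as follows. This is a standard textbook rendering rather than their original code: keep at most k − 1 candidate counters, and when a new item arrives with no free counter, decrement everything. Any item appearing more than n/k times in a stream of length n is guaranteed to survive.

```python
def misra_gries(stream, k):
    """Track up to k-1 candidate heavy hitters in a single pass.
    Any item occurring more than n/k times in a stream of n items
    is guaranteed to end up among the surviving counters."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1  # a free counter slot is available
        else:
            # No room: decrement every counter, dropping any that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

# "a" appears 6 times in 10 items (> n/k for k=2), so it survives.
print(misra_gries(["a", "b", "a", "c", "a", "d", "a", "e", "a", "a"], 2))
```

Note that the surviving counts are undercounts, not exact frequencies; a second pass over the data (impossible in a true streaming setting) would be needed to verify them. That limitation is part of what later work, including change detection, had to overcome.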

Photo: Jelani Nelson, a theoretical computer scientist at Harvard University, co-developed the new algorithm. Credit: Yaphet Teklu


“It’s a coding problem—you’re encoding information down to a compact summary and trying to extract information that lets you recover what was put in initially,” said Graham Cormode, a computer scientist at the University of Warwick.

Over the next 30-plus years, Cormode and other computer scientists improved Gries and Misra’s algorithm. Some of the new algorithms were able to detect trending terms, for example, while others were able to work with a more fine-grained definition of what it means for a term to be frequent. All those algorithms made trade-offs, like sacrificing speed for accuracy or memory consumption for reliability.
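One widely used summary from this line of work is the count-min sketch, introduced by Cormode and S. Muthukrishnan, which illustrates the trade-off directly: it answers frequency queries in fixed memory, at the cost of a small, tunable overestimate. The sketch below is a minimal illustration, not production code; the width and depth parameters control the accuracy-versus-memory trade-off.

```python
import random

class CountMinSketch:
    """Approximate frequency counter in fixed memory.
    Estimates never undercount; they can overcount slightly when
    distinct items collide in every hash row, which becomes rarer
    as width and depth grow."""

    def __init__(self, width=1000, depth=5, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One random salt per row gives depth independent-ish hash functions.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]

    def add(self, item):
        for row, salt in enumerate(self.salts):
            self.table[row][hash((salt, item)) % self.width] += 1

    def estimate(self, item):
        # Collisions only inflate cells, so the minimum over rows
        # is the least-contaminated estimate.
        return min(self.table[row][hash((salt, item)) % self.width]
                   for row, salt in enumerate(self.salts))
```

A quick usage example: after adding "wikipedia" 100 times amid other traffic, `estimate("wikipedia")` returns at least 100, and usually exactly 100. The sketch's memory footprint never grows with the stream, which is exactly the kind of guarantee streaming algorithms trade accuracy for.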