Streaming Models both Old and New



Streaming models were invented decades ago, and they continue to become even more exciting



Ian Munro is one of the world’s experts on data structures. If you need to know anything about data structures, either lower bounds or upper bounds, he is one of the few people I would call for advice. If he does not know, or he does not know who knows, or he cannot suggest an approach, you are in deep trouble.

Today I will present some work that Atish Das Sarma and Danupon Nanongkai have recently done on a variation of the streaming model, which was just presented at TAMC 2009. Ian also wrote one of the first papers on streaming, before there was streaming. More on this later.



Atish and Danupon are two graduate students who are working with me, but they are both quite independent and have already written papers with various colleagues from various places on various topics. How is that for a vague statement? In any event, I want to come clean, since I worked with them on their results.

Ian Munro is a long-time friend, and I have many stories about him, which could fill up an almost infinite number of posts.

Ian’s great sense of humor always makes it a pleasure to work with him. Whenever we have worked on a problem and gotten stuck, Ian has a trick that he uses to get us back on track. He says,

“let’s make a list of what we need to do.”

He then proceeds to take a piece of paper and writes on the paper:

1. Make a list.
2. …
3. …
4. …

where the remaining items are the detailed things that we need to do in order to solve the problem in question. For example, one such item might be: Find an algorithm that converts any $X$ into a $Y$ in linear time. He then takes the list and with his pen checks off the first item: now the list is,

1. Make a list. ✓
2. …
3. …
4. …

He then would say,

“well we can take a break, since we have done 25 per cent of the work.”

And off we go to get a drink, and hopefully also get an inspiration.

One of my other favorite Ian stories is about the time he got tenure. At STOC that year, he told a group of us that he had just gotten tenure. We all congratulated him—it was wonderful news. He then proceeded to point out that at Waterloo—where he was and still is—tenure is not official until the next trustee meeting. The trustees’ signing off on all the year’s tenure decisions was, of course, pro forma. But, it would not happen for about a month.

Ian then raised an interesting “open problem”: what could he do, in the next month, to get his tenure decision reversed? What behavior would undo tenure? Only Ian. It was a fun discussion, as we argued over which behaviors would lead to tenure being revoked, and which would not. Sometimes the group agreed that one behavior would lead to revocation and another would not; sometimes we were divided. Thankfully, Ian decided not to test the various hypotheses, and his tenure became official.

Let’s now turn to talking about the streaming model, and the work of Atish and Danupon.

The Streaming Model

The streaming model is both new and old. In computer science we sometimes create entirely new areas of study. However, more often than not, a hot new area is an old one in disguise—or an old one with a slight twist. I think this is the case with the current area that is called streaming. As you probably know, in the basic streaming model, data is sent one bit after another to the algorithm, which must compute the answer after the last bit has been sent.

So far the model does not restrict the algorithm at all—just have the algorithm store the input, and then run any method to solve the given problem. This strategy would make the streaming model trivial, so the rules of streaming are that the algorithm is always space limited. That is, if the stream is $n$ bits long, then the algorithm is only allowed to store $s$ bits, where $s$ is always much smaller than $n$. For example, $s$ might be restricted to be $\sqrt{n}$, or even to be polylog in $n$: that is, $s$ might have to be bounded by $(\log n)^{O(1)}$.
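As a concrete illustration of computing under such a space bound, here is a small Python sketch—my own example, not from any particular paper. The classic Misra–Gries summary estimates item frequencies in one pass while storing at most $k$ counters, no matter how long the stream is.

```python
def misra_gries(stream, k):
    """One-pass frequency summary that stores at most k counters.

    Guarantee: any item occurring more than n/(k+1) times in a stream
    of length n survives in the summary, and each counter undercounts
    its item's true frequency by at most n/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No room: decrement every counter and drop the zeros.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

summary = misra_gries([1, 1, 2, 1, 3, 1, 4], k=2)
# 1 occurs 4 > 7/3 times, so it must survive; here summary == {1: 3, 4: 1}
```

The point is only the space bound: the summary holds $k$ counters regardless of the stream's length—exactly the regime where $s$ is much smaller than $n$.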

The space restriction makes the model interesting, but does it make the model new? I think no and yes. The basic setup is nothing more than asking: what is the smallest number of states that are needed by a finite state automaton (FSA) to solve the given problem? You know I love FSAs, and have previously shown their power.

So what is new with streaming? Partly it is that streaming algorithms usually use randomness, but we could of course replace an FSA by a probabilistic finite state automaton. So that is not new either, since such models have been studied for decades.

I think streaming is new for several reasons—even if the basic model is not really new. The reasons are:

The motivation is new. The size of inputs has increased to the point where they are extremely large. So large that the basic streaming model is no longer a theoretic model, but has become a practical model. If the data you wish to process is huge, then you may be forced to use the streaming model. For example, the new accelerator at CERN will generate so much sensor data that they plan on using streaming algorithms—the amount of data created per second is so large that they really have no choice.

The class of problems is different. Studying FSAs involved very different problems, which were often artificial ones. Now the main problems are more “real” problems: graph problems, linear algebra problems, statistical problems, and many others.

The results are new. Many new and extremely clever ideas have been found by researchers working on the streaming model. The general streaming framework has been studied extensively since the seminal work of Noga Alon, Yossi Matias, and Mario Szegedy. Their beautiful results and new ideas excited the theory community tremendously, which led to their receiving the Gödel prize in 2005 for their work.

The best reason—in my opinion—is that the streaming model generalizes the notion of input that was previously used in the study of FSA’s. Streaming models allow more complex rules on how the data arrives to the algorithm. Let’s call this the order type of the streaming model. I will, in the next section, explain this in detail.
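To give the flavor of these clever ideas, here is a hedged Python sketch in the spirit of the Alon–Matias–Szegedy estimator for the second frequency moment $F_2$, the sum of the squared item frequencies. This is a deliberate simplification: a real implementation would use a 4-wise independent hash for the signs and combine many independent copies, whereas the toy version below draws the signs lazily.

```python
import random

def ams_f2_estimate(stream, seed=0):
    """Estimate F2 = sum of squared item frequencies in one pass.

    Each distinct item gets a random sign in {-1, +1}; we keep only
    the running signed sum z, and z**2 is an unbiased estimate of F2.
    (Drawing signs lazily into a dict stands in for a 4-wise
    independent hash function, which would need no per-item storage.)"""
    rng = random.Random(seed)
    signs = {}
    z = 0
    for x in stream:
        if x not in signs:
            signs[x] = rng.choice((-1, 1))
        z += signs[x]
    return z * z

est = ams_f2_estimate([5] * 10)
# One distinct item with frequency 10: the estimate is exactly 10**2 = 100,
# since z is +10 or -10 either way.
```

The single number $z$ is the whole state—another instance of a summary far smaller than the stream.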

Yet, already in 1978 Ian Munro and Mike Paterson had results on computing the median in a model that was essentially the streaming model. They proved:

Theorem: The median of $n$ numbers can be computed in $O(\sqrt{n})$ space with high probability in one pass over the numbers.

And,

Theorem: The median of $n$ numbers can be computed in $O(\sqrt{n}\log n)$ space deterministically with two passes over the numbers.
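The Munro–Paterson algorithms are too involved to reproduce here, but the pass/space trade-off they exploit can be illustrated with a much simpler—and much weaker—scheme of my own: binary-search on the median's value, using one counting pass per step. For integers in a known range $[lo, hi]$ this needs only a constant number of counters, at the price of $O(\log(hi - lo))$ passes.

```python
def median_by_counting_passes(read_stream, n, lo, hi):
    """Lower median of n integers in [lo, hi], using O(1) words of
    space per pass and O(log(hi - lo)) passes.

    Not the Munro-Paterson algorithm (which needs far fewer passes);
    just the simplest possible trade of passes for space: each pass
    merely counts, it never stores any part of the stream."""
    target = (n - 1) // 2            # 0-indexed rank of the lower median
    while lo < hi:
        mid = (lo + hi) // 2
        # One full pass over the stream: count elements <= mid.
        count = sum(1 for x in read_stream() if x <= mid)
        if count > target:
            hi = mid                 # the median is <= mid
        else:
            lo = mid + 1             # the median is > mid
    return lo

data = [7, 1, 9, 3, 5]
m = median_by_counting_passes(lambda: iter(data), len(data), 1, 9)
# m == 5
```

Note that `read_stream` is called once per pass—the stream may be replayed but never stored.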

So is streaming new? You decide.

Order Types of Streaming Models

In the classic FSA model the order of the data is set by the definition of the problem. I have already talked about BDD’s, which allow the input order to be selected so that the task of solving the problem is as easy as possible. Streaming models allow an even richer class of order types, which is the principal reason they are exciting.

To explain the notion of order type let us focus on problems that concern matrices. The general ideas apply to many other areas, but I think that the explanation will be easier if we focus first on matrices. Let $A$ be an $n \times n$ matrix of integer values. We assume that entries of the form $(i, j, v)$ make up the input: the triple $(i, j, v)$ means that $A_{i,j} = v$.

Here are some possible input order types:

Natural: This is an order that only depends on the input being a matrix. Some choices include row order and column order. Note the order has nothing to do with the problem being considered.

Best Oblivious: This is the order that is used by BDD’s. The entries of the matrix are sent in the manner that is best for the problem. However, the order can only depend on the problem, not on the actual matrix $A$. This is why the order is called “oblivious.”

Adversarial: This order allows an adversary to look at the input matrix and then decide what order to send the matrix entries to the algorithm. The order thus can be non-oblivious, and the adversary can try to make the input ordering as nasty as possible. This is the most popular, and standard order type for streaming models.

Random: This is just what you would guess: the entries of $A$ are sent in a random order.

Sorted: This order sends the entries of $A$ after they have been sorted according to some rule. For example, the order could be: send the entries sorted first by the value $v$, and then by the indices $(i, j)$.
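These order types are easy to make concrete in code. The Python sketch below—all names are mine—enumerates the $(i, j, v)$ triples of a matrix under several of the orders just described:

```python
import random

def row_order(A):
    """Natural order: row by row, each entry as a triple (i, j, value)."""
    for i, row in enumerate(A):
        for j, v in enumerate(row):
            yield (i, j, v)

def adversarial_order(A, key):
    """Adversarial order: the adversary may inspect A and emit the
    entries in any permutation, abstracted here as a sort key."""
    yield from sorted(row_order(A), key=key)

def random_order(A, seed=0):
    """Random order: a uniformly random permutation of the entries."""
    entries = list(row_order(A))
    random.Random(seed).shuffle(entries)
    yield from entries

def sorted_by_value(A):
    """Sorted order: entries sorted first by value v, then by (i, j)."""
    yield from sorted(row_order(A), key=lambda t: (t[2], t[0], t[1]))

A = [[3, 1],
     [2, 4]]
# row_order(A) yields (0,0,3), (0,1,1), (1,0,2), (1,1,4);
# sorted_by_value(A) yields (0,1,1), (1,0,2), (0,0,3), (1,1,4).
```

The same streaming algorithm can be trivial under one generator and hopeless under another—which is exactly the point of distinguishing the order types.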