A Trillion Dollar Math Trick



How linear algebra can make databases go really fast



Mike Stonebraker is one of the world’s leading experts on database technology. He started in academe at Berkeley, is now again in academe at MIT, and has launched a raft of successful companies. He is currently a co-founder and chief technology officer of at least three startups in the database area. One is called “Data Tamer” and is a joint venture with researchers from QCRI—the Qatar Computing Research Institute—see this release.

Today I would like to talk about his presentation at the TTI Vanguard meeting on “Ginormous Systems” in DC. His talk was on “Seven Big Data Megatrends.”

By the way, the word “Ginormous” is a real word; see here for the formal definition. I initially thought the Vanguard organizers had made up the word, but it is real. It should be obvious that ginormous means large, actually really large. This Vanguard meeting was dedicated to ginormous systems of all kinds: from huge data centers, to city-wide systems, to supercomputers, and much more.

In Mike’s wonderful talk he made seven points about the past, present, and future of database technology. He has a great track record, so his guesses are likely mostly right. One of his predictions was about a way of re-organizing databases that has several remarkable properties:

It speeds up database operations 50x. That is to say, on typical queries—ones that companies actually do—it is fifty times faster than classical database implementations. As theorists we like speedups, especially asymptotic ones. But 50x is pretty cool. That is enough to change a query from an hour to a minute.

It is not a new idea. But the time is finally right, and Mike predicts that future databases will use this method.

It is an idea whose inventor no one seems to know. I asked Mike, I asked other experts at the conference, and all shrugged and said, in effect: “I have no idea.” Curious.

Let’s look quickly at the way databases work, and then consider the trick.

Some Background

Modern databases store records—lots of them, usually on disks. A record is a fixed-size vector of information. The vector is divided into fields, where a field stores a type of information. An example is:

$$\mathsf{name} \mid \mathsf{address} \mid \mathsf{age} \mid \mathsf{cell\ phone} \mid \cdots$$

Thus the first field is the person’s name, the next the address, and so on.

In a sense the data is really stored in a table—or an array if you wish to be mathematical—call it $D$ for data. The rows contain the records, and the columns store the fields.

The issue is how the array is stored on the disk. Each record is stored one after the other on the disk. The records are stored as

$$r_1, r_2, r_3, \dots, r_m$$

Here each $r_i$ is the $i^{th}$ row.

This is a reasonable method: it keeps each record together, and it allows fast access to all of the records. Thus, a query can scan over all the records by reading the disk one track at a time. This is not a bad way to use a disk-like device.

Mike points out that all the classic database systems—well, at least most—store records in this manner. Their code, which is also huge (if not ginormous), is tuned to handle data that is stored in this manner. Let’s call it the “record ordered method” (ROM). As a mathematical idea it is just storing the array in row-major order. Not only is this a perfectly fine way to organize the data and store the array, it also respects a principle that goes back to COBOL in the 1950s: each data object should be conceptually and physically together.
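To make the layout concrete, here is a minimal Python sketch of the ROM idea. The schema, the fixed field widths, and the function names are all invented for illustration; no real system is this naive.

```python
import struct

# A toy schema: name (20 bytes), address (40 bytes), age (4-byte int).
# The field names and widths are invented for illustration.
RECORD_FORMAT = "<20s40si"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 64 bytes per record

def write_rom(records, path):
    """Record ordered method (ROM): lay each record down contiguously,
    one after the other, just as it would sit on a disk track."""
    with open(path, "wb") as f:
        for name, address, age in records:
            f.write(struct.pack(RECORD_FORMAT,
                                name.encode(), address.encode(), age))

def scan_rom(path):
    """A full scan must read every byte of every record, even when a
    query cares about only one field."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD_SIZE):
            name, address, age = struct.unpack(RECORD_FORMAT, chunk)
            yield (name.rstrip(b"\x00").decode(),
                   address.rstrip(b"\x00").decode(), age)
```

Note that a query wanting only ages still pulls every 64-byte record off the disk. That is exactly the cost the trick below avoids.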

But there is a better way.

The Trick

The trick to the 50x speedup is based on the deep, advanced, complex operation that we in math call the transpose of a matrix. Just kidding. It is based on the simple idea that instead of storing the matrix $D$ we store the matrix $D^{T}$. Recall $D^{T}$ is just the matrix defined by

$$D^{T}_{i,j} = D_{j,i}.$$

Let’s call this the column ordered method: COM. Now the data on the disk contains

$$c_1, c_2, c_3, \dots, c_n$$

Here each $c_j$ is the $j^{th}$ column.
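The change in code is tiny. Continuing the same invented schema as before, here is a sketch of the COM layout, with one file per column standing in for one contiguous disk region per column:

```python
import struct

# Same toy schema as the ROM sketch; formats are invented for illustration.
COLUMN_FORMATS = {"name": "<20s", "address": "<40s", "age": "<i"}

def write_com(records, prefix):
    """Column ordered method (COM): store the transpose D^T.  Each field
    goes into its own file, so a whole column sits contiguously on disk."""
    columns = zip(*records)  # transpose: one tuple per column
    for (field, fmt), column in zip(COLUMN_FORMATS.items(), columns):
        with open(f"{prefix}.{field}", "wb") as f:
            for value in column:
                value = value.encode() if isinstance(value, str) else value
                f.write(struct.pack(fmt, value))

def scan_column(prefix, field):
    """Read exactly one column; the bytes of every other field never
    leave the disk."""
    fmt = COLUMN_FORMATS[field]
    size = struct.calcsize(fmt)
    with open(f"{prefix}.{field}", "rb") as f:
        while chunk := f.read(size):
            (value,) = struct.unpack(fmt, chunk)
            yield value.rstrip(b"\x00").decode() if isinstance(value, bytes) else value
```

Now a scan over ages touches only the 4-bytes-per-row age file; the name and address bytes stay on disk.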

So why is this method so much faster than the ROM? The answer lies in how queries access the data. The data is read much more often than it is written, so the key is to speed up reading. But the critical insight is this:

A query is likely to use only a few columns.

For example, suppose the query is:

Select all the records with age in the range [21,31] and cell phones with area code 404.

Then the query needs to look at only two columns. All the other fields are completely unneeded.

Now suppose the records have a hundred fields. Since the query only looks at two fields there is a huge speedup: roughly $100/2 = 50$. In the COM the database algorithm reads only the data that it needs to answer the query. In the ROM method it reads all the data, and that tremendously slows down the query. Note, things can be even more lopsided, since the sizes of fields can vary widely. So the true speedup depends on the ratio

$$\frac{\text{total size of a record}}{\text{size of the fields the query actually uses}}.$$

Clearly if a record has even one large field that is not used in the query, the speedup could be very large.
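Here is a small in-memory sketch of that query against a column store. The table, its three rows, and the helper select_rows are all made up for illustration:

```python
# Columns as parallel Python lists; invented sample data.
table = {
    "name": ["Alice", "Bob", "Carol"],
    "age":  [25, 42, 30],
    "cell": ["404-555-0101", "212-555-0102", "404-555-0103"],
    # ... imagine ninety-seven more columns that this query never touches
}

def select_rows(table):
    """Scan only the two columns the query mentions; every other column,
    however large, is never read."""
    ages, cells = table["age"], table["cell"]
    return [i for i in range(len(ages))
            if 21 <= ages[i] <= 31 and cells[i].startswith("404")]

hits = select_rows(table)
print([table["name"][i] for i in hits])  # -> ['Alice', 'Carol']
```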

How did people not realize this simple idea: replace the table $D$ by its transpose $D^{T}$? Well, they did not actually miss it, but its power was not realized until relatively recently.

Whose Trick?

As I stated earlier, no one seems to be able to say who exactly discovered the COM. Maybe as a default we could call it the Gauss Database Method, since most things are named for him. I did track down a system called TAXIR that was essentially a COM storage system, with a focus on information retrieval in biology, in 1969. The paper describing it is by George Estabrook and Robert Brill. Maybe they invented it. Perhaps their focus on biology made it hard for those in databases to notice their work? Especially years ago, before powerful on-line search engines. Perhaps.

Ken adds that in a textbook used years ago for Buffalo’s course on programming-language concepts, the COM idea was called “parallel arrays” and was frowned upon. The main reason given was that this structure was hard to maintain, as a single off-by-one indexing error in one array could damage the entire set of records. However, a high-level system can keep the data in sync, while modern machine architectures increase the reward for keeping just the data you need in caches.
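A few lines of Python suggest how such a high-level wrapper might keep parallel arrays in step. This ColumnTable class is a sketch of the idea, not any particular system’s API:

```python
class ColumnTable:
    """A thin wrapper over parallel arrays: rows can only be added or
    removed through these methods, so the columns cannot drift out of
    step the way hand-maintained parallel arrays could."""

    def __init__(self, fields):
        self._columns = {f: [] for f in fields}

    def insert(self, **row):
        if set(row) != set(self._columns):
            raise ValueError("row must supply every field exactly once")
        for field, value in row.items():
            self._columns[field].append(value)

    def delete(self, i):
        for column in self._columns.values():
            del column[i]  # removed from every column together

    def column(self, field):
        return self._columns[field]

t = ColumnTable(["name", "age"])
t.insert(name="Alice", age=25)
t.insert(name="Bob", age=42)
print(t.column("age"))  # -> [25, 42]
```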

Open Problems

Okay, maybe the trick is not worth a trillion dollars. But the total amount invested yearly in data systems suggests that the column idea could over the next few years be worth quite a few dollars.

A simple thought: Is the column method the best way to store records? Can we in theory prove it is best in some sense, or is there an even better method? So forget the million-dollar Clay prizes and go after the real money.