April 16, 2015

Indexes are central to database management.

My first-ever stock analyst report, in 1982, correctly predicted that index-based DBMS would supplant linked-list ones …

… and to this day, if one wants to retrieve a small fraction of a database, indexes are generally the most efficient way to go.

Recently, I’ve had numerous conversations in which indexing strategies played a central role.

Perhaps it’s time for a round-up post on indexing. 🙂

1. First, let’s review some basics. Classically:

An index is a DBMS data structure that you probe to discover where to find the data you really want.

Indexes make data retrieval much more selective and hence faster.

While indexes make queries cheaper, they make writes more expensive — because when you write data, you need to update your index as well.

Indexes also induce costs in database size and administrative efforts. (Manual index management is often the biggest hurdle for “zero-DBA” RDBMS installations.)
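To make the classical trade-off concrete, here's a toy Python sketch (the names and structure are mine, not any real DBMS's): a dict-based index makes lookups selective, but every write must touch the index too.

```python
# Toy illustration: a "table" as a list of rows, plus an index mapping
# a column value to row positions. Not any real product's design.

class Table:
    def __init__(self):
        self.rows = []      # the heap of rows
        self.index = {}     # value -> list of row positions

    def insert(self, row):
        pos = len(self.rows)
        self.rows.append(row)
        # The write is more expensive because the index must be updated too.
        self.index.setdefault(row["id"], []).append(pos)

    def lookup(self, key):
        # Probe the index to discover where the data you want lives.
        return [self.rows[pos] for pos in self.index.get(key, [])]

    def scan(self, key):
        # The index-free alternative: examine every row.
        return [r for r in self.rows if r["id"] == key]
```

`lookup` does work proportional to the number of matches; `scan` does work proportional to the whole table — which is the entire point of indexing.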

2. Further:

A DBMS or other system can index data it doesn’t control. This is common in the case of text indexing, and not just in public search engines like Google. Performance considerations might argue against recopying text documents; so might security. This capability overlaps with but isn’t exactly the same thing as an “external tables” feature in an RDBMS.

Indexes can be updated in batch mode, rather than real time. Most famously, this is why Google invented MapReduce. Indeed, in cases where you index external data, it’s almost mandatory.

Indexes written in real time are often cleaned up in batch mode, or at least asynchronously with the writes. The most famous example is probably the rebalancing of B-trees. Append-only index writes call for later clean-up as well.
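The write-now, clean-up-later pattern can be sketched in a few lines of Python (a hypothetical stand-in for B-tree rebalancing or LSM-style merging, not any particular product's design): writes append cheaply to an unsorted tail, and a batch compaction step later folds the tail into the sorted main index.

```python
import bisect

class AppendOnlyIndex:
    """Writes append unsorted entries; batch compaction later merges
    them into the sorted main index. Illustrative sketch only."""

    def __init__(self):
        self.main = []      # sorted list of (key, pos)
        self.tail = []      # unsorted recent appends

    def add(self, key, pos):
        self.tail.append((key, pos))        # cheap append-only write

    def compact(self):
        # The asynchronous / batch clean-up step.
        self.main = sorted(self.main + self.tail)
        self.tail = []

    def find(self, key):
        # Sorted portion: binary search; unsorted tail: linear scan.
        i = bisect.bisect_left(self.main, (key,))
        hits = []
        while i < len(self.main) and self.main[i][0] == key:
            hits.append(self.main[i][1])
            i += 1
        hits += [p for (k, p) in self.tail if k == key]
        return hits
```

Reads stay correct whether or not compaction has run yet; compaction just makes them cheaper.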



3. There are numerous short-request RDBMS indexing strategies, with various advantages and drawbacks. But better indexing, as a general rule, does not a major DBMS product make.



The latest example is my former clients at Tokutek, who just got sold to Percona in a presumably small deal — regrettably without having yet paid me all the money I’m owed. (By the way, the press release for that acquisition highlights TokuDB’s advantages in compression much more than it mentions straight performance.)

In a recent conversation with my clients at MemSQL, I basically heard from Nikita Shamgunov that:

He felt that lockless indexes were essential to scale-out, and to that end …

… he picked skip lists, not because they were the optimal lockless index, but because they were good enough and a lot easier to implement than the alternatives.

(Edit: Actually, see Nikita’s comment below.)
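For readers who haven't met skip lists: here is a toy, single-threaded Python version, just to show the structure. (MemSQL's actual index is a lock-free implementation in a systems language; nothing below reflects their code.)

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.next = [None] * level   # one forward pointer per level

class SkipList:
    """Toy single-threaded skip list: sorted linked list at level 0,
    with sparser 'express lanes' stacked above it."""
    MAX_LEVEL = 8

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)

    def _random_level(self):
        # Each extra level with probability 1/2, like flipping coins.
        lvl = 1
        while lvl < self.MAX_LEVEL and random.random() < 0.5:
            lvl += 1
        return lvl

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
            update[i] = node        # last node before key at level i
        new = SkipNode(key, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def contains(self, key):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.next[i] and node.next[i].key < key:
                node = node.next[i]
        node = node.next[0]
        return node is not None and node.key == key
```

The "easier to implement than the alternatives" point shows even here: insertion touches only a handful of forward pointers, with none of a B-tree's node splits or rebalancing.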

Red-black trees are said to be better than B-trees. But they come up so rarely that I don’t really understand how they work.

solidDB did something cool with Patricia tries years ago. McObject and ScaleDB tried them too. Few people noticed or cared.

I’ll try to explain this paradox below.

4. The analytic RDBMS vendors who arose in the previous decade were generally index-averse. Netezza famously does not use indexes at all. Neither does Vertica, although its columns themselves play some of the role of indexes, especially given the flexibility in their sort orders. Others got by with much less indexing than was common in, for example, Oracle data warehouses.

Some of the reason was indexes’ drawbacks in terms of storage space and administrative overhead. Also, sequential scans from spinning disk can be much faster than more selective random retrieval, so full table scans often outperformed index-driven retrieval.

5. It is worth remembering that almost any data access method brings back more data than you really need, at least as an intermediate step. For starters, data is usually retrieved in whole pages, whether you need all their contents or not. But some indexing and index-alternative technologies go well beyond that.

To avoid doing true full table scans, Netezza relies on “zone maps”. These are a prominent example of what is now often called data skipping.
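The zone-map idea fits in a few lines of Python (a hypothetical sketch of the general data-skipping technique, not Netezza's implementation): record the min and max per block of rows, and skip any block whose range can't contain the value sought.

```python
# Data-skipping sketch: per-block (start, min, max) summaries.

def build_zone_map(values, block_size):
    zones = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zones.append((start, min(block), max(block)))
    return zones

def blocks_to_scan(zones, target):
    # Only blocks whose [min, max] range could contain the target.
    return [start for (start, lo, hi) in zones if lo <= target <= hi]
```

Note that this works best when the data is at least roughly ordered; on randomly ordered data every block's range tends to cover everything, and nothing gets skipped.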

Bloom filters in essence hash data into a short string of bits. If there’s a hash collision, excess data is returned.
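A toy Bloom filter makes the "excess data" point concrete: membership tests can yield false positives (hash collisions), but never false negatives, so the worst case is fetching rows you then discard.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: may report false positives, never false
    negatives. Parameters chosen for illustration, not tuning."""

    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # the short string of bits

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # All bits set -> "maybe present"; any bit clear -> definitely absent.
        return all(self.bits & (1 << pos) for pos in self._positions(item))
```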

Geospatial queries often want to return data for regions that have no simple representation in the database. So instead they bring back data for a superset of the desired region, which the DBMS does know how to return.

6. Geospatial indexing is actually one of the examples that gave me the urge to write this post. There are two main geospatial indexing strategies I hear about. One is the R-tree, which basically divides things up into rectangles, rectangles within those rectangles, rectangles within those smaller rectangles, and so on. A query initially brings back the data within a set of rectangles whose union contains the desired region; that intermediate result is then checked row by row for whether it belongs in the final result set.
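The filter-then-refine step at the end of that paragraph can be sketched directly (this is just the coarse-superset idea, not a real R-tree): pull everything inside a covering rectangle, then check each candidate row exactly.

```python
# Hypothetical filter-then-refine sketch for a circular query region.

def in_rect(pt, rect):
    (x, y), (x0, y0, x1, y1) = pt, rect
    return x0 <= x <= x1 and y0 <= y <= y1

def in_circle(pt, center, radius):
    # The "real" region: a disk, which the index can't represent directly.
    (x, y), (cx, cy) = pt, center
    return (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2

def circle_query(points, center, radius):
    cx, cy = center
    # Coarse filter: the bounding box is a superset of the disk.
    bbox = (cx - radius, cy - radius, cx + radius, cy + radius)
    candidates = [p for p in points if in_rect(p, bbox)]
    # Refine: check each intermediate row against the true region.
    return [p for p in candidates if in_circle(p, center, radius)]
```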

The other main approach to geospatial indexing is the space-filling curve. The idea behind this form of geospatial indexing is roughly:

For computational purposes, a geographic region is of course a lattice of points rather than a true 2-dimensional continuum.

So you take a lattice — perhaps in the overall shape of a square — and arrange its points in a sequence, so that each point is adjacent in some way to its predecessor.

Then regions on a plane are covered by subsequences (or unions of same).

The idea gets its name because, if you trace a path through the sequence of points, what you get is an approximation to a true space-filling curve.
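One common concrete instance of this idea is the Z-order (Morton) curve, sketched below in Python: interleave the bits of x and y so that 2-dimensional lattice points map to positions along one curve, and a contiguous key range corresponds (roughly) to a region of the plane. This is an illustrative example, not the only or best choice of curve.

```python
# Z-order / Morton sketch: bit-interleave coordinates into one key.

def morton_key(x, y, bits=8):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)        # x bits go to even slots
        key |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd slots
    return key

def points_in_key_range(points, lo, hi, bits=8):
    # A contiguous key range approximates a square-ish region.
    return [(x, y) for (x, y) in points if lo <= morton_key(x, y, bits) <= hi]
```

Keys 0–3, for example, cover exactly the 2×2 square of points nearest the origin.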

7. And finally — mature DBMS use multiple indexing strategies. One of the best examples of a DBMS winning largely on the basis of its indexing approach is Sybase IQ, which popularized bitmap indexing. But when last I asked, some years ago, Sybase IQ actually used 9 different kinds of indexing. Oracle surely has yet more. This illustrates that different kinds of indexes are good in different use cases, which in turn suggests obvious reasons why clever indexing rarely gives a great competitive advantage.
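To illustrate what bitmap indexing buys you (this is a toy sketch in the spirit of the technique Sybase IQ popularized, not its actual implementation): keep one bit-vector per distinct column value, and multi-predicate queries become cheap bitwise operations.

```python
# Toy bitmap index: one integer-as-bit-vector per distinct value.

def build_bitmaps(column):
    bitmaps = {}
    for pos, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)
    return bitmaps

def rows_matching(bitmap, n_rows):
    # Decode a bit-vector back into row positions.
    return [pos for pos in range(n_rows) if bitmap & (1 << pos)]
```

With this layout, a predicate like "red OR blue" is a single bitwise OR of two bitmaps — which is why bitmap indexes shine for low-cardinality columns and suffer when there are many distinct values.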
