I’ve just completed IBM DB2 for Linux, Unix and Windows (LUW) coverage here on Use The Index, Luke as preparation for an upcoming training I’m giving. This blog post describes the major differences I’ve found compared to the other databases I’m covering (Oracle, SQL Server, PostgreSQL and MySQL).

Free & Easy

Well, let’s face it: it’s IBM software. It has a pretty long history. You would probably not expect that it is easy to install and configure, but in fact: it is. At least DB2 LUW Express-C 10.5 (LUW is for Linux, Unix and Windows, Express-C is the free community edition). That might be another surprise: there is a free community edition. It’s not open source, but it’s free as in free beer.

No Easy Explain

The first problem I stumbled upon is that DB2 has no easy way to display an execution plan. No kidding. Here is what IBM says about it:

Explain a statement by prefixing it with explain plan for This stores the execution plan in a set of tables in the database (you’ll need to create these tables first). This is pretty much like in Oracle.

Display a stored explain plan using db2exfmt This is a command line tool, not something you can fall from an SQL prompt. To run this tool you’ll need shell access to a DB2 installation (e.g. on the server). That means, that you cannot use this tool over an regular database connection.

There is another command line tool ( db2expln ) that combines the two steps from above. Apart from the fact that this procedure is not exactly convenient, the output you get an ASCII art:

Access Plan: ----------- Total Cost: 60528.3 Query Degree: 1 Rows RETURN ( 1) Cost I/O | 49534.9 ^HSJOIN ( 2) 60528.3 68095 /-----+------\ 49534.9 10000 TBSCAN TBSCAN ( 3) ( 4) 59833.6 687.72 67325 770 | | 1.00933e+06 10000 TABLE: DB2INST1 TABLE: DB2INST1 SALES EMPLOYEES Q2 Q1

Please note that this is just an excerpt—the full output of db2exfmt has 400 lines. Quite a lot information that you’ll hardly ever need. Even the information that you need all the time (the operations) is presented in a pretty unreadable way (IMHO). I’m particularly thankful that all the numbers you see above are not labeled—that’s really the icing that renders this “tool” totally useless for the occasional user.

However, according to the IBM documentation there is another way to display an execution plan: “Write your own queries against the explain tables.” And that’s exactly what I did: I wrote a view called last_explained that does exactly what it’s name suggest: it shows the execution plan of the last statement that was explained (in a non-useless formatting):

Explain Plan ------------------------------------------------------------ ID | Operation | Rows | Cost 1 | RETURN | | 60528 2 | HSJOIN | 49535 of 10000 | 60528 3 | TBSCAN SALES | 49535 of 1009326 ( 4.91%) | 59833 4 | TBSCAN EMPLOYEES | 10000 of 10000 (100.00%) | 687 Predicate Information 2 - JOIN (Q2.SUBSIDIARY_ID = DECIMAL(Q1.SUBSIDIARY_ID, 10, 0)) JOIN (Q2.EMPLOYEE_ID = DECIMAL(Q1.EMPLOYEE_ID, 10, 0)) 3 - SARG ((CURRENT DATE - 6 MONTHS) < Q2.SALE_DATE) Explain plan by Markus Winand - NO WARRANTY http://use-the-index-luke.com/s/last_explained

I’m pretty sure many DB2 users will say that this presentation of the execution plan is confusing. And that’s OK. If you are used to the way IBM presents execution plans, just stick to what you are used to. However, I’m working with all kinds of databases and they all have a way to display the execution plan similar to the one shown above—for me this format is much more useful. Further, I’ve made a useful selection of data to display: the row count estimates and the predicate information.

You can get the source of the last_explained view from here or from GitHub (direct download). I’m serious about the no warranty part. Yet I’d like to know about problems you have with the view.

Emulating Partial Indexes is Possible

Partial indexes are indexes not containing all table rows. They are useful in three cases:

To preserve space when the index is only useful for a very small fraction of the rows. Example: queue tables. To establish a specific row order in presence of constant non-equality predicates. Example: WHERE x IN (1, 5, 9) ORDER BY y . An index like the following can be used to avoid a sort operation: CREATE INDEX … ON … (y) WHERE x IN (1, 5, 9) To implement unique constraints on a subset of rows (e.g. only those WHERE active = 'Y' ).

However, DB2 doesn’t support a where clause for indexes like shown above. But DB2 has many Oracle-compatibility features, one of them is EXCLUDE NULL KEYS : “Specifies that an index entry is not created when all parts of the index key contain the null value.” This is actually the hard-wired behaviour in the Oracle database and it is commonly exploited to emulate partial indexes in the Oracle database.

Generally speaking, emulating partial indexes works by mapping all parts of the key (all indexed columns) to NULL for rows that should not end up in the index. As an example, let’s emulate this partial index in the Oracle database (DB2 is next):

CREATE INDEX messages_todo ON messages (receiver) WHERE processed = 'N'

The solution presented in SQL Performance Explained uses a function to map the processed rows to NULL , otherwise the receiver value is passed through:

CREATE OR REPLACE FUNCTION pi_processed(processed CHAR, receiver NUMBER) RETURN NUMBER DETERMINISTIC AS BEGIN IF processed IN ('N') THEN RETURN receiver; ELSE RETURN NULL; END IF; END

It’s a deterministic function and can thus be used in an Oracle function-based index. This won’t work with DB2, because DB2 doesn’t allow user defined-functions in index definitions. However, let’s first complete the Oracle example.

CREATE INDEX messages_todo ON messages (pi_processed(processed, receiver))

This index has only rows WHERE processed IN ('N') —otherwise the function returns NULL which is not put in the index (there is no other column that could be non- NULL ). Voilà: a partial index in the Oracle database.

To use this index, just use the pi_processed function in the where clause:

SELECT message FROM messages WHERE pi_processed(processed, receiver) = ?

This is functionally equivalent to:

SELECT message FROM messages WHERE processed = 'N' AND receiver = ?

So far, so ugly. If you go for this approach, you’d better need the partial index desperately.

To make this approach work in DB2 we need two components: (1) the EXCLUDE NULL KEYS clause (no-brainer); (2) a way to map processed rows to NULL without using a user-defined function so it can be used in a DB2 index.

Although the second one might seem to be hard, it is actually very simple: DB2 can do expression based indexing, just not on user-defined functions. The mapping we need can be accomplished with regular SQL expressions:

CASE WHEN processed = 'N' THEN receiver ELSE NULL END

This implements the very same mapping as the pi_processed function above. Remember that CASE expressions are first class citizens in SQL—they can be used in DB2 index definitions (on LUW just since 10.5):

CREATE INDEX messages_not_processed_pi ON messages (CASE WHEN processed = 'N' THEN receiver ELSE NULL END) EXCLUDE NULL KEYS;

This index uses the CASE expression to map not to be indexed rows to NULL and the EXCLUDE NULL KEYS feature to prevent those row from being stored in the index. Voilà: a partial index in DB2 LUW 10.5.

To use the index, just use the CASE expression in the where clause and check the execution plan:

SELECT * FROM messages WHERE (CASE WHEN processed = 'N' THEN receiver ELSE NULL END) = ?;

Explain Plan ------------------------------------------------------- ID | Operation | Rows | Cost 1 | RETURN | | 49686 2 | TBSCAN MESSAGES | 900 of 999999 ( .09%) | 49686 Predicate Information 2 - SARG (Q1.PROCESSED = 'N') SARG (Q1.RECEIVER = ?)

Oh, that’s a big disappointment: the optimizer didn’t take the index. It does a full table scan instead. What’s wrong?

If you have a very close look at the execution plan above, which I created with my last_explained view, you might see something suspicious.

Look at the predicate information. What happened to the CASE expression that we used in the query? The DB2 optimizer was smart enough rewrite the expression as WHERE processed = 'N' AND receiver = ? . Isn’t that great? Absolutely!…except that this smartness has just ruined my attempt to use the partial index. That’s what I meant when I said that CASE expressions are first class citizens in SQL: the database has a pretty good understanding what they do and can transform them.

We need a way to apply our magic NULL -mapping but we can’t use functions (can’t be indexed) nor can we use CASE expressions, because they are optimized away. Dead-end? Au contraire: it’s pretty easy to confuse an optimizer. All you need to do is to obfuscate the CASE expression so that the optimizer doesn’t transform it anymore. Adding zero to a numeric column is always my first attempt in such cases:

CASE WHEN processed = 'N' THEN receiver + 0 ELSE NULL END

The CASE expression is essentially the same, I’ve just added zero to the RECEIVER column, which is numeric. If I use this expression in the index and the query, I get this execution plan:

ID | Operation | Rows | Cost 1 | RETURN | | 13071 2 | FETCH MESSAGES | 40000 of 40000 | 13071 3 | RIDSCN | 40000 of 40000 | 1665 4 | SORT (UNQIUE) | 40000 of 40000 | 1665 5 | IXSCAN MESSAGES_NOT_PROCESSED_PI | 40000 of 999999 | 1646 Predicate Information 2 - SARG ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0) ELSE NULL END = ?) 5 - START ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0) ELSE NULL END = ?) STOP ( CASE WHEN (Q1.PROCESSED = 'N') THEN (Q1.RECEIVER + 0) ELSE NULL END = ?)

The partial index is used as intended. The CASE expression appears unchanged in the predicate information section.

I haven’t checked any other ways to emulate partial indexes in DB2 (e.g., using partitions like in more recent Oracle versions).

As always: just because you can do something doesn’t mean you should. This approach is so ugly—even more ugly than the Oracle workaround—that you must desperately need a partial index to justify this maintenance nightmare. Further it will stop working whenever the optimizer becomes smart enough to optimize +0 away. However, then you just need put an even more ugly obfuscation in there.

INCLUDE Clause Only for Unique Indexes

With the INCLUDE clause you can add extra columns to an index for the sole purpose to allow in index-only scan when these columns are selected. I knew the INCLUDE clause before because SQL Server offers it too, but there are some differences:

In SQL Server INCLUDE columns are only added to the leaf nodes of the index—not in the root and branch nodes. This limits the impact on the B-tree’s depth when adding many or long columns to an index. This also allows to bypass some limitations (number of columns, total index row length, allowed data types). That doesn’t seem to be the case in DB2.

In DB2 the INCLUDE clause is only valid for unique indexes. It allows you to enforce the uniqueness of the key columns only—the INCLUDE columns are just not considered when checking for uniqueness. This is the same in SQL Server except that SQL Server supports INCLUDE columns on non-unique indexes too (to leverage the above-mentioned benefits).

Almost No NULLS FIRST / LAST Support until 11.1

The NULLS FIRST and NULLS LAST modifiers to the order by clause allow you to specify whether NULL values are considered as larger or smaller than non- NULL values during sorting. Strictly speaking, you must always specify the desired order when sorting nullable columns because the SQL standard doesn’t specify a default. As you can see in the following chart, the default order of NULL is indeed different across various databases:

Although DB2 LUW accepted NULLS FIRST and NULLS LAST for a while it is only fully supported since release 11.1. Previous releases only accepted a NULLS FIRST/LAST clause if the clause happened to request the default order anyway. Even in release 11.1, NULLS FIRST/LAST is not valid in a CREATE INDEX statement.

SQL:2008 FETCH FIRST but not OFFSET (until 11.1)

DB2 supports the fetch first … rows only clause for a while now—kind-of impressive considering it was “just” added with the SQL:2008 standard. However, DB2 doesn’t support the offset clause, which was introduced with the very same release of the SQL standard. Although it might look like an arbitrary omission, it is in fact a very wise move that I deeply respect. offset is the root of so much evil. In the next section, I’ll explain how to live without offset .

UPDATE: DB2 11.1 supports offset even without the below mentioned compatibility vector.

Side node: If you have code using offset that you cannot change, you can still activate the MySQL compatibility vector that makes limit and offset available in DB2. Funny enough, combining fetch first with offset is then still not possible (that would be standard compliant).

Decent Row-Value Predicates Support

SQL row-values are multiple scalar values grouped together by braces to form a single logical value. IN -lists are a common use-case:

WHERE (col_a, col_b) IN (SELECT col_a, col_b FROM…)

This is supported by pretty much every database. However, there is a second, hardly known use-case that has pretty poor support in today’s SQL databases: key-set pagination or offset -less pagination. Keyset pagination uses a where clause that basically says “I’ve seen everything up till here, just give me the next rows”. In the simplest case it looks like this:

SELECT … FROM … WHERE time_stamp < ? ORDER BY time_stamp DESC FETCH FIRST 10 ROWS ONLY

Imagine you’ve already fetched a bunch of rows and need to get the next few ones. For that you’d use the time_stamp value of the last entry you’ve got for the bind value ( ? ). The query then just return the rows from there on. But what if there are two rows with the very same time_stamp value? Then you need a tiebreaker: a second column—preferably a unique column—in the order by and where clauses that unambiguously marks the place till where you have the result. This is where row-value predicates come in:

SELECT … FROM … WHERE (time_stamp, id) < (?, ?) ORDER BY time_stamp DESC, id DESC FETCH FIRST 10 ROWS ONLY

The order by clause is extended to make sure there is a well-defined order if there are equal time_stamp values. The where clause just selects what’s after the row specified by the time_stamp and id pair. It couldn’t be any simpler to express this selection criteria. Unfortunately, neither the Oracle database nor SQLite or SQL Server understand this syntax—even though it’s in the SQL standard since 1992! However, it is possible to apply the same logic without row-value predicates—but that’s rather inconvenient and easy to get wrong.

Even if a database understands the row-value predicate, it’s not necessarily understanding these predicates good enough to make proper use of indexes that support the order by clause. This is where MySQL fails—although it applies the logic correctly and delivers the right result, it does not use an index for that and is thus rather slow. In the end, DB2 LUW (since 10.1) and PostgreSQL (since 8.4) are the only two databases that support row-value predicates in the way it should be.