What’s a good natural key?

This is a very difficult question for most entities when you design your schema. In some rare cases, there seems to be an “obvious” candidate, such as a variety of ISO standards, including:

But even in those cases, there might be exceptions and the worst thing that can happen is a key change. Most database designs play it safe and use surrogate keys instead. Nothing wrong with that. But…

Relationship tables

There is one exception where a surrogate key is never really required. Those are relationship tables. For example, in the Sakila database, all relationship tables lack a surrogate key and use their respective foreign keys as a compound “natural” primary key instead:

So, the FILM_ACTOR table, for example, is defined as such:

CREATE TABLE film_actor ( actor_id int NOT NULL REFERENCES actor, film_id int NOT NULL REFERENCES film, CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id) );

There is really no point in adding another column FILM_ACTOR_ID or ID for an individual row in this table, even if a lot of ORMs and non-ORM-defined schemas will do this, simply for “consistency” reasons (and in a few cases, because they cannot handle compound keys).

Now, the presence or absence of such a surrogate key is usually not too relevant in every day work with this table. If you’re using an ORM, it will likely make no difference to client code. If you’re using SQL, it definitely doesn’t. You just never use that additional column.

But in terms of performance, it might make a huge difference!

Clustered indexes

In many RDBMS, when creating a table, you get to choose whether to use a “clustered index” or a “non clustered index” table layout. The main difference is:

Clustered index

… is a primary key index that “clusters” data together, which belongs together. In other words:

All the index column values are contained in the index tree structure

All the other column values are contained in the index leaf nodes

The benefit of this table layout is that primary key lookups can be much faster because your entire row is located in the index, which requires less disk I/O than the non clustered index for primary key lookups. The price for this is slower secondary index searches (e.g. searching for last names). The algorithmic complexities are:

O(log N) for primary key lookups

for primary key lookups O(log N) for secondary key lookups plus O(M log N) for projections of non-secondary-key columns (quite a high price to pay)

… where

N is the size of the table

is the size of the table M is the number of rows that are searched in secondary keys

OLTP usage often profits from clustered indexes.

Non clustered index

… is a primary key index that resides “outside” of the table structure, which is a heap table. In other words:

All the index column values are contained in the index tree structure

All the index column values and other column values are contained in the heap table

The benefit of this table layout is that all lookups are equally fast, regardless if you’re using a primary key lookup or a secondary key search. There’s always an additional, constant time heap table lookup. The algorithmic complexities are:

O(log N) for primary key lookups plus O(1) for projections of non-primary-key columns (a moderate price to pay)

for primary key lookups plus for projections of non-primary-key columns (a moderate price to pay) O(log N) for secondary key lookups plus O(M) for projections of non-secondary-key columns (a moderate price to pay)

OLAP usage definitely profits from heap tables.

Defaults

MySQL’s InnoDB offers clustered indexes only.

MySQL’s MyISAM offers heap tables only.

Oracle offers both and defaults to heap tables

PostgreSQL offers both and defaults to heap tables

SQL Server offers both and defaults to clustered indexes

Note that Oracle calls clustered indexes “index organised tables”

Performance

In this article, I’m checking MySQL’s performance as MySQL’s InnoDB doesn’t offer to switch the table layout. Curiously, the problems shown below could not be reproduced on PostgreSQL as shown by reddit user /u/ForeverAlot. Details here.

With the algorithmic complexities above, we can easily guess what I’m trying to hint at here. In the presence of a clustered index, we should avoid expensive secondary key searches when possible. Of course, these searches cannot always be avoided, but if we review the alternative design of these two tables:

CREATE TABLE film_actor_surrogate ( id int NOT NULL, actor_id int NOT NULL REFERENCES actor, film_id int NOT NULL REFERENCES film, CONSTRAINT film_actor_surrogate_pkey PRIMARY KEY (id) ); CREATE TABLE film_actor_natural ( actor_id int NOT NULL REFERENCES actor, film_id int NOT NULL REFERENCES film, CONSTRAINT film_actor_pkey PRIMARY KEY (actor_id, film_id) );

… we can see that if we’re using a clustered index here, the clustering will be made based on either:

FILM_ACTOR_SURROGATE.ID , which is a very useless clustering

, which is a very useless clustering (FILM_ACTOR_NATURAL.ACTOR_ID, FILM_ACTOR_NATURAL.FILM_ID) , which is a very useful clustering

In the latter case, whenever we look up an actor’s films, we can use the clustering index as a covering index, regardless if we project anything additional from that table or not.

In the former case, we have to rely on an additional secondary key index that contains (ACTOR_ID, FILM_ID) , and chances are that secondary index is not covering if we have additional projections.

The surrogate key clustering is really useless, because we never use the table this way.

Does it matter?

We can easily design a benchmark for this case. You can find the complete benchmark code here on GitHub, to validate the results on your environment. The benchmark uses this database design:

create table parent_1 (id int not null primary key); create table parent_2 (id int not null primary key); create table child_surrogate ( id int auto_increment, parent_1_id int not null references parent_1, parent_2_id int not null references parent_2, payload_1 int, payload_2 int, primary key (id), unique (parent_1_id, parent_2_id) ) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */ ; create table child_natural ( parent_1_id int not null references parent_1, parent_2_id int not null references parent_2, payload_1 int, payload_2 int, primary key (parent_1_id, parent_2_id) ) -- ENGINE = MyISAM /* uncomment to use MyISAM (heap tables) */ ;

Unlike in the Sakila database, we’re now adding some “payload” to the relationship table, which is not unlikely. Recent versions of MySQL will default to InnoDB, which only supports a clustered index layout. You can uncomment the ENGINE storage clause to see how this would perform with MyISAM, which only supports heap tables.

The benchmark adds:

10 000 rows in PARENT_1

100 rows in PARENT_2

1 000 000 rows in both CHILD tables (just a cross join of the above)

And then, it runs 5 iterations of 10000 repetitions of the following two queries, following our standard SQL benchmark technique:

-- Query 1 SELECT c.payload_1 + c.payload_2 AS a FROM parent_1 AS p1 JOIN child_surrogate AS c ON p1.id = c.parent_1_id WHERE p1.id = 4; -- Query 2 SELECT c.payload_1 + c.payload_2 AS a FROM parent_1 AS p1 JOIN child_natural AS c ON p1.id = c.parent_1_id WHERE p1.id = 4;

Notice that MySQL does not implement join elimination, otherwise, the useless join to PARENT_1 would be eliminated. The benchmark results are very clear:

Using InnoDB (clustered indexes)

Run 0, Statement 1 : 3104 Run 0, Statement 2 : 1910 Run 1, Statement 1 : 3097 Run 1, Statement 2 : 1905 Run 2, Statement 1 : 3045 Run 2, Statement 2 : 2276 Run 3, Statement 1 : 3589 Run 3, Statement 2 : 1910 Run 4, Statement 1 : 2961 Run 4, Statement 2 : 1897

Using MyISAM (heap tables)

Run 0, Statement 1 : 3473 Run 0, Statement 2 : 3288 Run 1, Statement 1 : 3328 Run 1, Statement 2 : 3341 Run 2, Statement 1 : 3674 Run 2, Statement 2 : 3307 Run 3, Statement 1 : 3373 Run 3, Statement 2 : 3275 Run 4, Statement 1 : 3298 Run 4, Statement 2 : 3322

You shouldn’t read this as a comparison between InnoDB and MyISAM in general, but as a comparison of the different table structures within the boundaries of the same engine. Very obviously, the additional search complexity of the badly clustered index in CHILD_SURROGATE causes a 50% slower query execution on this type of query, without gaining anything.

In the case of the heap table, the additional surrogate key column did not have any significant effect.

Again, the full benchmark can be found here on GitHub, if you want to repeat it.

Conclusion

Not everyone agrees what is generally better: clustered or non clustered indexes. Not everyone agrees on the utility of surrogate keys on every table. These are both quite opinionated discussions.

But this article clearly showed that on relationship tables, which have a very clear candidate key, namely the set of outgoing foreign keys that defines the many-to-many relationship, the surrogate key not only doesn’t add value, but it actively hurts your performance on a set of queries when your table is using a clustered index.

MySQL’s InnoDB and SQL Server use clustered indexes by default, so if you’re using any of those RDBMS, do check if you have room for significant improvement by dropping your surrogate keys.