Understanding InnoDB clustered indexes

You may not know it, but indexes work differently in MyISAM and in InnoDB, and the difference matters for performance. Now that InnoDB is widely used, it is important to understand how indexing works in InnoDB. Hence this post!

The first and foremost thing to know is that InnoDB uses a clustered index to store the data in a table. So what is a clustered index?

Clustered Index

A clustered index determines the physical order of data in a table. When thinking of a clustered index, think of a telephone directory, where the entries are physically arranged by last name. Because the clustered index decides the physical storage order of the data, a table can have only one clustered index. However, a clustered index can consist of multiple columns (a composite index), in the same way that a telephone directory is ordered by last name and then by first name.

Clustered Index with respect to InnoDB

InnoDB stores indexes as B+tree data structures, and the clustered index is no exception. The difference is that, for the clustered index, InnoDB stores the index and the rows together in the same structure: the rows themselves live in the index’s leaf pages. This is why InnoDB tables can also be called index-organized tables.

Now let’s consider how InnoDB decides which index to use as the clustered index!

How does InnoDB select a clustered index?

With InnoDB, the PRIMARY KEY is typically synonymous with the clustered index. But what if no PRIMARY KEY exists, or there is not a single index defined on the table? Here is how InnoDB decides what to use as the clustered index:

If there is a PRIMARY KEY defined on the table, InnoDB uses it as the clustered index.

If there is no PRIMARY KEY defined on the table, InnoDB uses the first UNIQUE index where all the key columns are NOT NULL as the clustered index.

If there is no PRIMARY KEY and no suitable UNIQUE index, InnoDB internally generates a hidden clustered index (named GEN_CLUST_INDEX) on a synthetic row ID column. This row ID is a 6-byte field that increases monotonically as new rows are inserted.
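The three rules above can be sketched as a small decision function. This is only a toy model in Python, not InnoDB code: the Column and Index classes and the choose_clustered_index function are names I made up for illustration. The only real identifier is GEN_CLUST_INDEX, the name InnoDB gives the hidden clustered index.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    # Hypothetical description of a table column.
    name: str
    nullable: bool = True


@dataclass
class Index:
    # Hypothetical description of an index, in table-definition order.
    name: str
    columns: List[str]
    unique: bool = False
    primary: bool = False


def choose_clustered_index(columns: List[Column], indexes: List[Index]) -> str:
    """Return the name of the index the table would be clustered on."""
    nullable = {c.name: c.nullable for c in columns}

    # Rule 1: a PRIMARY KEY always wins.
    for idx in indexes:
        if idx.primary:
            return idx.name

    # Rule 2: otherwise, the first UNIQUE index whose columns are all NOT NULL.
    for idx in indexes:
        if idx.unique and all(not nullable[col] for col in idx.columns):
            return idx.name

    # Rule 3: otherwise, the hidden clustered index on the 6-byte row ID.
    return "GEN_CLUST_INDEX"


cols = [Column("id", nullable=False), Column("email"), Column("sku", nullable=False)]
print(choose_clustered_index(cols, [Index("uq_email", ["email"], unique=True),
                                    Index("uq_sku", ["sku"], unique=True)]))
```

Here uq_email is skipped because email is nullable, so the table would be clustered on uq_sku.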

Hence, my advice is to always define a PRIMARY KEY for every table you create. If there is no natural key you can use, add an auto-increment column and use it as the PRIMARY KEY.

Did you know that secondary indexes are tied to the PRIMARY KEY?

In InnoDB, every secondary index automatically contains the PRIMARY KEY column(s) alongside the column(s) of the secondary index itself. This follows from the way InnoDB stores data. Remember what I said earlier: the leaf pages of the clustered index hold the row data itself, so a secondary index entry does not store a pointer to the row’s physical location. It stores the row’s PRIMARY KEY value instead. In other words, the PRIMARY KEY is the pointer to the row data.

This leads us to another interesting conclusion:

A lookup through a secondary index takes two steps! First InnoDB searches the secondary index itself to find the primary key value, then it searches the clustered index by that primary key to fetch the row.
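To make the two lookups concrete, here is a minimal sketch in Python. It is a toy model and not how InnoDB is implemented: the clustered index is a dict from primary key to row, the secondary index is a sorted list of (last_name, primary key) pairs, and the employee data and the find_by_last_name function are made up for illustration.

```python
import bisect

# Toy clustered index: primary key -> full row (the row lives with the key).
clustered = {
    1: {"employee_id": 1, "last_name": "Smith", "dept": "ops"},
    2: {"employee_id": 2, "last_name": "Jones", "dept": "dev"},
    3: {"employee_id": 3, "last_name": "Smith", "dept": "dev"},
    4: {"employee_id": 4, "last_name": "Adams", "dept": "ops"},
}

# Toy secondary index on last_name: sorted (last_name, primary key) entries.
# Note that each entry carries the primary key, not a physical row address.
secondary_last_name = sorted((row["last_name"], pk) for pk, row in clustered.items())


def find_by_last_name(last_name):
    """Return all rows with the given last name, via the secondary index."""
    rows = []
    # Lookup 1: search the secondary index for the first matching entry.
    pos = bisect.bisect_left(secondary_last_name, (last_name,))
    while pos < len(secondary_last_name) and secondary_last_name[pos][0] == last_name:
        pk = secondary_last_name[pos][1]
        # Lookup 2: use the primary key to fetch the row from the clustered index.
        rows.append(clustered[pk])
        pos += 1
    return rows


print([row["employee_id"] for row in find_by_last_name("Smith")])
```

The second lookup is what a query pays for whenever it needs columns that are not in the secondary index.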

Advantages of clustering

Clustering provided by InnoDB has very significant performance benefits, some of which are mentioned below:

Because the data is physically stored in PRIMARY KEY order, lookups by PRIMARY KEY are very fast. For example, the fastest way to find a particular employee by the unique employee_id column is to make employee_id the PRIMARY KEY.

With clustering, range searches can be extremely efficient. Suppose an application frequently searches for records within a range of dates. If the table is clustered on the date, the clustered index can quickly locate the row containing the beginning date and then read the adjacent rows in order until the last date is reached, which improves the performance of range queries.
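The range scan described above can be sketched in a few lines of Python. This is a simplified model that assumes a hypothetical table clustered on a date key (one row per date, to keep it short), with the rows held in key order the way clustered leaf pages hold them.

```python
import bisect

# Rows stored in clustering-key order; ISO dates sort correctly as strings.
rows = [
    ("2024-01-03", "a"),
    ("2024-01-10", "b"),
    ("2024-02-01", "c"),
    ("2024-02-15", "d"),
    ("2024-03-01", "e"),
]
keys = [key for key, _ in rows]


def range_scan(start, end):
    """Return rows with start <= key <= end using one seek plus a sequential read."""
    lo = bisect.bisect_left(keys, start)   # seek to the first row in the range
    hi = bisect.bisect_right(keys, end)    # stop just past the last row in the range
    return rows[lo:hi]                     # adjacent rows, already in key order


print(range_scan("2024-01-05", "2024-02-15"))
```

Only the seek to the first row needs a search. Everything after that is a read of neighbouring rows.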

Another positive impact of clustering is on the performance of sorting. If a column is frequently used to sort the data retrieved from a table, it can be advantageous to cluster the table on that column, because the rows are already stored in that order and the cost of a sort is saved each time.

Also, because the clustered index holds both the index and the data together in one B+tree, retrieving rows from a clustered index is normally faster than a comparable lookup in a nonclustered index.

Secondary indexes can act as covering indexes more often than you might expect. Because every secondary index automatically includes the primary key columns, a query that needs only the indexed columns and the primary key columns can be answered from the secondary index alone.
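As a rough illustration of the covering-index point, the check below asks whether every column a query needs is available in a secondary index once the primary key columns are counted in. The is_covered function and the column names are hypothetical, and this is a simplification: it says nothing about how the real optimizer decides whether to use an index.

```python
def is_covered(needed_columns, secondary_columns, primary_key_columns):
    """True if a query needing `needed_columns` can be served from the secondary index alone."""
    # In InnoDB the secondary index implicitly carries the primary key columns.
    available = set(secondary_columns) | set(primary_key_columns)
    return set(needed_columns) <= available


# Query needs last_name and employee_id; index is on last_name; PK is employee_id.
print(is_covered(["last_name", "employee_id"], ["last_name"], ["employee_id"]))

# Query also needs salary, which is in neither the index nor the primary key.
print(is_covered(["last_name", "salary"], ["last_name"], ["employee_id"]))
```

In the first case the second lookup into the clustered index can be skipped. In the second it cannot.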

The benefits I have mentioned can boost performance drastically if you design your tables and queries accordingly. But clustered indexes have disadvantages as well.

Disadvantages of clustering

Following are some of the disadvantages of clustering:

If the clustering key is large (wide), every secondary index defined on the same table will be significantly larger too, because each secondary index entry contains a copy of the clustering key.
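A back-of-the-envelope calculation shows how quickly this adds up. The sketch below assumes a hypothetical table of 10 million rows with a secondary index on a 4-byte INT column, and compares an 8-byte BIGINT primary key with a CHAR(36) primary key in a single-byte character set (36 bytes). It counts only key bytes and ignores per-record overhead and page fill, so treat the results as lower bounds and not as real index sizes.

```python
def secondary_index_key_bytes(row_count, secondary_key_bytes, primary_key_bytes):
    """Key bytes stored in a secondary index: each entry holds the indexed
    column(s) plus a copy of the primary key."""
    return row_count * (secondary_key_bytes + primary_key_bytes)


ROWS = 10_000_000     # hypothetical table size
INT_BYTES = 4         # secondary index on an INT column
BIGINT_BYTES = 8      # narrow primary key
CHAR36_BYTES = 36     # wide primary key, e.g. a UUID stored as CHAR(36), single-byte charset

narrow = secondary_index_key_bytes(ROWS, INT_BYTES, BIGINT_BYTES)
wide = secondary_index_key_bytes(ROWS, INT_BYTES, CHAR36_BYTES)

print(f"narrow PK: {narrow / 1e6:.0f} MB of key data")
print(f"wide PK:   {wide / 1e6:.0f} MB of key data")
```

The extra bytes are paid again for every secondary index on the table.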

Because of the way the data is stored, a lookup through a secondary index requires two lookups, unless the secondary index covers the query.

Clustering can be expensive when the clustering key columns change frequently, because changing a row’s key forces InnoDB to move the row to a new location.

Insertions can be slow if the data is not inserted in PRIMARY KEY order, so insert speed depends heavily on insertion order. Inserting rows in primary key order is the fastest way to load data into an InnoDB table.
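The effect of insertion order can be seen even in a toy model of leaf pages. The sketch below is my own simplification and not InnoDB's actual page-split logic: a page holds a fixed number of keys (PAGE_CAPACITY is an arbitrary value, whereas real pages are sized in bytes), a full page that receives a key in the middle is split in half, and a key appended past the end of the last page simply starts a new page.

```python
import bisect
import random

PAGE_CAPACITY = 10  # hypothetical; chosen only to keep the example small


def load(keys):
    """Insert unique keys one by one into sorted leaf pages; return (page_count, split_count)."""
    pages = [[]]
    splits = 0
    for key in keys:
        # Locate the target page: separators are the first keys of pages 1..n.
        first_keys = [page[0] for page in pages[1:]]
        i = bisect.bisect_right(first_keys, key)
        page = pages[i]
        if len(page) < PAGE_CAPACITY:
            bisect.insort(page, key)
            continue
        splits += 1
        if i == len(pages) - 1 and key > page[-1]:
            # In-order insert at the very end: leave the full page alone.
            pages.append([key])
        else:
            # Out-of-order insert into a full page: split it in half.
            mid = len(page) // 2
            left, right = page[:mid], page[mid:]
            bisect.insort(left if key < right[0] else right, key)
            pages[i:i + 1] = [left, right]
    return len(pages), splits


ordered_keys = list(range(1000))
shuffled_keys = ordered_keys[:]
random.Random(42).shuffle(shuffled_keys)

seq_pages, seq_splits = load(ordered_keys)
rand_pages, rand_splits = load(shuffled_keys)
print("in key order:", seq_pages, "pages")
print("random order:", rand_pages, "pages")
```

In this model, in-order inserts leave every page full, while random inserts leave half-empty pages behind after each split, so the same data ends up spread over more pages.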

Update (thanks to sunny):

Following is another thing that one should know regarding secondary indexes:

Records in InnoDB secondary indexes are never updated in place. This means that an UPDATE of a column that belongs to a secondary index amounts to deleting the old index record and inserting a new one.
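In terms of a toy model, where a secondary index is a sorted list of (value, primary key) entries, such an update looks like the sketch below. The update_indexed_column function is a name I made up, and the real engine does more than this (the old record is delete-marked and purged later instead of being removed immediately), so read it only as an illustration of the delete-plus-insert idea.

```python
import bisect


def update_indexed_column(secondary, old_value, new_value, pk):
    """Change the indexed value for one row in a sorted secondary index.

    The entry is not modified where it sits: the old (value, pk) entry is
    removed and a new one is inserted at its correct sorted position.
    """
    pos = bisect.bisect_left(secondary, (old_value, pk))
    if pos == len(secondary) or secondary[pos] != (old_value, pk):
        raise KeyError((old_value, pk))
    del secondary[pos]                           # delete the old entry
    bisect.insort(secondary, (new_value, pk))    # insert the new entry


index_on_last_name = [("Adams", 4), ("Jones", 2), ("Smith", 1), ("Smith", 3)]
update_indexed_column(index_on_last_name, "Jones", "Young", 2)
print(index_on_last_name)
```

The entry for primary key 2 moves from the middle of the index to the end, because its new value sorts after everything else.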

Although I have pointed out some disadvantages, they are far outweighed by the benefits that come with clustering in InnoDB. If you study and understand the aspects I have covered in this article and apply them accordingly, you are going to see great performance improvements. After all, clustering is another important step in bringing MySQL closer to MSSQL and Oracle.