MySQL uses a cost-based optimizer to determine the best way to execute a query. It usually does a great job, but not all the time. There are different reasons for that. In some cases, it might not have enough information about the data and may plan queries in a non-optimal way.

The optimizer makes decisions based on statistics and some fixed costs per operation, but it does not understand the differences in hardware. For instance, disk access may have different costs depending on the type of storage used - SSD drives will have quicker access times than spindles, and can perform more operations in a given time.

However, it is possible to impact how a query will be executed, and this is the topic of today’s blog.

(Note that in MySQL 5.7, a lot of work has been done in order to improve this - users can modify the cost of different types of operations.)
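As a sketch of what MySQL 5.7 makes possible: the optimizer's cost constants are exposed in the mysql.server_cost and mysql.engine_cost tables and can be inspected and adjusted (the cost value used below is arbitrary and only for illustration):

```sql
-- MySQL 5.7+: inspect the optimizer's cost constants
SELECT cost_name, cost_value FROM mysql.server_cost;
SELECT engine_name, device_type, cost_name, cost_value FROM mysql.engine_cost;

-- Example: tell the optimizer that random disk reads are cheap (e.g. on SSD)
UPDATE mysql.engine_cost SET cost_value = 0.5 WHERE cost_name = 'io_block_read_cost';

-- Make the new costs visible to newly started sessions
FLUSH OPTIMIZER_COSTS;
```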

This is the fourteenth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Using EXPLAIN to improve SQL Queries, Database Indexing, Deep Dive pt-query-digest, Analyzing SQL Workload with pt-query-digest, Query Tuning Process, Configuration Tuning, Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.

Index statistics

One of the ways we can influence the way a query is going to be executed is by using index hints. The optimizer makes decisions about the best index for a query, and this is based on index statistics provided to it by the InnoDB engine. Let’s first see how InnoDB statistics work and how we can change them.

Historically, index statistics were recalculated from time to time. It happened when someone explicitly executed ANALYZE TABLE or the first time a table was opened. But it also happened when SHOW TABLE STATUS, SHOW TABLES or SHOW INDEX were executed. In addition to this, table statistics were updated when either 1/16th of the rows or 2 billion rows were modified in a table. This introduced a bit of instability. To calculate statistics, InnoDB performs a lookup into 8 (yes, eight!) index pages. This is 128KB of data to calculate stats for, let’s say, a 100GB index. In one way, this makes sense - the more index lookups you make, the longer it takes to update index statistics and the more I/O is needed - not something you’d like to see. On the other hand, it’s rather obvious that such a small sample may introduce large variations in the final result. Changes in InnoDB statistics, though, impact query execution plans. It is possible to change this setting via innodb_stats_sample_pages to get a bit more stable and ‘closer to reality’ statistics - but it comes at the price of more I/O.

Since MySQL 5.6.6, InnoDB statistics can be persistent (and this is the default setting). Statistics are not recalculated for every SHOW TABLE STATUS and similar commands. They are updated when an explicit ANALYZE TABLE is run on the table or when more than 10% of the rows in the table have been modified. Automatic recalculation can be turned on or off using the innodb_stats_auto_recalc variable.

When persistent statistics are enabled, InnoDB performs a lookup on 20 index pages to calculate them. It’s a bit more than what we had before. Stats are not calculated that often, though, and the impact of additional I/O when collecting them is not that high. Query plans should also be more stable because the underlying stats are more stable than before. Again, it is possible to alter this default setting by changing the innodb_stats_persistent_sample_pages variable.
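A quick sketch of how the sampling depth can be tuned - globally or per table (the page counts below are arbitrary examples, not recommendations):

```sql
-- Sample more pages globally for persistent statistics (default is 20)
SET GLOBAL innodb_stats_persistent_sample_pages = 64;

-- Or override the sample size for a single table, then rebuild its stats
ALTER TABLE rental STATS_SAMPLE_PAGES = 128;
ANALYZE TABLE rental;
```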

It is possible to disable persistent stats and revert back to the old behavior by disabling innodb_stats_persistent - in some corner cases, this may be the best option.
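A sketch of both options - reverting globally, or only for one table via the STATS_PERSISTENT table option:

```sql
-- Disable persistent statistics globally (revert to pre-5.6.6 behavior)
SET GLOBAL innodb_stats_persistent = OFF;

-- Or disable them for a single table only
ALTER TABLE rental STATS_PERSISTENT = 0;
```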

Index hints

As we now know, index statistics are just an estimate. It may happen that a query execution plan is not optimal or (even worse) it’s flapping between several versions. This is serious as it causes unstable performance.

Luckily, MySQL gives us the ability to amend query execution plans when we find them not suitable for our queries. Let’s see how we can do it.

In this example we’ll be using the ‘sakila’ schema. Assume the following query execution plan:

mysql> EXPLAIN SELECT inventory_id FROM rental WHERE rental_date > '2005-05-24 00:00:00' AND rental_date < '2005-05-28 00:00:00' AND customer_id IN (234, 123, 412, 23)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rental
         type: range
possible_keys: rental_date,idx_fk_customer_id
          key: rental_date
      key_len: 5
          ref: NULL
         rows: 483
        Extra: Using where; Using index
1 row in set (0.00 sec)

Here, we want to find the inventory_id of rentals done by some customers in some time period. The optimizer has two options when it comes to indexes - it can use the rental_date key or idx_fk_customer_id. It has chosen rental_date to perform a range scan on it. Very likely it was motivated by the fact that it’s also a covering index in this particular case.

Let’s say we don’t really want to do more I/O than we have to, and we want to use the index on the ‘customer_id’ column to perform index lookups. Please keep in mind this change will also change the access pattern for I/O operations - instead of an index scan on a covering index (sequential access), we’ll be performing random reads. Such a change may not be ideal for spindles. SSDs won’t notice it, though. We can enforce our choice on the optimizer by executing one of the following queries:

mysql> EXPLAIN SELECT inventory_id FROM rental FORCE INDEX(idx_fk_customer_id) WHERE rental_date > '2005-05-24 00:00:00' AND rental_date < '2005-05-28 00:00:00' AND customer_id IN (234, 123, 412, 23)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rental
         type: range
possible_keys: idx_fk_customer_id
          key: idx_fk_customer_id
      key_len: 2
          ref: NULL
         rows: 101
        Extra: Using index condition; Using where
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT inventory_id FROM rental USE INDEX(idx_fk_customer_id) WHERE rental_date > '2005-05-24 00:00:00' AND rental_date < '2005-05-28 00:00:00' AND customer_id IN (234, 123, 412, 23)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rental
         type: range
possible_keys: idx_fk_customer_id
          key: idx_fk_customer_id
      key_len: 2
          ref: NULL
         rows: 101
        Extra: Using index condition; Using where
1 row in set (0.00 sec)

mysql> EXPLAIN SELECT inventory_id FROM rental IGNORE INDEX(rental_date) WHERE rental_date > '2005-05-24 00:00:00' AND rental_date < '2005-05-28 00:00:00' AND customer_id IN (234, 123, 412, 23)\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: rental
         type: range
possible_keys: idx_fk_customer_id
          key: idx_fk_customer_id
      key_len: 2
          ref: NULL
         rows: 101
        Extra: Using index condition; Using where
1 row in set (0.00 sec)



What we have done is to use the ‘USE INDEX’, ‘FORCE INDEX’ and ‘IGNORE INDEX’ hints.

‘USE INDEX’ tells the optimizer that it should use one of the listed indexes. In our case we listed only one and therefore it was used.

‘FORCE INDEX’ gives basically the same behavior as ‘USE INDEX’, with the exception that with ‘USE INDEX’, the optimizer may still choose a full table scan for our query. With ‘FORCE INDEX’, a full table scan is marked as an extremely expensive operation and therefore won’t be used by the optimizer - as long as any of the listed indexes can be used for our particular query.

‘IGNORE INDEX’ tells the optimizer which indexes we don’t want. In our case we listed ‘rental_date’ as the index we want to avoid. Therefore it decided to choose another option in the query execution plan.

In another example we’ll use the ‘sakila’ schema again with one additional change:

ALTER TABLE actor ADD KEY idx_actor_first_name (first_name);

Let’s assume the following query:

mysql> EXPLAIN SELECT title, description FROM film AS f JOIN film_actor AS fa ON f.film_id = fa.film_id JOIN actor AS a ON fa.actor_id = a.actor_id WHERE a.last_name = 'MIRANDA' AND f.release_year = 2006 ORDER BY a.first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: ref
possible_keys: PRIMARY,idx_actor_last_name
          key: idx_actor_last_name
      key_len: 137
          ref: const
         rows: 1
        Extra: Using index condition; Using where; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: fa
         type: ref
possible_keys: PRIMARY,idx_fk_film_id
          key: PRIMARY
      key_len: 2
          ref: sakila.a.actor_id
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: f
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: sakila.fa.film_id
         rows: 1
        Extra: Using where
3 rows in set (0.00 sec)

Assume now that we want to use an index for sorting the result set, not for lookups on the ‘last_name’ column in ‘actor’ table, as the optimizer decided. Let’s put aside any discussions on whether such change makes any sense here (because it doesn’t).

What we can do is to ensure the ‘idx_actor_last_name’ index is not used for the JOIN and that the ‘idx_actor_first_name’ index is used for ORDER BY. We can do it like this:

mysql> EXPLAIN SELECT title, description FROM film AS f JOIN film_actor AS fa ON f.film_id = fa.film_id JOIN actor AS a IGNORE INDEX FOR JOIN (idx_actor_last_name) FORCE INDEX FOR ORDER BY(idx_actor_first_name) ON fa.actor_id = a.actor_id WHERE a.last_name LIKE 'B%' AND f.release_year = 2006 ORDER BY a.first_name\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: a
         type: index
possible_keys: PRIMARY
          key: idx_actor_first_name
      key_len: 137
          ref: NULL
         rows: 200
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: fa
         type: ref
possible_keys: PRIMARY,idx_fk_film_id
          key: PRIMARY
      key_len: 2
          ref: sakila.a.actor_id
         rows: 1
        Extra: Using index
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: f
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: sakila.fa.film_id
         rows: 1
        Extra: Using where
3 rows in set (0.00 sec)

Please note the position of the hints - they are located after the relevant table and its alias:

JOIN actor AS a IGNORE INDEX FOR JOIN (idx_actor_last_name) FORCE INDEX FOR ORDER BY(idx_actor_first_name)

Thanks to this, the optimizer can link the hints to the correct table.

Aside from … FOR JOIN and … FOR ORDER BY there’s one more hint: … FOR GROUP BY, which tells the optimizer to use an index for aggregating data.

When you use USE INDEX, FORCE INDEX or IGNORE INDEX without a FOR clause, it is the equivalent of combining all of the previously mentioned scoped hints. For example:

FORCE INDEX (idx_myindex):

FORCE INDEX FOR JOIN (idx_myindex)

FORCE INDEX FOR ORDER BY (idx_myindex)

FORCE INDEX FOR GROUP BY (idx_myindex)
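As a minimal sketch of a scoped hint in practice, here the GROUP BY variant is applied against the sakila ‘rental’ table (this uses the idx_fk_customer_id index shown earlier; the query itself is hypothetical):

```sql
-- Ask the optimizer to use idx_fk_customer_id only for the aggregation,
-- leaving it free to pick anything (or nothing) for other purposes
SELECT customer_id, COUNT(*) AS rentals
FROM rental FORCE INDEX FOR GROUP BY (idx_fk_customer_id)
GROUP BY customer_id;
```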

JOIN order modification

When you are executing any query with JOINs, the MySQL optimizer has to decide the order in which those tables should be joined. You might not be happy with the order it comes up with.

Let’s look at this query.

mysql> EXPLAIN SELECT actor_id, title FROM film_actor AS fa JOIN film AS f ON fa.film_id = f.film_id ORDER BY fa.actor_id\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: f
         type: index
possible_keys: PRIMARY
          key: idx_title
      key_len: 767
          ref: NULL
         rows: 1000
        Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: fa
         type: ref
possible_keys: idx_fk_film_id
          key: idx_fk_film_id
      key_len: 2
          ref: sakila.f.film_id
         rows: 2
        Extra: Using index
2 rows in set (0.00 sec)

Please notice that in the query, we are sorting data using the ‘film_actor.actor_id’ column. In the current query execution plan, the optimizer starts with the ‘film’ table and then joins ‘film_actor’ to it using the idx_fk_film_id index. Sorting has to be performed using the filesort algorithm and a temporary table, because the current join order makes it impossible to use any index for sorting.

Let’s say that for some reason (maybe the temporary table has too much impact), we’d prefer to avoid filesort and perform the ORDER BY using an index. It is possible, as we have an index we could use:

PRIMARY KEY (`actor_id`,`film_id`),

There’s an easy way to make such a change - we can use the STRAIGHT_JOIN hint. It can be used in two ways. If you use it as:

SELECT STRAIGHT_JOIN ...

this will mean that the tables should be joined in the exact order in which they appear in the query. So, in the case of the following type of query:

SELECT STRAIGHT_JOIN * FROM tab1 JOIN tab2 ON tab1.a = tab2.a JOIN tab3 ON tab2.b = tab3.b;

we can be sure that the tables will be joined in the following order:

tab1, tab2, tab3

STRAIGHT_JOIN can also be used within the query, instead of a JOIN, in the following way:

SELECT * FROM tab1 JOIN tab2 ON tab1.a = tab2.a STRAIGHT_JOIN tab3 ON tab2.b = tab3.b;

This forces tab3 to be joined to tab2 exactly in this order. The optimizer has the following combinations to choose from:

tab1, tab2, tab3

tab2, tab3, tab1

Please note that we are talking only about JOIN here. This is because using LEFT or RIGHT JOIN already determines how tables should be joined - STRAIGHT_JOIN won’t have any effect here.

Ok, let’s get back to our query. We are going to force JOIN in ‘film_actor’, ‘film’ order - exactly as tables appear in the query.

mysql> EXPLAIN SELECT STRAIGHT_JOIN actor_id, title FROM film_actor AS fa JOIN film AS f ON fa.film_id = f.film_id ORDER BY fa.actor_id\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: fa
         type: index
possible_keys: idx_fk_film_id
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 5462
        Extra: Using index
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: f
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 2
          ref: sakila.fa.film_id
         rows: 1
        Extra: NULL
2 rows in set (0.00 sec)

As expected, ‘film_actor’ was used as the first table in the join - this allowed the optimizer to use the primary key to sort the result set. We managed to eliminate both the filesort and the temporary table. In fact, the new query is a little bit faster than the original one. Below is a sample of data from the ‘sys’ schema created by Mark Leith.

mysql> select query, max_latency, avg_latency, rows_examined_avg from statement_analysis WHERE db='sakila' AND last_seen > (NOW() - INTERVAL 10 MINUTE) AND query like 'SELECT%'\G
*************************** 1. row ***************************
            query: SELECT `actor_id` , `title` FR ... d` ORDER BY `fa` . `actor_id`
      max_latency: 22.10 ms
      avg_latency: 16.25 ms
rows_examined_avg: 17386
*************************** 2. row ***************************
            query: SELECT STRAIGHT_JOIN `actor_id ... d` ORDER BY `fa` . `actor_id`
      max_latency: 16.75 ms
      avg_latency: 15.10 ms
rows_examined_avg: 10924
2 rows in set (0.00 sec)

As you can see, we scan fewer rows and latency also seems better.

Switchable optimizations

With time, more and more optimizations are being added to the MySQL optimizer. Some of them can be controlled by the user - we can turn them off when we find they are not suitable for our query mix. We won’t be going into the details of every optimization; instead, we’ll concentrate on what we can control and how.

You can list the whole set of switchable optimizations and their status for a current session by running:

mysql> SELECT @@optimizer_switch\G
*************************** 1. row ***************************
@@optimizer_switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,subquery_materialization_cost_based=on,use_index_extensions=on
1 row in set (0.00 sec)

The list depends on the MySQL version used, the above was taken on MySQL 5.6. MariaDB also differs from Oracle MySQL in this matter.

Each of the optimizations can be disabled at the global level. For example, to disable the index_merge optimization we can run:

mysql> SET GLOBAL optimizer_switch="index_merge=off";

To make the change at the session level we can run:

mysql> SET SESSION optimizer_switch="index_merge=off";

Let’s see how it works. We have the following query executed against the ‘sakila’ schema. It produces a query execution plan which involves the index merge optimization - to be precise, an index merge union.

mysql> EXPLAIN SELECT * FROM film_actor WHERE actor_id = 4 OR film_id = 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film_actor
         type: index_merge
possible_keys: PRIMARY,idx_fk_film_id
          key: PRIMARY,idx_fk_film_id
      key_len: 2,2
          ref: NULL
         rows: 27
        Extra: Using union(PRIMARY,idx_fk_film_id); Using where
1 row in set (0.00 sec)

This means that MySQL will use two indexes (PRIMARY and idx_fk_film_id) to perform lookups. Let’s say we don’t want this particular optimization to be used here because we know there is a better execution plan.

We could disable index merge at the global level, but that may not be the best idea if there are other queries that benefit from it. Instead, we can do it at the session level, for this particular query only.

mysql> SET SESSION optimizer_switch="index_merge=off";
Query OK, 0 rows affected (0.00 sec)

mysql> EXPLAIN SELECT * FROM film_actor WHERE actor_id = 4 OR film_id = 7\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: film_actor
         type: ALL
possible_keys: PRIMARY,idx_fk_film_id
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5462
        Extra: Using where
1 row in set (0.00 sec)

mysql> SET SESSION optimizer_switch="index_merge=on";
Query OK, 0 rows affected (0.00 sec)

Changing optimizer flags is usually the last line of defence - most of the time forcing the optimizer to use a particular index (or prevent it from using it) will be more than enough to alter the query execution plan. It’s still good to know about this option, as it may be useful every now and then.

We are closing the chapter on performance here - we’ve covered the basics, from database configuration to the query tuning process, workload analysis, indexing and EXPLAIN. We plan on covering more in the future but, starting with the next blog post of the DBA series, we’ll move to troubleshooting - how do you find out what is wrong with your database? Why does it not start correctly? What to do if you suffer from transient performance problems? If you are running a cluster, why does a node not join the cluster?