I thought I would take the new subquery optimizations in MySQL 5.6 for a spin today, using the world sample database provided by MySQL for certification and training.

Typical IN subquery

This is a very typical query developers run, which historically has performed very poorly on MySQL:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT code FROM Country WHERE name = 'United States'); *************************** 1. row *************************** id: 1 select_type: PRIMARY table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3984 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: Country type: unique_subquery possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: func rows: 1 Extra: Using where 2 rows in set (0.00 sec) mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT code FROM Country WHERE name = 'United States'); *************************** 1. row *************************** id: 1 select_type: SIMPLE table: Country type: ALL possible_keys: PRIMARY key: NULL key_len: NULL ref: NULL rows: 239 Extra: Using where *************************** 2. row *************************** id: 1 select_type: SIMPLE table: City type: ref possible_keys: CountryCode key: CountryCode key_len: 3 ref: world.Country.Code rows: 1 Extra: NULL 2 rows in set (0.00 sec)

Notice that in MySQL 5.6 – the very first table accessed is Country instead of City. MySQL 5.5 was not able to recognize this as a constant, and instead executed this as a DEPENDENT SUBQUERY (aka Correlated subquery) for each row it found in the city table (an estimated 3984 rows)!

MySQL 5.6 still has a table scan on Country, but I can address that with an index on Country.name:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (name); Query OK, 0 rows affected (0.04 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT code FROM Country WHERE name = 'United States')\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3984 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: Country type: unique_subquery possible_keys: PRIMARY,Name key: PRIMARY key_len: 3 ref: func rows: 1 Extra: Using where 2 rows in set (0.00 sec) mysql5.6.11 > ALTER TABLE Country ADD INDEX (name); Query OK, 0 rows affected (0.04 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT code FROM Country WHERE name = 'United States')\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: Country type: ref possible_keys: PRIMARY,Name key: Name key_len: 52 ref: const rows: 1 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: City type: ref possible_keys: CountryCode key: CountryCode key_len: 3 ref: world.Country.Code rows: 1 Extra: NULL 2 rows in set (0.00 sec)

The index doesn’t affect MySQL 5.5 – which still executes as a DEPENDENT SUBQUERY, but take a look at MySQL 5.6 – 1 row from the Country table (from an index!) and then 1 row from the City table. This optimizes great!

More complex IN example

In this example I thought I would try to find all cities in the country with the largest population. My first attempt was to see if I could now use a LIMIT in a subquery. It looks like I’ll have to wait a bit longer:

mysql5.5.31 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1); ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery' mysql5.6.11 > select * from City WHERE CountryCode IN (SELECT Code FROM country order by population desc limit 1); ERROR 1235 (42000): This version of MySQL doesn't yet support 'LIMIT & IN/ALL/ANY/SOME subquery'

So here is my second attempt:

mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3984 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: country type: unique_subquery possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: func rows: 1 Extra: Using where *************************** 3. row *************************** id: 3 select_type: SUBQUERY table: Country type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 219 Extra: 3 rows in set (0.00 sec) mysql5.6.11 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: country type: ALL possible_keys: PRIMARY key: NULL key_len: NULL ref: NULL rows: 239 Extra: Using where *************************** 2. row *************************** id: 1 select_type: PRIMARY table: City type: ref possible_keys: CountryCode key: CountryCode key_len: 3 ref: world.country.Code rows: 1 Extra: NULL *************************** 3. row *************************** id: 3 select_type: SUBQUERY table: Country type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 239 Extra: NULL 3 rows in set (0.00 sec)

MySQL 5.5 could always optimize the population = scalar subquery, but not the IN subquery. Similar to the above example, I would expect the subqueries here should be unraveled as constants as well. If I add an index on population you can really see this happen:

mysql5.5.31 > ALTER TABLE Country ADD INDEX (population); Query OK, 0 rows affected (0.05 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql5.5.31 > EXPLAIN SELECT * FROM City WHERE CountryCode IN (SELECT Code FROM country WHERE population = (SELECT max(population) FROM Country))\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: City type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 3984 Extra: Using where *************************** 2. row *************************** id: 2 select_type: DEPENDENT SUBQUERY table: country type: unique_subquery possible_keys: PRIMARY,Population key: PRIMARY key_len: 3 ref: func rows: 1 Extra: Using where *************************** 3. row *************************** id: 3 select_type: SUBQUERY table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away 3 rows in set (0.00 sec) mysql5.6.11 > ALTER TABLE country add index (population); Query OK, 0 rows affected (0.05 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql5.6.11 > EXPLAIN select * from City WHERE CountryCode IN (SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G *************************** 1. row *************************** id: 1 select_type: PRIMARY table: country type: ref possible_keys: PRIMARY,Population key: Population key_len: 4 ref: const rows: 1 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: PRIMARY table: City type: ref possible_keys: CountryCode key: CountryCode key_len: 3 ref: world.country.Code rows: 1 Extra: NULL *************************** 3. row *************************** id: 3 select_type: SUBQUERY table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away 3 rows in set (0.00 sec)

This is looking really good in MySQL 5.6. I had a bit of a huh? moment when trying to read what the Select tables optimized away step #3 meant. This led me to try using EXPLAIN EXTENDED where I discovered a little gem:

mysql5.5.31 > SHOW WARNINGS\G *************************** 1. row *************************** Level: Note Code: 1003 Message: select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`, `world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`, `world`.`city`.`Population` AS `Population` from `world`.`city` where <in_optimizer>(`world`.`city`.`CountryCode`, <exists>(<primary_index_lookup>(<cache>(`world`.`city`.`CountryCode`) in country on PRIMARY where ((`world`.`country`.`Population` = (select max(`world`.`country`.`Population`) from `world`.`country`)) and (<cache>(`world`.`city`.`CountryCode`) = `world`.`country`.`Code`))))) 1 row in set (0.00 sec) mysql5.6.11 > show warnings\G *************************** 1. row *************************** Level: Note Code: 1003 Message: /* select#1 */ select `world`.`city`.`ID` AS `ID`,`world`.`city`.`Name` AS `Name`, `world`.`city`.`CountryCode` AS `CountryCode`,`world`.`city`.`District` AS `District`, `world`.`city`.`Population` AS `Population` from `world`.`country` join `world`.`city` where ((`world`.`city`.`CountryCode` = `world`.`country`.`Code`) and (`world`.`country`.`Population` = (/* select#3 */ select max(`world`.`country`.`Population`) from `world`.`country`))) 1 row in set (0.00 sec)

EXPLAIN EXTENDED writes the approximate query that MySQL is going to execute after the optimizer has applied any optimizations or transformations. This has been enhanced in MySQL 5.6 to add a comment for each step in the query execution (IDs match up to those in EXPLAIN). So if it was ever unclear, it is clearly this portion that is being optimized away:

mysql5.5.31 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away 1 row in set (0.00 sec) mysql5.6.11 > EXPLAIN select max(`world`.`country`.`Population`) from `world`.`country`\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: NULL type: NULL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: NULL Extra: Select tables optimized away 1 row in set (0.00 sec)

I believe what’s happening here, is during optimization MySQL opens the index population and looks at the last value (very cheap on a B-tree). So it kind of cheats and does some of the work before it has to. I’ve seen it do this before, here is a more common example of this cheating happening:

mysql5.5.31 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: City type: const possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 1 filtered: 100.00 Extra: 1 row in set, 1 warning (0.00 sec) mysql5.5.31 > show warnings\G *************************** 1. row *************************** Level: Note Code: 1003 Message: select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`, 'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` where 1 1 row in set (0.00 sec) mysql5.6.11 > EXPLAIN EXTENDED SELECT * FROM City WHERE id = 1890\G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: City type: const possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: const rows: 1 filtered: 100.00 Extra: NULL 1 row in set, 1 warning (0.00 sec) mysql5.6.11 > show warnings\G *************************** 1. row *************************** Level: Note Code: 1003 Message: /* select#1 */ select '1890' AS `ID`,'Shanghai' AS `Name`,'CHN' AS `CountryCode`,'Shanghai' AS `District`,'9696300' AS `Population` from `world`.`city` WHERE 1 1 row in set (0.00 sec)

Anyway, back to my original query. With the nesting of my IN queries I sometimes find it difficult to read the output of EXPLAIN and understand the order of execution. MySQL 5.6 also has FORMAT=JSON , which looks much nicer to me and it includes more information:

mysql5.6.11 > EXPLAIN format=json select * from City WHERE CountryCode IN (SELECT Code FROM country where population = (SELECT max(population) FROM Country))\G *************************** 1. row *************************** EXPLAIN: { "query_block": { "select_id": 1, "nested_loop": [ { "table": { "table_name": "country", "access_type": "ref", "possible_keys": [ "PRIMARY", "Population" ], "key": "Population", "used_key_parts": [ "Population" ], "key_length": "4", "ref": [ "const" ], "rows": 1, "filtered": 100, "using_index": true, "attached_condition": "(`world`.`country`.`Population` = (/* select#3 */ select max(`world`.`country`.`Population`) from `world`.`country`))", "attached_subqueries": [ { "dependent": false, "cacheable": true, "query_block": { "select_id": 3, "table": { "message": "Select tables optimized away" } } } ] } }, { "table": { "table_name": "City", "access_type": "ref", "possible_keys": [ "CountryCode" ], "key": "CountryCode", "used_key_parts": [ "CountryCode" ], "key_length": "3", "ref": [ "world.country.Code" ], "rows": 1, "filtered": 100 } } ] } } 1 row in set, 1 warning (0.00 sec)

While it’s possible that these queries could have been rewritten to be efficient joins, I really like seeing query optimizations being introduced to eliminate common paper cuts. Improving diagnostic features doesn’t hurt either 😉 I’m really looking forward to what tools can be built to take advantage of the JSON explain output.