JOINs are expensive and it most typical the fewer tables (for the same database) you join the better performance you will get. As for any rules there are however exceptions 🙂

The one I’m speaking about comes from the issue with MySQL optimizer stopping using further index key parts as soon as there is a range clause on the previous key part. So if you have INDEX(A,B) and have a where clause A BETWEEN 5 and 10 AND B=6 only the first part (A) of the index will be used which can be seriously affect performance. Of course in this example you can use index (B,A) but there are many similar cases when it is not possible.



I have described couple of solutions to this problem – using IN list instead of range or UNION which however require rather serious application changes and also can result in huge IN lists and suboptimal execution for large ranges.

Lets take a look at very typical reporting query which queries data for date range for multiple of groups (these can be devices, pages, users …. etc)

CREATE TABLE `info` ( `id` int(10) unsigned NOT NULL auto_increment, `d` date NOT NULL, `group_id` int(10) unsigned NOT NULL, `events` int(10) unsigned NOT NULL, PRIMARY KEY (`id`), KEY `d` (`d`,`group_id`) ) ENGINE=MyISAM AUTO_INCREMENT=18007591 DEFAULT CHARSET=latin1 mysql> select sum(events) from info where d between '2007-01-01' and '2007-01-31' and group_id in (10,20,30,40,50,60,70,80,90,100); +-------------+ | sum(events) | +-------------+ | 3289092 | +-------------+ 1 row in set (1.04 sec) mysql> explain select sum(events) from info where d between '2007-01-01' and '2007-01-31' and group_id in (10,20,30,40,50,60,70,80,90,100) \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: info type: range possible_keys: d key: d key_len: 7 ref: NULL rows: 355213 Extra: Using where 1 row in set (0.00 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 CREATE TABLE ` info ` ( ` id ` int ( 10 ) unsigned NOT NULL auto_increment , ` d ` date NOT NULL , ` group_id ` int ( 10 ) unsigned NOT NULL , ` events ` int ( 10 ) unsigned NOT NULL , PRIMARY KEY ( ` id ` ) , KEY ` d ` ( ` d ` , ` group_id ` ) ) ENGINE = MyISAM AUTO_INCREMENT = 18007591 DEFAULT CHARSET = latin1 mysql > select sum ( events ) from info where d between '2007-01-01' and '2007-01-31' and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) ; + -- -- -- -- -- -- - + | sum ( events ) | + -- -- -- -- -- -- - + | 3289092 | + -- -- -- -- -- -- - + 1 row in set ( 1.04 sec ) mysql > explain select sum ( events ) from info where d between '2007-01-01' and '2007-01-31' and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : range possible_keys : d key : d key_len : 7 ref : NULL rows : 355213 Extra : Using where 1 row in set ( 0.00 sec )

As you can see from the EXPLAIN this query is expected to analyze over 300.000 of rows which is relatively fast for this (in memory) table but will become unacceptable as soon as you get to do random disk IO.

Note this is also interesting case of EXPLAIN being wrong – it shows key_len=7 which corresponds to the full key while only first key part is used.

Let us now replace the range with IN list in this query:

mysql> explain select sum(events) from info where d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31') and group_id in (10,20,30,40,50,60,70,80,90,100) \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: info type: range possible_keys: d key: d key_len: 7 ref: NULL rows: 3681 Extra: Using where 1 row in set (0.01 sec) mysql> select sum(events) from info where d IN('2007-01-01','2007-01-02','2007-01-03','2007-01-04','2007-01-05','2007-01-06','2007-01-07','2007-01-08','2007-01-09','2007-01-10','2007-01-11','2007-01-12','2007-01-13','2007-01-14','2007-01-15','2007-01-16','2007-01-17','2007-01-18','2007-01-19','2007-01-20','2007-01-21','2007-01-22','2007-01-23','2007-01-24','2007-01-25','2007-01-26','2007-01-27','2007-01-28','2007-01-29','2007-01-30','2007-01-31') and group_id in (10,20,30,40,50,60,70,80,90,100); +-------------+ | sum(events) | +-------------+ | 3289092 | +-------------+ 1 row in set (0.02 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 mysql > explain select sum ( events ) from info where d IN ( '2007-01-01' , '2007-01-02' , '2007-01-03' , '2007-01-04' , '2007-01-05' , '2007-01-06' , '2007-01-07' , '2007-01-08' , '2007-01-09' , '2007-01-10' , '2007-01-11' , '2007-01-12' , '2007-01-13' , '2007-01-14' , '2007-01-15' , '2007-01-16' , '2007-01-17' , '2007-01-18' , '2007-01-19' , '2007-01-20' , '2007-01-21' , '2007-01-22' , '2007-01-23' , '2007-01-24' , '2007-01-25' , '2007-01-26' , '2007-01-27' , '2007-01-28' , '2007-01-29' , '2007-01-30' , '2007-01-31' ) and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : range possible_keys : d key : d key_len : 7 ref : NULL rows : 3681 Extra : Using where 1 row in set ( 0.01 sec ) mysql > select sum ( events ) from info where d IN ( '2007-01-01' , '2007-01-02' , '2007-01-03' , '2007-01-04' , '2007-01-05' , '2007-01-06' , '2007-01-07' , '2007-01-08' , '2007-01-09' , '2007-01-10' , '2007-01-11' , '2007-01-12' , '2007-01-13' , '2007-01-14' , '2007-01-15' , '2007-01-16' , '2007-01-17' , '2007-01-18' , '2007-01-19' , '2007-01-20' , '2007-01-21' , '2007-01-22' , '2007-01-23' , '2007-01-24' , '2007-01-25' , '2007-01-26' , '2007-01-27' , '2007-01-28' , '2007-01-29' , '2007-01-30' , '2007-01-31' ) and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) ; + -- -- -- -- -- -- - + | sum ( events ) | + -- -- -- -- -- -- - + | 3289092 | + -- -- -- -- -- -- - + 1 row in set ( 0.02 sec )

So we get same result but approximately 50 times faster. In this report we had just one month worth of data – what if you would have a year ? 5 years ? What if you get say thousands of groups at the same time ? Performing such query MySQL has to build (and do lookups) for all combinations which is 31*10=310 in this case. But if it gets to hundreds of thousands this method starts to break (and newer MySQL versions will stop using this optimization method if there are too many combinations to check).

Instead you could use JOIN to get list of days matching range from some pre-generated table and use the join to retrieve the rows from original table:

mysql> show create table dl \G *************************** 1. row *************************** Table: dl Create Table: CREATE TABLE `dl` ( `myday` date NOT NULL, PRIMARY KEY (`myday`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 1 row in set (0.00 sec) mysql> select * from dl limit 5; +------------+ | myday | +------------+ | 2001-01-01 | | 2001-01-02 | | 2001-01-03 | | 2001-01-04 | | 2001-01-05 | +------------+ 5 rows in set (0.00 sec) mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and myday=d and group_id in (10,20,30,40,50,60,70,80,90,100) \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dl type: range possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: NULL rows: 30 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: info type: range possible_keys: d key: d key_len: 7 ref: NULL rows: 355213 Extra: Using where 2 rows in set (0.00 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 mysql > show create table dl \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * Table : dl Create Table : CREATE TABLE ` dl ` ( ` myday ` date NOT NULL , PRIMARY KEY ( ` myday ` ) ) ENGINE = MyISAM DEFAULT CHARSET = latin1 1 row in set ( 0.00 sec ) mysql > select * from dl limit 5 ; + -- -- -- -- -- -- + | myday | + -- -- -- -- -- -- + | 2001 - 01 - 01 | | 2001 - 01 - 02 | | 2001 - 01 - 03 | | 2001 - 01 - 04 | | 2001 - 01 - 05 | + -- -- -- -- -- -- + 5 rows in set ( 0.00 sec ) mysql > explain select sum ( events ) from info , dl where myday between '2007-01-01' and '2007-01-31' and myday = d and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : dl type : range possible_keys : PRIMARY key : PRIMARY key_len : 3 ref : NULL rows : 30 Extra : Using where ; Using index * * * * * * * * * * * * * * * * * * * * * * * * * * * 2. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : range possible_keys : d key : d key_len : 7 ref : NULL rows : 355213 Extra : Using where 2 rows in set ( 0.00 sec )

As you can see it does not work while I know I used exactly this trick to optimize some nasty queries.

It looks like equality propagation is working here (note the number of rows for second table in join is estimated same in original query) and we get the range clause on “info” table instead nested loops join – exactly what we tried to avoid.

It is easy to block equality propagation by using some trivial function:

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and d=date(myday) and group_id in (10,20,30,40,50,60,70,80,90,100) \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dl type: range possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: NULL rows: 30 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: info type: ref possible_keys: d key: d key_len: 3 ref: func rows: 17990 Extra: Using where 2 rows in set (0.00 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 mysql > explain select sum ( events ) from info , dl where myday between '2007-01-01' and '2007-01-31' and d = date ( myday ) and group_id in ( 10 , 20 , 30 , 40 , 50 , 60 , 70 , 80 , 90 , 100 ) \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : dl type : range possible_keys : PRIMARY key : PRIMARY key_len : 3 ref : NULL rows : 30 Extra : Using where ; Using index * * * * * * * * * * * * * * * * * * * * * * * * * * * 2. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : ref possible_keys : d key : d key_len : 3 ref : func rows : 17990 Extra : Using where 2 rows in set ( 0.00 sec )

So we stopped equality propagation but now have another problem – for some reason MySQL decides to only do “ref” on the date only instead of using range on day and list of groups for each join iteration.

This does not make sense but this is how it is.

I also tried to increase cardinality by having all rows to have different group_id and it still does not work.

The trick however does work if you have just one group_id (and in this case you do not even need to trick around equity propagation to make it work)

mysql> explain select sum(events) from info,dl where myday between '2007-01-01' and '2007-01-31' and d=myday and group_id in (10) \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dl type: range possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: NULL rows: 30 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: info type: ref possible_keys: d key: d key_len: 7 ref: test.dl.myday,const rows: 18 Extra: 2 rows in set (0.00 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 mysql > explain select sum ( events ) from info , dl where myday between '2007-01-01' and '2007-01-31' and d = myday and group_id in ( 10 ) \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : dl type : range possible_keys : PRIMARY key : PRIMARY key_len : 3 ref : NULL rows : 30 Extra : Using where ; Using index * * * * * * * * * * * * * * * * * * * * * * * * * * * 2. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : ref possible_keys : d key : d key_len : 7 ref : test .dl .myday , const rows : 18 Extra : 2 rows in set ( 0.00 sec )

For original query form with single group_id query was taking 0.95 sec. The query with BETWEEN range replaced with IN list was instant 0.00 sec same as the query using join with day list table.

So we finally managed to get better performance by joining data to yet another table though why it does not work for multiple group remains question to check with MySQL Optimizer team 🙂

UPDATE: I just heard back from Igor Babaev saying it was designed this way (because the first component can run through very many values). The second component is simply not considered for range unless it is equality. You always have something to learn about MySQL Optimizer gotchas 🙂

At the same time I figured out how to make MySQL Optimizer to do what we want to do – Just add yet another table to the join so the info table just has bunch of ref lookups:

mysql> select * from g; +-----+ | gr | +-----+ | 10 | | 20 | | 30 | | 40 | | 50 | | 60 | | 70 | | 80 | | 90 | | 100 | +-----+ 10 rows in set (0.00 sec) mysql> EXPLAIN SELECT sum(events) FROM g,info,dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday=d AND group_id=g.gr \G *************************** 1. row *************************** id: 1 select_type: SIMPLE table: dl type: range possible_keys: PRIMARY key: PRIMARY key_len: 3 ref: NULL rows: 30 Extra: Using where; Using index *************************** 2. row *************************** id: 1 select_type: SIMPLE table: g type: index possible_keys: PRIMARY key: PRIMARY key_len: 4 ref: NULL rows: 10 Extra: Using index *************************** 3. row *************************** id: 1 select_type: SIMPLE table: info type: ref possible_keys: d key: d key_len: 7 ref: test.dl.myday,test.g.gr rows: 18 Extra: 3 rows in set (0.00 sec) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 mysql > select * from g ; + -- -- - + | gr | + -- -- - + | 10 | | 20 | | 30 | | 40 | | 50 | | 60 | | 70 | | 80 | | 90 | | 100 | + -- -- - + 10 rows in set ( 0.00 sec ) mysql > EXPLAIN SELECT sum ( events ) FROM g , info , dl WHERE myday BETWEEN '2007-01-01' AND '2007-01-31' AND myday = d AND group_id = g .gr \ G * * * * * * * * * * * * * * * * * * * * * * * * * * * 1. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : dl type : range possible_keys : PRIMARY key : PRIMARY key_len : 3 ref : NULL rows : 30 Extra : Using where ; Using index * * * * * * * * * * * * * * * * * * * * * * * * * * * 2. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : g type : index possible_keys : PRIMARY key : PRIMARY key_len : 4 ref : NULL rows : 10 Extra : Using index * * * * * * * * * * * * * * * * * * * * * * * * * * * 3. row * * * * * * * * * * * * * * * * * * * * * * * * * * * id : 1 select_type : SIMPLE table : info type : ref possible_keys : d key : d key_len : 7 ref : test .dl .myday , test .g .gr rows : 18 Extra : 3 rows in set ( 0.00 sec )

This query looks very scary but in fact perform much better than original one. In the real queries you can use table with ids just as we had table of days with a where clause instead of precreated table.