Answering questions asked on the site.

Princess asks:

Hello, I've got a problem with SQL. I have a table which contains the production details for the factories. They are not factories and items of course but I cannot disclose the project and need to obfuscate so let's pretend they are :) I need to select the items for which the first 5 factories have low production rate. I tried to do the query like this: SELECT ProductionItem FROM FactoryProductions WHERE 5 >= FactoryID AND 100 >= ProductionAmount which returns correct results but is slow. I have an index on (FactoryID, ProductionAmount) . There are 13 Factories and 2,300,000 Items This is in MySQL

This is a nice illustration of how index range evaluation works in MySQL and how to optimize it.

We will assume that the ProductionAmount is usually much higher than 100

Now, let's create a sample table:



Table creation details

CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_factory ( factory INT NOT NULL, item INT NOT NULL, production FLOAT NOT NULL, PRIMARY KEY (factory, item), KEY (factory, production) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_factory (factory, item, production) SELECT (id - 1) div 100000 + 1, (id - 1) % 100000 + 1, RAND(20091007) * 10000000 FROM filler;

This table contains 10 factories, 100,000 items and random production values from 0 to 10,000,000. This is less than your table but enough to show the principle.

Let's issue your query:

SELECT * FROM t_factory WHERE factory <= 5 AND production <= 100

View query details

factory item production 4 24768 25.5276 5 20126 8.67993 5 47547 11.3249 3 rows fetched in 0.0002s (0.4110s) id select_type table type possible_keys key key_len ref rows filtered Extra 1 SIMPLE t_factory range PRIMARY,factory PRIMARY 4 500326 100.00 Using where select `20091007_in`.`t_factory`.`factory` AS `factory`,`20091007_in`.`t_factory`.`item` AS `item`,`20091007_in`.`t_factory`.`production` AS `production` from `20091007_in`.`t_factory` where ((`20091007_in`.`t_factory`.`factory` <= 5) and (`20091007_in`.`t_factory`.`production` <= 100))

We see that the query uses a range on the index over (factory, production) but it's still slow, 400 ms.

Why is it?

If we look into the execution plan, we will see that the key_len is 4 . However, we have two columns in the index, so the total key length should be 8 ( sizeof(int) + sizeof(int) ).

This means that only a part of a key is used. MySQL uses the range to get all records with factory <= 5 then browses them and filters out those with production > 100 .

This requires browsing lots of records.

MySQL does it because it cannot get a single, countiguous range from this condition. There may be potentially infinite number of records with factory <= 5 , each on them having its own range of production <= 100 , and these ranges are not consecutive.

Though MySQL cannot do a range access method on an infinite number of ranges, it can do it allright on a finite number of ranges.

That is, if we explicitly provide the id's of the factories, each id will make a single contiguous range: (id, -Inf) <= (factory, production) <= (id, 100)

And since we have a limited number of the id 's, MySQL can combine all these ranges and get them in a single trip over the index.

Here's the query to do that:

SELECT * FROM t_factory WHERE factory IN (1, 2, 3, 4, 5) AND production <= 100

View query details

factory item production 4 24768 25.5276 5 20126 8.67993 5 47547 11.3249 3 rows fetched in 0.0002s (0.0022s) id select_type table type possible_keys key key_len ref rows filtered Extra 1 SIMPLE t_factory range PRIMARY,factory factory 8 6 100.00 Using where; Using index select `20091007_in`.`t_factory`.`factory` AS `factory`,`20091007_in`.`t_factory`.`item` AS `item`,`20091007_in`.`t_factory`.`production` AS `production` from `20091007_in`.`t_factory` where ((`20091007_in`.`t_factory`.`factory` in (1,2,3,4,5)) and (`20091007_in`.`t_factory`.`production` <= 100))

This query is instant, only 2 ms.

But what if this list is too long to generate?

We can build in runtime with the same condition. The only thing we need is to do it over the DISTINCT values factory . Since we have an index on (factory, production) , DISTINCT is very fast because it will be optimized with an index for group by :

SELECT fi.* FROM ( SELECT DISTINCT factory FROM t_factory WHERE factory <= 5 ) fd JOIN t_factory fi ON fi.factory <= fd.factory AND fi.factory >= fd.factory AND fi.production <= 100

View query details

factory item production 4 24768 25.5276 5 20126 8.67993 5 47547 11.3249 3 rows fetched in 0.0002s (0.0026s) id select_type table type possible_keys key key_len ref rows filtered Extra 1 PRIMARY <derived2> ALL 5 100.00 1 PRIMARY fi ALL PRIMARY,factory 1000653 100.00 Range checked for each record (index map: 0x3) 2 DERIVED t_factory range PRIMARY,factory PRIMARY 4 9 100.00 Using where; Using index for group-by select `20091007_in`.`fi`.`factory` AS `factory`,`20091007_in`.`fi`.`item` AS `item`,`20091007_in`.`fi`.`production` AS `production` from (select distinct `20091007_in`.`t_factory`.`factory` AS `factory` from `20091007_in`.`t_factory` where (`20091007_in`.`t_factory`.`factory` <= 5)) `fd` join `20091007_in`.`t_factory` `fi` where ((`20091007_in`.`fi`.`factory` <= `fd`.`factory`) and (`20091007_in`.`fi`.`factory` >= `fd`.`factory`) and (`20091007_in`.`fi`.`production` <= 100))

The query is split into three parts:

The range is applied to the table: only that part of the index is used which has factory <= 5 . DISTINCT clause is applied. Using index for group-by in the query plan means that the engine skips the duplicate values using the index: whenever it finds a key, it returns it and then searches for the next key. This is very fast. Finally, t_factory is joined back. Note usage of the following condition in the WHERE clause: fi.factory <= fd.factory AND fi.factory >= fd.factory . Those who read MySQL articles in my blog are familiar with it: it forces the Range checked for each record . For each distinct factory returned from the query above, the engine evaluates the condition, finds it to be the equivalent to (fd.factory, -Inf) <= (fi.factory, fi.production) <= (fi.factory, +Inf) and uses the range access to find the values for this condition.

This query is only 2 ms, also instant.

Hope that helps.

I'm always glad to answer the questions regarding database queries.

Ask me a question