Profiling your slow queries using pt-query-digest and some love from Percona Server

Overview

Profiling, analyzing and then fixing slow queries is probably the most frequently repeated part of a DBA's job. Yet there are not many tools that make this easier by providing the kind of per-query data points that let you attack the right queries in the right way. One such tool that I have always found myself using is pt-query-digest (formerly known as mk-query-digest).

Now let us walk through using this very nice tool.

Before We Start!

But before we start, make sure you have enabled slow query logging and set a sufficiently low long_query_time. The right value of long_query_time depends on your application's requirements; a threshold of 1 or 2 seconds is sufficient for most users. It's also typical to set long_query_time to 0 for a short period of time in order to log all queries.
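As a minimal sketch, slow query logging can be enabled in my.cnf along these lines (the log file path and threshold here are illustrative; adjust them for your own setup):

```ini
[mysqld]
# enable the slow query log and choose where it is written
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow-query.log
# log any statement that takes longer than this many seconds
long_query_time     = 1
```

On MySQL 5.1 and later these can also be changed at runtime with SET GLOBAL, without a server restart.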

Note that logging all queries this way, as opposed to using the general query log, gives us statistics recorded after each query has actually executed; the general query log records statements as they arrive and carries no execution statistics at all.

And there might be other cases when you want to log queries taking less than 1 second (down to microsecond granularity). For that you can specify a fractional value; for example, long_query_time=0.5 logs queries taking longer than half a second.
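For example, on MySQL 5.1+ the fractional threshold can be set at runtime (note that long_query_time is read by each session when it connects, so existing connections keep the old value):

```sql
-- log queries taking longer than half a second; affects new sessions
SET GLOBAL long_query_time = 0.5;
```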

Note that logging queries that take a fraction of a second is not possible in versions of MySQL < 5.1, unless you use the microslow patch developed by Percona. You can follow the guide here if you are still running MySQL < 5.1 and would like to install this patch. You should also note that on versions of MySQL < 5.1, setting long_query_time=0 actually disables slow query logging.

Installing the pt-query-digest tool (along with the other tools in Percona Toolkit) is very easy, and is explained here at this link.

Using pt-query-digest

Using pt-query-digest is pretty straightforward:

pt-query-digest /path/to/slow-query.log

Note that executing pt-query-digest can be quite CPU- and memory-intensive, so ideally you should copy the slow query log to another machine and run the analysis there.
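A typical workflow might look like the following sketch (the hostname and paths are placeholders, not real machines):

```shell
# copy the slow query log off the production box so the
# analysis does not compete with MySQL for CPU and memory
scp db1.example.com:/var/log/mysql/slow-query.log /tmp/

# analyze it on the workstation and save the report for review
pt-query-digest /tmp/slow-query.log > digest-report.txt
```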

Analyzing pt-query-digest Output

Now let’s see what output it returns. The first part of the output is an overall summary:

# 250ms user time, 20ms system time, 17.38M rss, 53.62M vsz
# Current date: Wed Dec 28 08:16:13 2011
# Hostname: somehost.net
# Files: ./slow-query.log
# Overall: 296 total, 12 unique, 0.00 QPS, 0.00x concurrency _____________
# Time range: 2011-11-26 17:44:58 to 2011-12-27 13:01:44
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time           736s      1s     23s      2s      6s      2s      1s
# Lock time            5ms       0   290us    17us   103us    43us       0
# Rows sent          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Rows examine      53.69k       0   3.51k  185.75  964.41  497.06       0
# Rows affecte          38       0       1    0.13    0.99    0.33       0
# Rows read          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Bytes sent       294.94M       0 275.07M 1020.33k 79.83k  15.63M  79.83k
# Tmp tables            16       0       4    0.05       0    0.45       0
# Tmp disk tbl           0       0       0       0       0       0       0
# Tmp tbl size       1.94M       0 496.08k   6.70k       0  56.06k       0
# Query size        36.89k      44     886  127.63  420.77  144.69   69.19

It tells us that there are a total of 296 slow-query events, which are invocations of 12 distinct queries. Following that are summaries of various other data points, such as the total and average query execution times. One thing I suggest here is that you give more weight to the values reported in the 95% (95th percentile) column, since averages can be badly skewed by a few outliers. For example, if you see a big "Tmp disk tbl" number in this summary, you know you may need to bump up the tmp_table_size variable, and the 95% value in the "Tmp tbl size" row gives you a sensible target for it. Pretty nifty!
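As a sketch of that tuning step: if the digest showed temporary tables spilling to disk, raising the in-memory limit could look like this (the 16M figure is purely illustrative; note that MySQL caps in-memory temporary tables at the smaller of tmp_table_size and max_heap_table_size, so the two are usually raised together):

```sql
-- allow in-memory temporary tables up to 16M before spilling to disk
SET GLOBAL tmp_table_size      = 16777216;
SET GLOBAL max_heap_table_size = 16777216;
```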

Let’s analyze the next part of the output produced by pt-query-digest.

# Profile
# Rank Query ID           Response time  Calls R/Call  Apdx V/M   Item
# ==== ================== ============== ===== ======= ==== ===== ========
#    1 0x92F3B1B361FB0E5B 644.9895 87.6%   244  2.6434 0.44  1.26 SELECT wp_options
#    2 0x555191621979A464  33.6349  4.6%    30  1.1212 0.65  0.03 REPLACE SELECT test.checksum test.sbtest_myisam
#    3 0x8354260420CBD34B  22.6124  3.1%     1 22.6124 0.00  0.00 SELECT customer address category

The above part of the output ranks the queries and shows the top slowest ones. As we can see, the slowest one is SELECT wp_options; this "Item" is simply a unique way of identifying the query, and implies that it is a SELECT query executed against the wp_options table.

Now let’s take a look at the most important part of the output:

# Query 1: 0.00 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 119442
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.44 [1.0], V/M = 1.26
# Query_time sparkline: |   ^    |
# Time range: 2011-12-08 17:48:20 to 2011-12-27 13:01:44
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         82     244
# Exec time     87    645s      1s     10s      3s      6s      2s      1s
# Lock time     37     2ms       0   201us     8us    60us    30us       0
# Rows sent      0   3.40k       0     211   14.28  202.40   51.53       0
# Rows examine   7   4.01k       0     252   16.84  234.30   60.55       0
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      0   3.40k       0     211   14.28  202.40   51.53       0
# Bytes sent     6  19.03M  41.35k  83.95k  79.88k  79.83k   6.92k  79.83k
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size    45  16.92k      71      71      71      71       0      71
# String:
# Databases    wp_blog_one (154/63%), wp_blog_tw... (81/33%)... 1 more
# Hosts
# InnoDB trxID 7F1910 (1/0%), 7F3860 (1/0%), 7F9F74 (1/0%)... 14 more
# Last errno   0
# Users        wp_blog_one (154/63%), wp_blog_two (81/33%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s ################################################################
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_two` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_two`.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G

This is the part of the output dealing with the analysis of the slowest query, ranked #1. The first row in the table above, Count, shows the number of times this query was executed. Now let's look at the values in the 95% column: we can clearly see that this query takes a lot of execution time (6s), sends a lot of rows (202) and a lot of data (79.83k). The "Databases" section of the output also shows the names of the databases where this query was executed. Next, the "Query_time distribution" section shows how this query's execution times are distributed; here, every execution falls in the 1s to 10s range. Finally, the "Tables" section lists the statements you can use to gather more data about the underlying tables involved and the query execution plan used by MySQL.

The end result might be that you limit the number of rows returned by the query, either with a LIMIT clause or by filtering on the option_name column.
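As a sketch of one possible fix, assuming the EXPLAIN above showed a full table scan because the autoload column is unindexed (which is common on stock WordPress schemas, but should be verified first with the SHOW CREATE TABLE statement listed above):

```sql
-- let the WHERE autoload = 'yes' predicate use an index
-- instead of scanning the whole wp_options table
ALTER TABLE wp_options ADD KEY autoload (autoload);
```

Whether an index actually helps here depends on the selectivity of autoload = 'yes'; if most rows match, trimming the result with LIMIT or a tighter WHERE clause is the better lever.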

Let’s analyze another slow query, this time the one ranked #3 by pt-query-digest.

# Query 3: 0 QPS, 0x concurrency, ID 0x8354260420CBD34B at byte 132619 ___
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.00 [1.0]*, V/M = 0.00
# Query_time sparkline: |      ^|
# Time range: all events occurred at 2011-12-23 17:07:16
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       1
# Exec time      3     23s     23s     23s     23s     23s       0     23s
# Lock time      1   101us   101us   101us   101us   101us       0   101us
# Rows sent     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Rows examine   0      24      24      24      24      24       0      24
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Bytes sent    93 275.07M 275.07M 275.07M 275.07M 275.07M       0 275.07M
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0      92      92      92      92      92       0      92
# String:
# Databases    another_db
# Hosts
# InnoDB trxID 83E808
# Last errno   0
# Users        another_db
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+ ################################################################
# Tables
#    SHOW TABLE STATUS FROM `another_db` LIKE 'customer'\G
#    SHOW CREATE TABLE `another_db`.`customer`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'address'\G
#    SHOW CREATE TABLE `another_db`.`address`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'category'\G
#    SHOW CREATE TABLE `another_db`.`category`\G
# EXPLAIN /*!50100 PARTITIONS*/
select customer.store_id, address.address, category.name from customer, address, category\G

Let’s again take a look at the 95% column in the above output. You can clearly see that the problem with this query is that it's reading and sending far too many rows (8.31M) and sending far too much data (275M). Going down to the "Tables" section, you can see why: three tables are joined without any join condition, producing a Cartesian product of all the rows in all three tables. See how easily you can pinpoint the cause of the slowness.
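A repaired version of the query would join the tables on their actual relationships. The foreign-key column names below are purely hypothetical, shown only to illustrate adding join conditions; the real columns would come from the SHOW CREATE TABLE output listed above:

```sql
SELECT customer.store_id, address.address, category.name
FROM customer
JOIN address  ON address.address_id   = customer.address_id    -- hypothetical FK
JOIN category ON category.category_id = customer.category_id;  -- hypothetical FK
```

With proper join conditions, the result size is bounded by the matching rows rather than the product of the three tables' row counts (24 rows here still multiplied out to 8.31M).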

This query was actually executed on a demo of a MySQL web client running on one of the MySQL servers managed by me, giving me a good reason to turn on the safe-updates option.
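For reference, safe-updates can be enabled per session; among other things it rejects UPDATE and DELETE statements whose WHERE clause does not use a key:

```sql
-- refuse key-less UPDATE/DELETE statements in this session
SET SESSION sql_safe_updates = 1;
```

The mysql command-line client also accepts --safe-updates (aka --i-am-a-dummy), which additionally caps SELECT result and join sizes.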

Gather Even More Stats with some extra love from Percona Server!

My colleagues at Percona have built much more verbosity into what is output in the slow query log, giving you even more insight into what is actually going on behind the scenes during the execution of a particular slow query. This comes through the addition of a new variable, available in Percona Server 5.1 and above, called log_slow_verbosity. You can read more about this variable and some other diagnostics added to Percona Server at this link. The variable log_slow_verbosity can take the values ‘microtime’, ‘query_plan’, ‘innodb’ and ‘full’. Let’s turn on log_slow_verbosity as follows:

set session log_slow_verbosity='microtime,query_plan,innodb';
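To make this the default for all sessions rather than just the current one, the equivalent setting can go in my.cnf (Percona Server 5.1+; shown as a sketch):

```ini
[mysqld]
log_slow_verbosity = microtime,query_plan,innodb
```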

And see how verbose the new entry in the slow log is:

# Time: 111228 11:52:30
# User@Host: root[root] @ localhost []
# Thread_id: 57  Schema: mywebsql_demo  Last_errno: 0  Killed: 0
# Query_time: 204.981516  Lock_time: 0.000133  Rows_sent: 10000  Rows_examined: 8721904  Rows_affected: 0  Rows_read: 10000
# Bytes_sent: 164952  Tmp_tables: 2  Tmp_disk_tables: 1  Tmp_table_sizes: 1995142008
# InnoDB_trx_id: 856C30
# QC_Hit: No  Full_scan: Yes  Full_join: Yes  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: Yes  Merge_passes: 662
#   InnoDB_IO_r_ops: 1  InnoDB_IO_r_bytes: 16384  InnoDB_IO_r_wait: 0.034391
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 11
SET timestamp=1325073150;
select customer.store_id, address.address, category.name from customer, address, (select * from category) as category order by customer.store_id limit 10000;

See how much more information the slow query log contains now: it reports everything from InnoDB I/O statistics to whether an on-disk filesort was needed for the query. You can clearly see this is much more helpful than the regular slow query statistics reported by vanilla MySQL.

Conclusion

The only conclusion I can draw is: "Get yourself Percona Server, turn on log_slow_verbosity and start using pt-query-digest"; your job of identifying bottleneck queries will be all the simpler for it.