Anyone who spends time optimizing a Relational Database Management System (RDBMS) can tell you that the largest performance gains come from query optimization. Selecting good hardware and tuning the OS and server settings certainly matter, but tuning a query can have a drastic impact on execution time -- in certain instances going from minutes to milliseconds.

Any database server may have tens of thousands of unique queries running through it, so how do you identify which queries are worth your time to optimize? How do you track changes in performance over time? While there are existing tools for this purpose in the MySQL ecosystem, none of them met all of our needs at Box. There are also more sophisticated commercial products on the market, but their licensing fees can be prohibitive when you have many databases.

This is why we created Anemometer. We built it on top of existing open source tools which, though they provided a lot of value, didn't take the final step of visualizing the data and making it as useful as it could be. That's what Anemometer is: an open source query performance visualization tool with powerful interactive reporting features. In this post I'll show some common methods for gaining insight into MySQL queries and explain why we needed to take the next step. In the end, we were able to increase our total throughput by 4x while simultaneously reducing our average query latency by 5x. That's an astounding result -- we scaled up our database usage and made it all happen faster at the same time!

What's the server doing?


The first and best place to start gathering performance data is MySQL's Slow Query Log. This feature writes information about queries after they execute, so you can view detailed information about what happened internally. You can set a threshold and log all queries which take longer than N seconds to execute. When the feature was first introduced it had a minimum time resolution of one second -- you couldn't log anything that wasn't slower than one second, and this was a major limitation. Now, however, you can set subsecond values, which means you can effectively log all queries.

This is a huge improvement, because one of the most important metrics is the aggregate execution time for all queries that follow the same pattern -- not the execution time of a single query. If you have a heavy reporting query which runs once an hour and takes 60 seconds, and a fast query which takes 50 milliseconds and executes 10 times per second, which one should you worry about? The first one accounts for 24 * 60 = 1,440 seconds per day, but the second accounts for 24 * 60 * 60 * 10 * 0.050 = 43,200 seconds per day. This is a huge difference: if you were to spend your time cutting the execution time of just one of these queries in half, the second one would be a much bigger win. MySQL isn't really affected that much by occasional long running queries, but it is often overwhelmed by many small queries, even if each one is normally very fast.

So you've got the slow log on and now you're logging everything. But now you have a new problem: an absolute torrent of information. Not only is the log unreadable from top to bottom, you can't leave it on all the time on a busy server. The solution is sampling. You can enable the log for a short period of time and then disable it again. To process these logs you'll also need a tool such as pt-query-digest from Percona.
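As a rough sketch, a sampling run might look like the following (the log path and five-minute window are illustrative; `slow_query_log` and fractional `long_query_time` assume MySQL 5.1 or later):

[shell]# Log every query for a short sampling window, then turn the log back off.
# long_query_time accepts fractional seconds; 0 logs everything.
mysql -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 0;"
sleep 300   # capture five minutes of traffic
mysql -e "SET GLOBAL slow_query_log = OFF;"

# Summarize the sample with pt-query-digest
pt-query-digest /var/log/mysql/mysql-slow.log > slow-report.txt[/shell]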
You can use it to digest your slow logs and come up with a report like this:

[shell]# Profile
# Rank Query ID           Response time Calls R/Call Apdx V/M   Item
# ==== ================== ============= ===== ====== ==== ===== ==========
#    1 0x9A77D60B6E48BE4D 0.1615 31.9%    260 0.0006 1.00  0.00 SELECT customer address
#    2 0xB4EB3788763D984E 0.0964 19.0%      1 0.0964 1.00  0.00 SELECT information_schema.COLUMNS
#    3 0xF47B7F81FA0E3307 0.0882 17.4%      1 0.0882 1.00  0.00 SELECT information_schema.PARTITIONS
#    4 0xAD94A0AA9EAD66FF 0.0175  3.5%      1 0.0175 1.00  0.00 SELECT information_schema.VIEWS
#    5 0xF34A9B78776B18DD 0.0143  2.8%    262 0.0001 1.00  0.00 SELECT actor

# Query 1: 2.77 QPS, 0.00x concurrency, ID 0x9A77D60B6E48BE4D at byte 979815
# This item is included in the report because it matches --limit.
# Scores: Apdex = 1.00 [1.0], V/M = 0.00
# Query_time sparkline: |   -^_  |
# Time range: 2012-09-24 14:25:18 to 14:26:52
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          9     260
# Exec time     31   162ms    15us     4ms   621us     2ms   572us   657us
# Lock time     83    19ms       0   345us    73us   176us    62us    89us
# Rows sent     56     166       0       2    0.64    0.99    0.48    0.99
# Rows examine  98 104.87k       0   1.17k  413.01  592.07  331.40  592.07
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     56     166       0       2    0.64    0.99    0.48    0.99
# Bytes sent     2 312.26k   1.18k   1.38k   1.20k   1.20k   30.00   1.14k
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     9  20.29k      78      80   79.90   76.28       0   76.28
# String:
# Databases    sakila
# Hosts        localhost
# InnoDB trxID 4EE9 (1/0%), 4EEA (1/0%), 4EEB (1/0%)... 162 more
# Last errno   0
# Users        root
# Query_time distribution
#   1us
#  10us  ################################################
# 100us  ################################################################
#   1ms  ############################
#  10ms
# 100ms
#   1s
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `sakila` LIKE 'customer'\G
#    SHOW CREATE TABLE `sakila`.`customer`\G
#    SHOW TABLE STATUS FROM `sakila` LIKE 'address'\G
#    SHOW CREATE TABLE `sakila`.`address`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT * FROM customer JOIN address USING (address_id) WHERE postal_code='94065'\G[/shell]

This sample report shows a lot of aggregated statistics about queries that executed while the slow query log was enabled. At the top is a summary of queries found in the log, ordered by total execution time. Below the summary is a more detailed analysis of a single query pattern. As you can see from the execution time, this query represents 31 percent of the time our sample database server spent running queries.

Such a report helps a lot. If you've never seen statistics like this about your database server, it can be a revelation. However, if you really want the information to identify the queries in your system that need optimizing, this report may still not be enough. It covers only a small slice of time -- you really want to know how things change over time, so you can see if the query that's hammering your server today was there yesterday as well. You also probably have many database servers, and this report only shows the logs from one -- you want to see statistics over all your servers, which means you need a way to aggregate this data across multiple servers.

Where are the graphs?

Information that's hard to tease out of text files can be discovered instantly by graphing this data over time. These are the reasons that motivated us to build Anemometer. But why throw away existing work? pt-query-digest can process and save its data to a database.
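A collection run that stores digest data centrally might look something like this (the host, database, and table names are placeholders, and option names vary somewhat between Percona Toolkit versions):

[shell]# Parse a slow log and store per-query statistics in a central database.
# h=, D=, and t= are DSN parts: host, database, and table.
pt-query-digest --no-report \
  --review  h=anemometer-db,D=slow_query_log,t=global_query_review \
  --history h=anemometer-db,D=slow_query_log,t=global_query_review_history \
  /var/log/mysql/mysql-slow.log[/shell]

Run on a schedule from each database host, this builds up the historical, multi-server data set that a reporting layer can then visualize.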
We decided to build on this powerful tool and provide the reporting and visualization layer that makes the data even more useful. Again, we're using existing open source tools to build something that is functional, attractive, and familiar to almost everyone out there. We used Bootstrap and jQuery for the interface elements and interactivity, and Flot for the amazing interactive HTML canvas graphs. The following sample is a basic view of query data in Anemometer:

In this sample you can instantly notice that performance over time isn't consistent at all -- there are a lot of changes! In Anemometer, you can choose which data you want to graph, and even quickly compare different database servers all at once. Below the graph is a table which displays all queries that executed during the selected time period. The graph supports zooming: highlight a time range to drill down, and the table of queries updates dynamically so you can quickly see the exact queries which took place during that time. Graphing, grouping, sorting. It's so simple, but it adds so much power to the task of exploring your query data.

Once you find a query you want to examine, you can get additional details (such as the explain plan and table structure), keep notes, and see a history of the query's performance. This information is almost everything you need to decide how to optimize queries. One important note: at Box we append the program backtrace to each query as it's passed to the database server. This common best practice often helps us troubleshoot query issues, and makes it super easy to find the code associated with a query that needs work.
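As an illustration, a query with such a backtrace comment appended might arrive at the server looking like this (the table, file, and line number are all made up):

[shell]SELECT name, size FROM files WHERE user_id = 42
/* File: upload_handler.php, Line: 187, Function: get_user_files */[/shell]

Because the comment is part of the query text, it is written into the slow query log along with the rest of the statement, so the call site responsible for a problem query shows up right in the digested samples.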