Simple Index Checking with PostgreSQL

Matt Barr of mySidewalk takes up the Write Stuff challenge this time round, looking at quick and simple way to see how well your indexing is performing on your PostgreSQL database.

At mySidewalk, we provide planners, engineers, elected offices, and their communities with a powerful suite of engagement, statistical, and geographic tools. Our mission requires us to manage and process large, complex, and constantly growing datasets but we have a simple philosophy around performance: it’s not a feature, it’s the feature. Scale, reliability, and efficiency are mission critical, too. Using PostgreSQL as our primary database allows us to pursue these goals with its rich table and index usage statistics. I’m going to illustrate an easy way to pull and analyze statistics about your PostgreSQL database.

Assumptions: What you need in order to make use of this post

You are familiar with PostgreSQL and the basics of tables, indexes, and pages

You are familiar with the basics of I/O and memory on your database server

You would like to better understand the characteristics of a PostgreSQL database with at least default statistics collection enabled and access to the statistics views

Gathering Your Statistics: The query and its output

Gathering and arranging these statistics is a fairly short query. Each of the statistics views involved in the query is well documented (specifically pg_stat_user_tables , pg_statio_user_tables , and pg_statio_user_indexes ). Remember that pages, blocks, and buffers are often used interchangeable in the PostgreSQL community and documentation. Remember also that these statistics can be reset manually but are not usually; they often represent the life of the database (more on this later).

Column breakdown

Note that all ratios are out of 1.0 and higher is better. Other numbers are simple counts of either rows or pages.

relname - The name of the table

- The name of the table n_live_tup - The number of rows in this table

- The number of rows in this table index_use_ratio - The ratio of index use vs table scan for all queries on the table

- The ratio of index use vs table scan for all queries on the table table_page_read - The count of heap (table data) pages read from disk for the table

- The count of heap (table data) pages read from disk for the table table_page_hit - The count of heap pages read from cache (memory) for the table

- The count of heap pages read from cache (memory) for the table table_hit_ratio - The ratio of heap page hits for the table

- The ratio of heap page hits for the table indexrelname - The name of the index belonging to the table

- The name of the index belonging to the table idx_page_read - The count of index pages read from disk for the index

- The count of index pages read from disk for the index idx_page_hit - The count of index pages read from cache (memory) for the index

- The count of index pages read from cache (memory) for the index idx_hit_ratio - The ratio of index page hits for the index

Analyzing Your Ratios: Logarithmic rule of thumb

In analyzing these statistics, the ratios will be the most telling. A good index use ratio to target is highly dependent on your workload and the size of the table but for speedy answers to queries on non-trivial tables, you’ll want to be using indices (trying to push toward 100% by identifying queries that are not taking advantage of any index).

In contrast, common wisdom about page cache hit ratios is that, for most applications, .99 or higher is desirable. Further, it’s helpful to group these ratios logarithmically such that you can distinguish >.99, >.9, and everything else easily (if your application domain demands an even better hit ratio, you can just extend this logic, i.e. separate out >.999 also).

A good spreadsheet tool with conditional formatting makes this analysis trivial. Let’s walk through some real life, production examples using Google Sheets.

Example 1: Fully optimized

The above represents a highly optimized table with layout, indices, and clustering (heap order in PostgreSQL) designed for the queries it services. Indexes are used ~10,000 times as often as table scans and page hits are ~100 times as common as misses. The one remark you can make about this table is that one of the indexes is used as much as 20 times more frequently. This insight has actually sparked an internal discussion about moving the surrogate primary key to a natural key (which is represented by the first index anyway) and dropping the pkey index and column; this would only drive the other ratios even higher.

Example 2: Lower than optimal index hit ratio

We keep track of individual views of interesting content in our application for fine grained reporting and analytics. For obvious reasons, this table is written to almost constantly and contains a lot of data. From that point of view, a 97% hit ratio on the biggest most used index is impressive, however, it is under our target hit ratio of .99 and thus a topic for optimization conversation.

In this particular situation, we’ve denormalized certain aggregates on content views to another table just to get to the ratios we have. The likely next step is partitioning older data to shrink the table and its indexes and optimizing the hit ratios of the more frequently accessed newest data (consequently seeing poorer ratios on the less frequently accessed partitions).

Example 3: It hurts to watch

Historically, most applications and backend systems that serve ballot data use street segments (ranges of street addresses that receive a particular ballot combination), precincts (basically polling locations), and precinct splits (partial precincts that will receive the same ballot). In our initial attempts delivering sample ballots to users, we used the same data model as the systems from which we were aggregating the data. A street segment table is large (both in row count and row width), poorly ordered, and accessed randomly and these things show in the ratios presented.

In cases where all ratios are suboptimal, the problem most often comes down to one of data layout. We sidestepped the issue by designing a new data model depending on the geographical boundaries of specific election contests (which leverages of our geo-statistics knowledge and makes the data organized and compact). We maintain the street segment data in order to answer queries about polling location, however, so it could use optimization. A quick analysis of the sheet reveals:

There are many indexes with great variance in utilization

The table is large and scanned occasionally

Even the best index has a low hit ratio, indicative of inefficient/inconsistent access

Dropping several of the indexes then creating and clustering on a new index (optimizing the most frequent access pattern) has yielded promising results so far.



Suggestions and next steps

The method above allows you to be proactive about performance issues rather than putting out performance fires once they reach “unacceptable”. I hope you try this on your database and find the information and examples instructive. Additional ideas and resources to pursue can be found below.

There are many ways you can automate the running of this query and push its output to a conditionally formatted spreadsheet for an entire ops/engineering team to analyze (Cron, your favorite scripting language, and Google Sheets are appealing starters).

Databases with a lot of wide fields (text, geometry, blobs, etc.) will benefit from analyzing TOAST statistics.

Comparing index entires read vs tuples fetched can tell you about the selectivity of your indexes.

Some kind of plan around statistics reset should be developed, a scheduled reset (works perfectly if you retain historical statistics) or resetting individual tables (when their schema or indexes change) are good candidates.

PostgreSQL maintains statistics on functions, too; if you do a lot in your functions, these are worth looking into.

Further reading