My post almost 2 years ago about checking for PostgreSQL bloat is still one of the most popular ones on my blog (according to Google Analytics anyway). Since that’s the case, I’ve gone and changed the URL to my old post and reused that one for this post. I’d rather people be directed to correct and current information as quickly as possible instead of adding an update to my old post pointing to a new one. I’ve included my summary on just what exactly bloat is again below since that seemed to be the most popular part.

UPDATE: I wrote a followup post on methods for actually Cleaning Up PostgreSQL Bloat once you’ve identified it.

The intent of the original post was to discuss a Python script I’d written for monitoring bloat status: pg_bloat_check.py. Since that time, I’ve noticed that the query used in v1.x of that script (obtained from the check_postgres.pl module) was not always accurate and often failed to report bloat that I knew for a fact was there (e.g., I just deleted over 300 million rows, vacuumed & analyzed the table, and still no bloat? Sure, it could happen, but it’s highly unlikely). So I kept looking around and discovered the pgstattuple contrib module that comes with PostgreSQL. After discussing it with several of the core developers at recent PostgreSQL conferences (PGConf US & PGCon), I believe this is a much, much better way to get an accurate assessment of the bloat situation. This encouraged me to do a near complete rewrite of my script, and v2.0.0 is now available. It’s not a drop-in replacement for v1.x, so please check the --help for new options.

pgstattuple is a very simple but powerful extension. It doesn’t require any additional libraries to be loaded; it just adds a few functions you can call on database objects to get some statistics about them. The key function for bloat is the default one, pgstattuple(regclass), which returns information about live & dead tuples and the free space contained in the given object. If you read the description below on what bloat actually is, you’ll see that those data points are exactly what we’re looking for. The difference between what this function does and what the check_postgres.pl query does is quite significant, though. The check_postgres query does its best to guess what is dead & free space based on the current statistics in the system catalogs. pgstattuple actually goes through and does a full scan on the given table or index to see what the actual situation is. This does mean the check can be very, very slow on large tables. The database I got the examples below from is 1.2TB, and a full bloat check on it takes just under 1 hour. But with the inaccuracies I’ve seen returned by the simpler query, that time can be well worth it. The script stores the gathered statistics in a table so they can be easily reviewed at any time and even used for monitoring purposes, just like check_postgres.
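For reference, here’s roughly what calling pgstattuple directly looks like; the extension just needs to be created in the database first, and the column list below is from the contrib module’s documentation (the table name is taken from the examples later in this post):

```sql
-- Install the extension (once per database)
CREATE EXTENSION pgstattuple;

-- Full-scan statistics for a single object (works on indexes too)
SELECT * FROM pgstattuple('public.group_members');
-- Returns: table_len, tuple_count, tuple_len, tuple_percent,
--          dead_tuple_count, dead_tuple_len, dead_tuple_percent,
--          free_space, free_percent
```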

Before showing what the script can do, I just want to reiterate some things from my old post because they’re important. Bloat percentage alone is a poor indicator of actual system health. Small tables may always have higher than average bloat, or there may always be 1 or 2 pages considered waste, and in reality that has next to zero impact on database performance. Constantly “debloating” them wastes more time than the space they use. So the script has filters for object size, wasted space, and wasted percentage. This allows the final output of the bloat report to provide a more accurate representation of where there may actually be problems that need to be looked into.
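To illustrate the idea behind those filters (the threshold values below are made up for the example, not the script’s actual defaults), you could apply the same kind of cutoffs yourself against the stats table the script populates:

```sql
-- Hypothetical thresholds: skip small objects and trivial waste
SELECT objectname,
       pg_size_pretty(size_bytes) AS object_size,
       pg_size_pretty(free_space_bytes) AS reusable_space,
       free_percent
FROM bloat_stats
WHERE size_bytes > 100 * 1024 * 1024           -- objects over ~100MB
  AND free_space_bytes > 1024 * 1024 * 1024    -- over ~1GB wasted
  AND free_percent > 20                        -- over 20% of the object
ORDER BY free_space_bytes DESC;
```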

Another option is a filter for individual tables or indexes to be ignored. If you understand why bloat happens, you will come across cases where a table is stuck at a certain bloat point at all times, no matter how many times you VACUUM FULL it or run pg_repack on it (those two things do remove it, but it quickly comes back). This happens with tables that have a specific level of churn with the rows being inserted, updated & deleted. The number of rows being updated/deleted is balanced with the number of rows being inserted/updated as well as the autovacuum schedule to mark space for reuse. Removing the bloat from tables like this can actually cause decreased performance because instead of re-using the space that VACUUM marks as available, Postgres has to again allocate more pages to that object from disk first before the data can be added. So bloat is actually not always a bad thing and the nature of MVCC can lead to improved write performance on some tables. On to the new script!

So as an example of why this new, slower method can be worth it, here’s the bloat report for a table and its indexes from the old script using check_postgres:

Old Table Bloat:
2. public.group_members.........................................................(9.6%) 4158 MB wasted

Old Index Bloat:
1. public.group_members_id_pk..................................................(19.5%) 4753 MB wasted
3. public.group_members_user_id_idx.............................................(9.6%) 2085 MB wasted
5. public.group_members_deleted_at_idx..........................................(6.2%) 1305 MB wasted

Here are the results from the statistics table in the new version:

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 4209 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1130 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6576 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57

Yes, all those indexes did exist before; the old query just didn’t think they had any bloat at all. There’s also a nearly 4x difference in wasted space in the table alone. It’s only 37% of the table in this case, but if you’re trying to clean up bloat due to low disk space, 12GB can be a lot. Another really nice thing pgstattuple provides is a distinction between dead tuples and reusable (free) space. You can see the dead tuple space is quite low in this example, which means autovacuum is running efficiently on this table and marking dead rows from updates & deletes as reusable. If dead tuple space is high, that could indicate autovacuum is not running properly, and you may need to adjust some of the available vacuum tuning parameters. In this case, even a normal vacuum was not freeing the reusable space back to the operating system (see below for why), so either a VACUUM FULL or a pg_repack run is required to reclaim it. Here’s the result of making a new index on user_id:

kfiske@prod=# CREATE INDEX CONCURRENTLY ON group_members USING btree (user_id);
CREATE INDEX
Time: 5308849.412 ms

$ pg_bloat_check.py -c "dbname=prod" -t public.group_members

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 42 GB       | 16 GB          | 2954 kB          |        37.84
 group_members_user_id_idx         | 21 GB       | 14 GB          | 1168 kB          |        64.79
 group_members_id_pk               | 24 GB       | 16 GB          | 4317 kB          |        68.96
 group_members_deleted_at_idx      | 20 GB       | 13 GB          | 3025 kB          |        63.77
 group_members_group_id_user_id_un | 11 GB       | 4356 MB        | 6784 bytes       |        38.06
 group_members_group_id_idx        | 17 GB       | 9951 MB        | 0 bytes          |         56.8
 group_members_updated_at_idx      | 15 GB       | 7424 MB        | 0 bytes          |        49.57
 group_members_user_id_idx1        | 8319 MB     | 817 MB         | 336 bytes        |         9.83

You can see the new index, group_members_user_id_idx1, is now down to only 9% wasted space and is much smaller. Here’s the result after running pg_repack to clear both the table and all index bloat:

kfiske@prod=# select objectname, pg_size_pretty(size_bytes) as object_size, pg_size_pretty(free_space_bytes) as reusable_space, pg_size_pretty(dead_tuple_size_bytes) dead_tuple_space, free_percent from bloat_stats;
            objectname             | object_size | reusable_space | dead_tuple_space | free_percent
-----------------------------------+-------------+----------------+------------------+--------------
 group_members                     | 25 GB       | 27 MB          | 79 kB            |          0.1
 group_members_id_pk               | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_user_id_idx         | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_deleted_at_idx      | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_group_id_user_id_un | 7818 MB     | 768 MB         | 0 bytes          |         9.83
 group_members_group_id_idx        | 8319 MB     | 818 MB         | 0 bytes          |         9.83
 group_members_updated_at_idx      | 8318 MB     | 818 MB         | 0 bytes          |         9.83
(7 rows)

PostgreSQL 9.5 introduced the pgstattuple_approx(regclass) function, which takes advantage of visibility map statistics to speed up gathering tuple statistics, possibly sacrificing some accuracy since it’s not hitting each individual tuple. It only works on tables, though. This option is available in the script via the --quick argument. There’s also pgstatindex(regclass), which gives more details on index pages and how the data in them is laid out, but I haven’t found a use for that in the script yet.
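Calling the approximate version directly follows the same pattern as pgstattuple itself; per the contrib documentation, its output uses approx_* column names, and scanned_percent tells you how much of the table was actually scanned rather than estimated:

```sql
-- Tables only; uses the visibility map to skip all-visible pages
SELECT approx_free_space, approx_free_percent, scanned_percent
FROM pgstattuple_approx('public.group_members');
```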

The same output options the old script had are still available: --simple provides a text summary useful for emails, & --dict outputs a python dictionary that provides structured output and greater detail on the raw statistics (basically just the data straight from the table). UPDATE: As of version 2.1.0 of the script, the --json & --jsonpretty options have been added and are the preferred structured output format unless you actually need a python dictionary. The table inside the database provides a new, easy method for reviewing the bloat information as well; just be aware it is rebuilt from scratch every time the script runs. There’s also a new option, which I used above (-t, --tablename), to get the bloat information on just a single table. See the --help for more information on all the available options.

Why Bloat Happens

For those of you newer to PostgreSQL administration, for whom this may be the first time hearing about bloat, I figured I’d take the time to explain why this scenario exists and why tools like this are necessary (until they’re hopefully built into the database itself someday). It’s something most people don’t understand unless someone first explains it to them, or they run into the headaches it causes when it’s not monitored and learn about it the hard way.

MVCC (multi-version concurrency control) is how Postgres has chosen to deal with multiple transactions/sessions hitting the same rows at (nearly) the same time. The documentation, along with Wikipedia, provides excellent and extensive explanations of how it all works, so I refer you there for the details. Bloat is the result of one particular part of MVCC: the handling of updates and deletes.

Whenever you delete a row, it’s not actually deleted; it is only marked as unavailable to all transactions that begin after the delete occurs. The same happens with an update: the old version of the row is kept active until all currently running transactions have finished, then it is marked as unavailable. I emphasize the word unavailable because the row still exists on disk; it’s just not visible any longer. The VACUUM process in Postgres then comes along and marks any unavailable rows as space that is now available for future inserts or updates. The autovacuum process is configured to run VACUUM automatically after a certain number of writes to a table (follow the link for the configuration options), so it’s not something you typically have to do manually very often (at least with more modern versions of Postgres).

People often assume that VACUUM is the process that returns disk space to the file system. It does do this, but only in very specific cases. The used space is contained in page files that make up the tables and indexes (called objects from now on) in the Postgres database system. Page files are all the same size, and differently sized objects simply have as many page files as they need. If VACUUM happens to mark every row in a page file as unavailable, AND that page also happens to be the final page of the entire object, THEN the disk space is returned to the file system. If there is a single available row, or the page file is any one other than the last, the disk space is never returned by a normal VACUUM. This is bloat. Hopefully this explanation of what bloat actually is shows how it can sometimes be advantageous for certain table usage patterns, and why I’ve included the option to ignore objects in the report.
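You can watch this behavior yourself with pg_relation_size(). A sketch (some_table and the sizes are purely illustrative):

```sql
SELECT pg_size_pretty(pg_relation_size('some_table'));  -- e.g. 10 GB

DELETE FROM some_table WHERE created_at < now() - interval '1 year';
VACUUM some_table;

-- Unless the deleted rows happened to occupy the trailing pages of the
-- object, the file stays the same size; the space is only marked reusable.
SELECT pg_size_pretty(pg_relation_size('some_table'));
```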

If you give the VACUUM command the special flag FULL, all of that reusable space is returned to the file system. But VACUUM FULL does this by completely rewriting the entire table (and all its indexes) to new pages, and it takes an exclusive lock on the table for the entire time it runs (CLUSTER does the same thing, but what it does is outside the scope of this post). For large tables in frequent use, this is problematic. pg_repack has been the most common tool we’ve used to get around that: it recreates the table in the background, tracking changes to it, and then takes a brief lock to swap the old, bloated table with the new one.
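The two approaches side by side (the pg_repack invocation assumes the extension is already installed in the target database):

```sql
-- Rewrites the table and all its indexes; exclusive lock for the duration
VACUUM FULL group_members;

-- pg_repack alternative, run from the shell; only a brief lock at swap time:
--   $ pg_repack --dbname=prod --table=public.group_members
```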

The reason bloat is actually a problem when it gets out of hand is not just the disk space it uses up. Every time a query runs against a table, the visibility flags on individual rows and index entries are checked to see if they are actually available to that transaction. On large tables (or small tables with a lot of bloat), the time spent checking those flags builds up. This is especially noticeable with indexes, where you expect an index scan to improve your query performance yet it seems to make no difference, or is actually worse than a sequential scan of the whole table. This is why index bloat is checked independently of table bloat: a table could have little to no bloat, but one or more of its indexes could be badly bloated. Index bloat (as long as it’s not a primary key) is easier to solve because you can either just reindex that one index, or you can concurrently create a new index on the same column and then drop the old one when it’s done.
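The concurrent-swap approach for a bloated non-primary-key index looks like this (the index names just follow this post’s examples):

```sql
-- Build a replacement without blocking writes
CREATE INDEX CONCURRENTLY group_members_user_id_idx_new
    ON group_members USING btree (user_id);

-- Drop the bloated original and take over its name
DROP INDEX CONCURRENTLY group_members_user_id_idx;
ALTER INDEX group_members_user_id_idx_new RENAME TO group_members_user_id_idx;
```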

In all cases when you run VACUUM, it’s a good idea to run ANALYZE as well, either at the same time in one command or as two separate commands. This updates the internal statistics that Postgres uses when creating query plans. The number of live and dead rows in a table/index is part of how Postgres decides to plan and run your queries. It’s a much smaller factor than other statistics, but every little bit can help.
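Both forms work; one command or two:

```sql
VACUUM ANALYZE group_members;

-- or separately:
VACUUM group_members;
ANALYZE group_members;
```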

I hope this explanation of what bloat is, and how this tool can help with your database administration, has been helpful.

Also published on Medium.