PostgreSQL Bloat: origins, monitoring and managing

In Robert M. Wysocki's latest Write Stuff article, he looks at the wider aspects of monitoring and managing the bloat in PostgreSQL.

PostgreSQL's MVCC model provides excellent support for running multiple transactions operating on the same data set. One natural consequence of its design is the existence of so-called "database bloat". Judging by the number of questions raised on the Internet, it is quite a common problem, and not many people seem to know how to deal with it properly. I had the same issues myself and learnt a thing or two that might be helpful, so in this article I'd like to shed some light on this notion - why it's there in the first place, how it can affect performance and, finally, how to monitor and manage it.

Why it's there

In PostgreSQL, MVCC - which stands for MultiVersion Concurrency Control - provides each database transaction with a consistent snapshot of the data. (It's important to remember that if no explicit transaction has been opened, each query is a transaction of its own.) This means that changes made in one transaction will become visible to other transactions only after their "origin" transaction successfully commits. Actually, it also depends on the other transactions' isolation level, but to keep this article simple, let's not worry about that.

What's really amazing about MVCC is that it manages to deal with quite a complex task like concurrency control using a very simple and clean design.

In order to understand why databases get bloated, it's important to get a bit more familiar with MVCC, so let's have a look.

compose (session 1) => create table test as select id from generate_series(1,1000) as id;
SELECT 1000
compose (session 1) => begin;
BEGIN
compose (session 1) => select txid_current();
 txid_current
--------------
         1844
(1 row)

compose (session 1) => select xmin, xmax, id from test limit 5;
 xmin | xmax | id
------+------+----
 1843 |    0 |  1
 1843 |    0 |  2
 1843 |    0 |  3
 1843 |    0 |  4
 1843 |    0 |  5
(5 rows)

As you can see, in database session 1 I've created a table called test with only one integer column, id, and filled it with 1000 rows. Next, I've opened an explicit transaction, checked its identifier and selected the first 5 rows of the test table.

Apart from the table's id column, I've instructed the query to also print two system columns: xmin and xmax . The former stores the identifier of the transaction that inserted the row, and the latter the identifier of the transaction that deleted it. If the row hasn't been deleted, xmax equals 0 .

The explicit transaction I've opened has been assigned identifier 1844 . It is greater than 1843 , the identifier of the (already committed) transaction that inserted the rows, so this particular transaction is able to see them.

Now please observe what happens in a second database session.

compose (session 2) => begin;
BEGIN
compose (session 2) => select txid_current();
 txid_current
--------------
         1846
(1 row)

compose (session 2) => select xmin, xmax, id from test limit 2;
 xmin | xmax | id
------+------+----
 1843 |    0 |  1
 1843 |    0 |  2
(2 rows)

Nothing surprising here, I hope that's clear.

Let's delete a tuple in the first session (remember that we still have a transaction opened there).

compose (session 1) => delete from test where id = 1;
DELETE 1
compose (session 1) => select xmin, xmax, id from test where id = 1;
 xmin | xmax | id
------+------+----
(0 rows)

And right after that in session 2:

compose (session 2) => select xmin, xmax, id from test where id = 1;
 xmin | xmax | id
------+------+----
 1843 | 1844 |  1
(1 row)

Although this session's identifier is greater than 1844 , it can still see the deleted row, since the deleting transaction hasn't committed its work yet.

This is really important - even though the row has been deleted, it has not been physically removed from the data page. It can't be removed, since some transactions may still see it in their snapshots as a standard, live row. The same goes for updates, since an UPDATE statement is, under the hood, nothing but a DELETE and an INSERT combined.
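
The visibility rule at work in the sessions above can be sketched in a few lines of Python. This is a deliberately simplified model for illustration only - real PostgreSQL also consults commit status via the CLOG, hint bits and full snapshots, and transaction IDs wrap around - but it captures why session 2 keeps seeing the deleted row until session 1 commits:

```python
def is_visible(row, my_txid, committed):
    """Simplified MVCC visibility check for a row with (xmin, xmax).

    A row is visible when its inserting transaction (xmin) is our own
    or already committed, and it has not been deleted (xmax != 0) by a
    transaction that is our own or already committed.
    """
    xmin, xmax = row
    inserted = xmin == my_txid or xmin in committed
    deleted = xmax != 0 and (xmax == my_txid or xmax in committed)
    return inserted and not deleted

# The article's scenario: row (xmin=1843, xmax=1844), session 2 runs as txid 1846.
committed = {1843}            # 1844 has deleted the row but hasn't committed yet
print(is_visible((1843, 1844), 1846, committed))  # True - session 2 still sees it

committed.add(1844)           # session 1 commits
print(is_visible((1843, 1844), 1846, committed))  # False - now it's a dead tuple
```

Once no running transaction can return True for a given row version, that version is dead weight on the page - exactly what VACUUM goes looking for.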

Now it's time to clean up:

compose (session 1) => commit;
COMMIT
compose (session 2) => select xmin, xmax, id from test where id = 1;
 xmin | xmax | id
------+------+----
(0 rows)

After the transaction run in session 1 commits, the transaction running in session 2 immediately ceases to see the deleted row.

compose (session 2) => vacuum verbose test;
INFO:  vacuuming "public.test"
INFO:  "test": removed 1 row versions in 2 pages
INFO:  "test": found 1 removable, 999 nonremovable row versions in 5 out of 5 pages
DETAIL:  0 dead row versions cannot be removed yet.

As you can see, after all is done, VACUUM takes care of the deleted tuple; it checks that there are no longer any transactions running that can see this row, and removes it (well, it's a bit more complicated than that, but more on that later). Rows like this are called "dead rows", and if not for VACUUM , they would remain in the database's files forever.

This simple exercise demonstrates the process leading to the formation of what is known as "database bloat". Under certain circumstances, with the autovacuum daemon not being aggressive enough, bloat in heavily-written tables can become a problem that has to be taken care of by the DBA.

The mechanics of MVCC make it obvious why VACUUM exists, and the rate of changes in today's databases makes a good case for the existence of the autovacuum daemon . Although it's common for new PostgreSQL users to be reluctant about leaving autovacuum on (or eager to turn it off, even though it's on by default), the folks at compose.io did a good job.

compose (session 2) => select name, setting from pg_settings where name = 'autovacuum';
    name    | setting
------------+---------
 autovacuum | on
(1 row)

What problems does it cause

Now imagine your table has a lot of dead tuples. Each sequential scan going through the table has to make its way to the live rows, dodging the dead ones - which obviously makes it a lot slower. Each index scan, on the other hand, has to follow index pointers to dead rows just to find out that they're... well, dead. So again - it could've been faster if not for the bloat. (Yes, indexes can also get bloated, but to keep things simple this article covers only table data bloat.)

Another problem is that VACUUM does not actually remove all the dead tuples; it has some limitations. What VACUUM does is simply check whether a tuple is dead and, if so, mark the space it occupies as suitable for reuse. So in most cases it does not return the disk space to the underlying filesystem. The reason for that is simple - data in PostgreSQL data files is organised in pages, each 8kB by default. VACUUM is able to shrink a data file only if the rows deemed dead sit in entirely empty pages at the very end of that file (it's worth mentioning here that the data file is actually stored as a number of 1GB segments).
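
To make the truncation rule concrete, here's a toy model (not PostgreSQL's actual storage code): think of the data file as a sequence of 8kB pages, where only the run of completely empty pages at the end of the file can be given back to the filesystem.

```python
PAGE_SIZE = 8192  # PostgreSQL's default page size in bytes

def truncatable_bytes(pages):
    """How much a VACUUM-style truncation can return to the filesystem.

    `pages` is a list of live-tuple counts, one per page, in file order.
    Only the trailing run of completely empty pages can be cut off;
    empty pages in the middle can merely be reused by future inserts.
    """
    tail = 0
    for live_tuples in reversed(pages):
        if live_tuples:
            break
        tail += 1
    return tail * PAGE_SIZE

# Two empty pages in the middle are stuck; only the last two come off.
print(truncatable_bytes([120, 0, 0, 130, 0, 0]))  # 16384
```

This is why a table whose dead space is scattered across the file stays the same size on disk no matter how many times you run a plain VACUUM.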

I've seen cases where a large table had its rows updated a few times and was then only read from; even though VACUUM had marked the dead rows as suitable for reuse, the data file was twice (or more!) the size of the table's live data (to be exact, the sum of the sizes of the live rows in the table in question).

Reading from heavily bloated tables is also quite problematic in terms of I/O: the more data there is to go through, the more input/output operations are needed.

As you can see, lots of dead rows can cause lots of problems, so it's necessary to at least keep track of the bloat percentage of your tables.

Let's have a look at one simple scenario:

compose (session 1) => truncate test;
TRUNCATE TABLE
compose (session 1) => insert into test (id) select * from generate_series(1,10000000);
INSERT 0 10000000
compose (session 1) => \d+
                    List of relations
 Schema | Name | Type  | Owner | Size   | Description
--------+------+-------+-------+--------+-------------
 public | test | table | admin | 346 MB |
(1 row)

compose (session 1) => explain (buffers, analyse) select * from test where id = 9999999;
                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..176992.00 rows=1 width=4) (actual time=855.088..855.089 rows=1 loops=1)
   Filter: (id = 9999999)
   Rows Removed by Filter: 9999999
   Buffers: shared hit=16202 read=28046
 Planning time: 0.043 ms
 Execution time: 855.120 ms
(6 rows)

I've truncated our test table and inserted 10,000,000 new rows. Then, I've selected one of them. It took some time, since there's no index and PostgreSQL had to go through all the data pages to find the requested row. What's most interesting in this exercise are the hit and read buffer counters; they show how many 8kB pages were found in shared buffers (hit) and how many had to be read from disk (read).
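
Those counters are easy to sanity-check: hit plus read should equal the total number of 8kB pages in the table, which in turn should match the size reported by \d+ above.

```python
# Buffer counters from the EXPLAIN (BUFFERS) output above.
hit, read = 16202, 28046

pages = hit + read                          # every page was touched exactly once
size_mb = pages * 8192 / (1024 * 1024)      # pages are 8kB each

print(pages)                  # 44248 pages
print(f"{size_mb:.0f} MB")    # ~346 MB, matching \d+
```

So the sequential scan really did visit every single page of the table - which is exactly the cost that bloat inflates.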

Now, let's create some bloat!

compose (session 1) => delete from test where id > 1000 and id < 9000000;
DELETE 8998999
compose (session 1) => vacuum analyse verbose test;
INFO:  vacuuming "public.test"
INFO:  "test": removed 8998999 row versions in 39820 pages
INFO:  "test": found 8998999 removable, 1001001 nonremovable row versions in 44248 out of 44248 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
0 pages are entirely empty.
CPU 0.50s/1.05u sec elapsed 2.11 sec.
INFO:  analyzing "public.test"
INFO:  "test": scanned 30000 of 44248 pages, containing 671315 live rows and 0 dead rows; 30000 rows in sample, 993641 estimated total rows
VACUUM
compose (session 1) => \d+
                    List of relations
 Schema | Name | Type  | Owner | Size   | Description
--------+------+-------+-------+--------+-------------
 public | test | table | admin | 346 MB |
(1 row)

As you can see, I've deleted most of the rows and then run a VACUUM, which reported that no pages are entirely empty, so no pages have been removed and the table size stays the same. The disk space occupied by the deleted rows has now been marked as available for reuse, but it is still part of this table's data file.

Let's see what happens if I run the same SELECT now.

compose (session 1) => explain (buffers, analyse) select * from test where id = 9999999;
                                              QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..56668.51 rows=1 width=4) (actual time=178.370..178.372 rows=1 loops=1)
   Filter: (id = 9999999)
   Rows Removed by Filter: 1001000
   Buffers: shared hit=16219 read=28029
 Planning time: 0.053 ms
 Execution time: 178.409 ms
(6 rows)

It is faster than the previous one, but notice that the hit and read buffer counters are more or less the same as during the previous run. What this means is that this time PostgreSQL read the buffers not directly from disk, but from the OS cache. Basically, the amount of work PostgreSQL had to do was pretty much the same as before.

What can we do about it?

compose (session 1) => vacuum full analyse verbose test;
INFO:  vacuuming "public.test"
INFO:  "test": found 0 removable, 1001001 nonremovable row versions in 44248 pages
DETAIL:  0 dead row versions cannot be removed yet.
CPU 0.16s/0.36u sec elapsed 0.58 sec.
INFO:  analyzing "public.test"
INFO:  "test": scanned 4430 of 4430 pages, containing 1001001 live rows and 0 dead rows; 30000 rows in sample, 1001001 estimated total rows
VACUUM
compose (session 1) => \d+
                   List of relations
 Schema | Name | Type  | Owner | Size  | Description
--------+------+-------+-------+-------+-------------
 public | test | table | admin | 35 MB |
(1 row)

VACUUM FULL is one of the ways of removing bloat. It essentially rewrites the whole table (holding an AccessExclusiveLock while doing it). Much has been said about why not to use VACUUM FULL if there are other ways of dealing with bloat, but it's perfect for our simple exercise.

Thanks to a full rewrite, table size went down from ~350MB to 35MB.

Let's select the same row now.

compose (session 1) => explain (buffers, analyse) select * from test where id = 9999999;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Seq Scan on test  (cost=0.00..16942.51 rows=1 width=4) (actual time=86.967..86.969 rows=1 loops=1)
   Filter: (id = 9999999)
   Rows Removed by Filter: 1001000
   Buffers: shared hit=32 read=4398
 Planning time: 0.125 ms
 Execution time: 87.005 ms
(6 rows)

Thanks to removing the bloat we created, PostgreSQL now has much less data to go through and is able to satisfy our query much faster.

How to monitor it

Monitoring database bloat is pretty simple; the useful check_postgres script can do that for you. Apart from that, there are a few handy queries, and it is also possible to use the pgstattuple module.

My personal favourite is a query by ioguix.

Let me show you its output on my test database.

compose (session 1) => truncate test;
TRUNCATE TABLE
compose (session 1) => insert into test (id) select * from generate_series(1,10000000);
INSERT 0 10000000
compose (session 1) => delete from test where id > 1000 and id < 9000000;
DELETE 8998999
compose (session 1) => \e
 current_database | schemaname | tblname | real_size | extra_size |   extra_ratio    | fillfactor | bloat_size |   bloat_ratio    | is_na
------------------+------------+---------+-----------+------------+------------------+------------+------------+------------------+-------
 compose          | public     | test    | 362479616 |  324009984 | 89.3870909419635 |        100 |  324009984 | 89.3870909419635 | f
(1 row)
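
The figures are easy to sanity-check: bloat_ratio is simply bloat_size expressed as a percentage of real_size (numbers taken from the query output above):

```python
# Values reported by the bloat-estimation query, in bytes.
real_size = 362_479_616    # on-disk size of the table
bloat_size = 324_009_984   # estimated reclaimable (bloated) portion

bloat_ratio = bloat_size / real_size * 100
print(f"{bloat_ratio:.13f}")  # 89.3870909419635, matching the bloat_ratio column
```

In other words, almost 90% of the table's data file is dead space after that mass delete.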

How to deal with it

As already mentioned, the traditional way of dealing with bloat is using VACUUM FULL . This isn't really advised in production environments, mainly because of heavy locking: this utility query has to acquire, and hold throughout its duration, the strongest lock there is - AccessExclusiveLock . It basically takes the whole table out of any operation until it finishes.

But fortunately there are better ways of removing bloat nowadays. Basically, it can be done in one of two ways.

Firstly, we can rewrite the table as VACUUM FULL does, but in a more graceful way. The approach pg_repack takes is simple, yet very effective: create an identical but empty table, place some triggers on the original one and copy its contents to the new one. The triggers take care of synchronising all the changes (including new rows) to the intermediate table while the process is running; after it's done, pg_repack recreates all the indexes and swaps the two tables. This way the original table isn't totally locked and there's only a brief outage when the tables have to be swapped. One downside, though, is that pg_repack requires extra disk space for the intermediate table.

Secondly, we can force a table reorganisation that reuses the space marked as available by previous VACUUM runs. The great pgcompact script does that using the idea presented by depesz in two blog posts. Basically, knowing which pages in our data file have gaps, we can execute so-called "empty updates", forcing PostgreSQL to create new row versions and place them in the gaps we want to fill. Head to depesz's posts for a detailed description. One other thing worth mentioning is that pgcompact is smart enough to prevent any triggers you might have from firing, by setting session_replication_role to replica . Neat!
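
The gap-filling idea can be modelled in a few lines of Python. This is a toy simulation of what those "empty updates" achieve - not how pgcompact actually talks to PostgreSQL: rows living in tail pages get rewritten, MVCC places the new versions into free space in earlier pages, and the now-empty tail becomes truncatable by a plain VACUUM.

```python
def compact(pages, capacity):
    """Toy model of pgcompact's 'empty update' trick.

    `pages` is a list of live-tuple counts per page and `capacity` is
    how many tuples fit in one page. Tuples are moved from the last
    pages into gaps in the earliest pages until the tail is empty,
    mimicking UPDATEs that relocate row versions into reusable space.
    """
    pages = list(pages)
    head, tail = 0, len(pages) - 1
    while head < tail:
        free = capacity - pages[head]
        if free == 0:
            head += 1          # this page is full; move on
            continue
        moved = min(free, pages[tail])
        pages[head] += moved   # "empty update" lands in an earlier gap
        pages[tail] -= moved
        if pages[tail] == 0:
            tail -= 1          # tail page emptied - VACUUM can truncate it
    return pages

print(compact([3, 0, 0, 2, 1], capacity=4))  # [4, 2, 0, 0, 0]
```

After the simulated pass, all live tuples sit at the front of the file and the trailing empty pages can be returned to the filesystem - the same end state VACUUM FULL reaches, but without the exclusive lock.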

Let's have a look at clearing bloat (using pgcompact ) on my test database hosted at compose.io.

$ ./pgcompact -v info -h aws-eu-west-1-portal.2.dblayer.com -p 10392 -U admin -W my_secret_password -d compose -t test
Sat Mar 12 17:05:10 2016 ERROR A database error occurred, exiting:
DatabaseChooserError Can not find an adapter amongst supported:
DatabaseError Can not execute command: SET lc_messages TO 'C'; SET session_replication_role TO replica; SET statement_timeout TO '0'; SET synchronous_commit TO off;
ERROR:  permission denied to set parameter "lc_messages"
DatabaseError Can not execute command: SET lc_messages TO 'C'; SET session_replication_role TO replica; SET statement_timeout TO '0'; SET synchronous_commit TO off;
ERROR:  permission denied to set parameter "lc_messages"
DatabaseError Can not execute command: SET lc_messages TO 'C'; SET session_replication_role TO replica; SET statement_timeout TO '0'; SET synchronous_commit TO off; SELECT 1;
ERROR:  permission denied to set parameter "lc_messages"
ERROR:  permission denied to set parameter "session_replication_role"

Oh darn! Looks like we need superuser privileges to set some of the parameters.

Let's try pg_repack .

$ ./pg_repack -E info -h aws-eu-west-1-portal.2.dblayer.com -p 10392 -U admin -d compose -t test
ERROR: pg_repack failed with error: You must be a superuser to use pg_repack

Ouch! The thing is that pg_repack modifies system catalogues directly in order to swap the two tables, and this can only be done by a superuser.

Fortunately, one of depesz's posts covers going through the process manually, and there's no reason it shouldn't work.

Further reading

Some resources worth taking a look at: