Just like your house gets dirty and requires cleaning from time to time, your PostgreSQL database may accumulate “dust” too. Unwanted pieces of data make the DB grow bigger and bigger. However, there is a mechanism to get rid of junk data and the DB can do it automatically for you.

How Does a Database Get “Dirty”?

Let’s talk a little bit about concurrency in databases. Generally, when multiple users (or more concretely – database transactions) read or modify the data stored in a database at the same time, they may conflict with each other. One user wants to read the row while another one may want to modify it. They do these operations within different transactions so they should be able to do their jobs without worrying about someone else doing something with the same data. It’s the DBMS’s job to handle it. So let’s look at this problem from a DBMS’s perspective. There are a few possible strategies to do this.

The first one is to play with locks. In this case, if one transaction wants to read the row, it must wait until the other transaction finishes updating it. Similarly, if a transaction wants to update the row, it must wait for all transactions reading it to release the lock. A consequence of such behavior is poor performance – all read queries must wait for an update to finish and the other way round – an update query must wait for all read operations to end.

PostgreSQL does this another way. It uses the MVCC (Multi-Version Concurrency Control) approach. The general concept is that each transaction has its own snapshot (or view) of all the data. Rows that are being modified by other transactions remain accessible with the values they had at the time the transaction started. It eliminates the stops caused by locks in the described scenario. The PostgreSQL documentation says:

The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an innovative Serializable Snapshot Isolation (SSI) level.

So how does PostgreSQL handle updates? It simply inserts a new row for each row to be updated and marks the old row as out-of-date. The transactions that started before this updating transaction finishes will still see the old values. The same applies to a delete statement – the row being deleted is just marked as out-of-date. Actually, an update operation is a combination of delete and insert operations.

This approach, however, also has a negative consequence. The space occupied by the old rows is not reused even after all reading and modifying transactions finish. Why? Because every insert operation would need to scan the whole table in order to find this kind of free space in the middle of the table. This would be highly ineffective.

So, in short, that’s why the PostgreSQL database uses much more space than it really needs.

How Do You Clean a Database?

PostgreSQL comes with a VACUUM operation as a solution to this problem. Here’s how it works.

PostgreSQL has a special structure called FSM (Free Space Map). It holds the information about the rows (or pages, more concretely) which are no longer needed (i.e. they are marked as out-of-date and all transactions that could potentially use them are already finished). When a new row is being inserted, DBMS looks for a free space in the FSM first. This way the space will be reused.

The operation that updates FSM is VACUUM. It is very important to run VACUUM frequently, because FSM has limited capacity itself. If VACUUM is run too rarely, some old pages would still remain unused as there would be no space in FSM to keep information about them. It’s important to note that a standard VACUUM operation is non-blocking, so it may be invoked on production database without requiring maintenance downtime. It does require additional resources (it performs many I/O operations) so it may temporarily affect the overall performance, but it doesn’t require exclusive locks and thus doesn’t block active transactions.

There is also a VACUUM FULL operation. Instead of reusing old pages, it creates table structures from scratch. It’s gets rid of all old rows, which results in a better final effect, but, in opposition to standard VACUUM, it exclusively locks the tables and takes much more time. That’s why VACUUM FULL should be used only during maintenance downtimes.

There are two more things worth mentioning.

The first one is that PostgreSQL 8.1 or higher may perform VACUUM automatically. In fact, it is turned on by default since PostgreSQL 8.3 and it’s highly recommended to be left on. The autovacuum daemon looks at the statistics of inserts, updates and deletes and decides when to run VACUUM. To check when VACUUM last ran, just type:

select schemaname, relname, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count from pg_stat_all_tables order by schemaname, relname;

The last thing I’d like to point out is planner statistics. VACUUM allows you to update them so that the planner can do its job better. This is done with the VACUUM ANALYZE operation.

Some Useful Links

You can find more detailed information about VACUUM in PostgreSQL documentation. Here are some links that you may find helpful: