November 2013 Replication Data Loss Issue

What is the November 2013 Replication Data Loss Issue?

This is an issue, discovered Nov. 18, 2013., which can cause data corruption on a Hot-Standby replica when it is (re-)started, by marking committed transactions as uncommitted. This issue is fixed in the December 5th 2013 update releases.

Hackers mailing list discussion here

What are the symptoms of the issue?

The primary symptom of this corruption is rows that:

are present on the master, but missing on the replica

have been deleted on the master still appear to be visible on the replica

have been updated, and their old versions appear alongside the new, updated versions on the replica

Any of the above issues can cause key and constraint violations on the corrupted table.

Even if the issue has occurred it will often have no visible effect due to correctly set Hint Bits.

Who is at risk for this issue?

Users who:

are on one of the following PostgreSQL versions: 9.3.0, 9.3.1, 9.2.5, 9.1.10, and 9.0.14. Earlier versions are not affected.

Replicas which:

Have Hot Standby enabled: hot_standby = on in the configuration

Use recovery.conf to configure Streaming-Replication, Archive based Replication or PITR

Have (re-)started the database using any of the affected versions

The likelihood of the issues increases

by restarting the standby frequently

by having many short transaction performing write queries at the point the standby was restarted

by having infrequent checkpoints (high checkpoint_timeout, checkpoint_segments settings)

when data is never read on the primary

when having configured full_page_writes = off

When will this be fixed?

The PostgreSQL project released an update release on December 5th which fixes this issue. We strongly advise all users who are using the builtin replication to apply that update. A patch for the issue has been committed to the git branches of all affected versions.

What can I do to prevent this issue until then?

If you are currently using 9.2.4, 9.1.9 or 9.0.13 and use any form of builtin replication do not install the most recent update. Instead, wait for the next update (9.2.6, 9.1.11 and 9.0.15) to come out.

Options for users who have already updated, or are running 9.3, include:

if you are using 9.2.5, 9.1.10 or 9.0.14, downgrade your replica servers to the prior update release (9.2.4, 9.1.9 or 9.0.13). Packages from apt.postgresql.org for these versions can be found at http://apt.postgresql.org/pub/repos/apt/nov2013replicationissue/

disable read-only access to replicas by setting hot_standby=off until a fixed version has been installed

start your replica by taking a new base backup with write traffic halted on the master (i.e. a downtime)

minimize the number of times you restart your replicas

applying the source code patch which fixes the issue. This requires users to install from source or to build their own packages.

In any case, we recommend that all users who were running streaming replication under one of the affected versions recreate each of their replicas from a fresh base backup, either:

after downgrading the replica, or

after applying the patch, or

after applying the update when it is released, or

after disabling Hot-Standby by setting hot_standby = off, or

after stopping write access to the primary

How can I verify whether I already have this corruption?

There is no known way to identify that the issue has affected a standby in the past but comparing the data from the primary with the standby.