Knowledge of high availability is a must for anybody managing PostgreSQL. It is a topic that we have seen over and over, but that never gets old. In this blog, we are going to review a little bit of the history of PostgreSQL built-in replication features and deep dive into how streaming replication works.

When talking about replication, we will be talking a lot about WALs, so let's briefly review what they are.

Write Ahead Log (WAL)

Write Ahead Logging is a standard method for ensuring data integrity, and it is enabled by default.

The WALs are the REDO logs in PostgreSQL. But, what are the REDO logs?

REDO logs contain all changes that were made in the database and they are used by replication, recovery, online backup and point in time recovery (PITR). Any changes that have not been applied to the data pages can be redone from the REDO logs.

Using WAL results in a significantly reduced number of disk writes, because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.

A WAL record specifies, at a very low level, the changes made to the data. Each WAL record is appended to a WAL file. The insert position is described by a Log Sequence Number (LSN), a byte offset into the logs that increases with each new record.
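To make the LSN arithmetic concrete, here is a small Python sketch (the helper names are our own illustration, not server functions) that converts an LSN string such as 0/50002C8 into an absolute byte offset, mirroring what the built-in pg_wal_lsn_diff() function computes:

```python
def lsn_to_bytes(lsn: str) -> int:
    """A PostgreSQL LSN like '0/50002C8' is a 64-bit WAL byte position
    printed as two 32-bit hexadecimal halves."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) + int(lo, 16)

def lsn_diff(a: str, b: str) -> int:
    """Bytes of WAL between two positions (a - b), like pg_wal_lsn_diff(a, b)."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)
```

For example, lsn_diff('0/50002C8', '0/5000000') returns 712, meaning 712 bytes of WAL separate the two positions.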

The WALs are stored in the pg_xlog (or pg_wal, from PostgreSQL 10) directory, under the data directory. These files have a default size of 16MB (the size can be changed by altering the --with-wal-segsize configure option when building the server). They have a unique, incrementing name composed of three 8-hexadecimal-digit parts (timeline, log and segment), for example: "000000010000000000000001".

The number of WAL files kept in pg_xlog (or pg_wal) depends on the value assigned to the parameter checkpoint_segments (or min_wal_size and max_wal_size, depending on the version) in the postgresql.conf configuration file.

One parameter that we need to set when configuring any PostgreSQL installation is wal_level. It determines how much information is written to the WAL. The lowest value is minimal, which writes only the information needed to recover from a crash or immediate shutdown; replica (the default since 9.6) adds what is required for WAL archiving and for running read-only queries on a standby server (on older versions these were the separate levels archive and hot_standby); and, finally, logical adds the information necessary to support logical decoding. Changing this parameter requires a restart, so it can be hard to change on a running production database if we have forgotten it.
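As a reference, a minimal postgresql.conf fragment for a primary that will feed a standby could look like this (the values are illustrative and should be adjusted to your own setup):

```
# postgresql.conf -- illustrative fragment
wal_level = replica        # 'hot_standby' or 'archive' on 9.5 and older
max_wal_senders = 5        # allows standbys and pg_basebackup to connect
```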

For further information, you can check the official documentation here or here. Now that we’ve covered the WAL, let's review the replication history…

History of Replication in PostgreSQL

The first replication method (warm standby) that PostgreSQL implemented (version 8.2, back in 2006) was based on the log shipping method.

This means that the WAL records are moved directly from one database server to another to be applied there. We can say that it is a continuous PITR.

PostgreSQL implements file-based log shipping by transferring WAL records one file (WAL segment) at a time.

This replication implementation has the downside that if there is a major failure on the primary server, transactions not yet shipped will be lost. So there is a window for data loss (you can tune this by using the archive_timeout parameter, which can be set to as low as a few seconds, but such a low setting will substantially increase the bandwidth required for file shipping).
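For example, continuous archiving on the primary is typically enabled with something like the following (the archive directory is a placeholder path, and the archive_command shown is the classic example from the official documentation):

```
# postgresql.conf on the primary -- illustrative fragment
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'
archive_timeout = 60       # force a segment switch at least every 60 seconds
```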

We can represent this method with the picture below:

PostgreSQL file-based log shipping

So, on version 9.0 (back in 2010), streaming replication was introduced.

This feature allowed us to stay more up-to-date than is possible with file-based log shipping, by transferring WAL records (a WAL file is composed of WAL records) on the fly (record based log shipping), between a master server and one or several slave servers, without waiting for the WAL file to be filled.

In practice, a process called the WAL receiver, running on the slave server, connects to the master server using a TCP/IP connection. On the master server another process exists, named the WAL sender, which is in charge of sending the WAL records to the slave server as they are generated.
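On the standby side (up to PostgreSQL 11), this connection is configured in the recovery.conf file; a minimal, illustrative example could be (host, port and user are placeholders):

```
# recovery.conf on the standby -- illustrative values
standby_mode = 'on'
primary_conninfo = 'host=192.168.100.41 port=5432 user=repl'
```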

Streaming replication can be represented as follows:

PostgreSQL Streaming replication

Looking at the above diagram, we might wonder: what happens when the communication between the WAL sender and the WAL receiver fails?

When configuring streaming replication, we have the option to enable WAL archiving.

This step is actually not mandatory, but it is extremely important for a robust replication setup: it prevents the main server from recycling old WAL files that have not yet been applied to the slave. If this occurs, we will need to recreate the replica from scratch.

When configuring replication with continuous archiving (as explained here), we start from a backup and, to reach a state in sync with the master, we need to apply all the changes recorded in the WAL after the backup was taken. During this process, the standby will first restore all the WAL available in the archive location (by calling restore_command). The restore_command will fail when we reach the last archived WAL record, so after that the standby will look in the pg_wal (pg_xlog) directory to see if the change exists there (this avoids data loss when the master crashes and some changes that have already been streamed to the replica and applied there have not yet been archived).

If that fails and the requested record does not exist there, the standby will start receiving the changes from the master through streaming replication.

Whenever streaming replication fails, the standby goes back to step 1 and restores the records from the archive again. This loop of retries (archive, pg_wal, then streaming replication) goes on until the server is stopped or a failover is triggered by a trigger file.
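The retry loop described above can be sketched in a few lines of Python (this is a toy model of the behavior, not PostgreSQL source code; the function and argument names are our own):

```python
def next_wal_source(archive, pg_wal, streaming_available, segment):
    """Return which source the standby would read `segment` from,
    following the archive -> pg_wal -> streaming order of preference."""
    if segment in archive:          # step 1: restore_command succeeds
        return "archive"
    if segment in pg_wal:           # step 2: local pg_wal / pg_xlog directory
        return "pg_wal"
    if streaming_available:         # step 3: connect to the WAL sender
        return "streaming"
    return "retry"                  # loop back to step 1 until stop or failover
```

For instance, a segment present only in the local pg_wal directory would be read from there, while a segment available nowhere keeps the standby retrying.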

This is a diagram of such a configuration:

PostgreSQL streaming replication with continuous archiving

Streaming replication is asynchronous by default, so at any given moment we can have some transactions that are committed on the master but not yet replicated to the standby server. This implies some potential data loss.

However, this delay between the commit and the changes becoming visible on the replica should be really small (a few milliseconds), assuming of course that the replica server is powerful enough to keep up with the load.

For the cases when even the risk of a small data loss is not tolerable, version 9.1 introduced the synchronous replication feature.

In synchronous replication, each commit of a write transaction waits until confirmation is received that the commit has been written to the write-ahead log on disk on both the primary and the standby server.

This method minimizes the possibility of data loss: for that to happen, both the master and the standby would need to fail at the same time.

The obvious downside of this configuration is that the response time for each write transaction increases, as we need to wait until all parties have responded. So the time for a commit is, at a minimum, the round trip between the master and the replica. Read-only transactions are not affected by this.

To set up synchronous replication, each of the standby servers must specify an application_name in the primary_conninfo of the recovery.conf file: primary_conninfo = '... application_name=slaveX'.

We also need to specify the list of standby servers that will take part in synchronous replication: synchronous_standby_names = 'slaveX,slaveY'.

We can set up one or several synchronous servers, and this parameter also specifies which method (FIRST or ANY) is used to choose synchronous standbys from the listed ones. For more information on how to set up this replication mode, please refer here. It is also possible to set up synchronous replication when deploying via ClusterControl, from version 1.6.1 (which was released at the time of writing).
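Putting the pieces together, the relevant master-side settings could look like this (the standby names are the placeholders used above, and the FIRST syntax requires 9.6 or newer):

```
# postgresql.conf on the master -- illustrative fragment
synchronous_commit = on
synchronous_standby_names = 'FIRST 1 (slaveX, slaveY)'
```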

After we have configured our replication, and it is up and running, we will need to have some monitoring over it.

Monitoring PostgreSQL Replication

The pg_stat_replication view on the master server has a lot of relevant information:

postgres=# SELECT * FROM pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr    | client_hostname | client_port | backend_start                 | backend_xmin | state     | sent_lsn  | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state
-----+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+-----------+-----------+-----------+------------+-----------+-----------+------------+---------------+------------
 994 |    16467 | repl    | walreceiver      | 192.168.100.42 |                 |       37646 | 2018-05-24 21:27:57.256242-03 |              | streaming | 0/50002C8 | 0/50002C8 | 0/50002C8 | 0/50002C8  |           |           |            |             0 | async
(1 row)

Let's see this in detail:

pid: Process ID of the WAL sender process.

usesysid: OID of the user used for streaming replication.

usename: Name of the user used for streaming replication.

application_name: Application name of the client connected to the master.

client_addr: Address of the standby / streaming replication client.

client_hostname: Hostname of the standby.

client_port: TCP port number on which the standby communicates with the WAL sender.

backend_start: Time when the standby connected to the master.

state: Current WAL sender state, i.e. streaming.

sent_lsn: Last transaction location sent to the standby.

write_lsn: Last transaction written to disk on the standby.

flush_lsn: Last transaction flushed to disk on the standby.

replay_lsn: Last transaction replayed on the standby.

sync_priority: Priority of the standby server for being chosen as the synchronous standby.

sync_state: Sync state of the standby (async or sync).
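Combining these columns, a handy query on the master (PostgreSQL 10 naming; on older versions use pg_current_xlog_location() and pg_xlog_location_diff()) reports the replay lag of each standby in bytes:

```
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```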

We can also see the WAL sender/receiver processes running on the servers.

Sender (Primary Node):

# ps aux | grep postgres
postgres   833  0.0  1.6 392032 16532 ?  Ss   21:25   0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   847  0.0  0.1 244844  1900 ?  Ss   21:25   0:00 postgres: logger process
postgres   850  0.0  0.3 392032  3696 ?  Ss   21:25   0:00 postgres: checkpointer process
postgres   851  0.0  0.3 392032  3180 ?  Ss   21:25   0:00 postgres: writer process
postgres   852  0.0  0.6 392032  6340 ?  Ss   21:25   0:00 postgres: wal writer process
postgres   853  0.0  0.3 392440  3052 ?  Ss   21:25   0:00 postgres: autovacuum launcher process
postgres   854  0.0  0.2 247096  2172 ?  Ss   21:25   0:00 postgres: stats collector process
postgres   855  0.0  0.2 392324  2504 ?  Ss   21:25   0:00 postgres: bgworker: logical replication launcher
postgres   994  0.0  0.3 392440  3528 ?  Ss   21:27   0:00 postgres: wal sender process repl 192.168.100.42(37646) streaming 0/50002C8

Receiver (Standby Node):

# ps aux | grep postgres
postgres   833  0.0  1.6 392032 16436 ?  Ss   21:27   0:00 /usr/pgsql-10/bin/postmaster -D /var/lib/pgsql/10/data/
postgres   848  0.0  0.1 244844  1908 ?  Ss   21:27   0:00 postgres: logger process
postgres   849  0.0  0.2 392128  2580 ?  Ss   21:27   0:00 postgres: startup process   recovering 000000010000000000000005
postgres   851  0.0  0.3 392032  3472 ?  Ss   21:27   0:00 postgres: checkpointer process
postgres   852  0.0  0.3 392032  3216 ?  Ss   21:27   0:00 postgres: writer process
postgres   853  0.0  0.1 246964  1812 ?  Ss   21:27   0:00 postgres: stats collector process
postgres   854  0.0  0.3 398860  3840 ?  Ss   21:27   0:05 postgres: wal receiver process   streaming 0/50002C8

One way of checking how up to date our replication is consists of checking the amount of WAL generated on the primary but not yet applied on the standby.

Master:

postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/50002C8
(1 row)

Note: This function is for PostgreSQL 10. For previous versions, you need to use: SELECT pg_current_xlog_location();

Slave:

postgres=# SELECT pg_last_wal_receive_lsn();
 pg_last_wal_receive_lsn
-------------------------
 0/50002C8
(1 row)

postgres=# SELECT pg_last_wal_replay_lsn();
 pg_last_wal_replay_lsn
------------------------
 0/50002C8
(1 row)

Note: These functions are for PostgreSQL 10. For previous versions, you need to use: SELECT pg_last_xlog_receive_location(); and SELECT pg_last_xlog_replay_location();

We can use the following query to get the lag in seconds.

PostgreSQL 10:

SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN 0
            ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;

Previous Versions:

SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
            THEN 0
            ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
       END AS log_delay;

Output:

postgres=# SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
postgres-# THEN 0
postgres-# ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
postgres-# END AS log_delay;
 log_delay
-----------
         0
(1 row)

To deploy streaming replication setups (synchronous or asynchronous), we can use ClusterControl:

Deploying PostgreSQL replication setups

It also allows us to monitor the replication lag, as well as other key metrics.

PostgreSQL Overview

PostgreSQL Topology View

As streaming replication is based on shipping WAL records that are then applied on the standby server, it basically says which bytes to add or change in which file. As a result, the standby server is actually a byte-by-byte copy of the master.

We have here some well known limitations:

We cannot replicate into a different version or architecture.

We cannot change anything on the standby server.

We do not have much granularity on what we can replicate.

So, to overcome these limitations, PostgreSQL 10 added support for logical replication.

Logical Replication

Logical replication also uses the information in the WAL files, but it decodes it into logical changes. Instead of knowing which byte has changed, we will know exactly which data has been inserted into which table.

It is based on a publish and subscribe model, with one or more subscribers subscribing to one or more publications on a publisher node, and looks like this:

PostgreSQL Logical Replication
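The basic setup consists of a publication on the publisher and a subscription on each subscriber; a minimal example could look like this (the publication, subscription and table names, as well as the connection string, are illustrative placeholders):

```
-- On the publisher:
CREATE PUBLICATION my_pub FOR TABLE customers, orders;

-- On each subscriber:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=publisher_host dbname=mydb user=repl'
    PUBLICATION my_pub;
```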

To know more about Logical Replication in PostgreSQL, we can check the following blog.

With this replication option, many use cases become possible, like replicating only some of the tables or consolidating multiple databases into a single one.

What new features will come next? We will have to stay tuned, but we hope that built-in master-master replication is not far away.