As described by Magnus Hagander during his great talk about “PostgreSQL Replication in 2018” at the last FOSDEM event, PAF (aka. “PostgreSQL Automatic Failover” : http://clusterlabs.github.io/PAF/) is a Resource Agent providing service High Availability for PostgreSQL, based on Pacemaker and Corosync.

If you have good system skills and wish a reliable way of having automatic failover, you should definitively consider using PAF!

Let’s see in this post how to quickly install it.

Introduction

Pacemaker is nowadays one of the best references for High Availability. The Pacemaker + Corosync stack is able to detect failures on various services and automatically decide to failover the failing resource to another node when possible.

To be able to manage a specific service resource, Pacemaker interact with it through a so-called “Resource Agent”. A Resource Agent is an external program that abstracts the service it provides and present a consistent view to the cluster.

PostgreSQL Automatic Failover (aka. PAF : http://clusterlabs.github.io/PAF/) is a Resource Agent dedicated to PostgreSQL. Its original wish is to keep a clear limit between the Pacemaker administration and the PostgreSQL one, to keep things simple, documented and yet powerful.

Once your PostgreSQL cluster built using internal Streaming Replication, PAF is able to expose to Pacemaker what is the current status of the PostgreSQL instance on each node: master, slave, stopped, catching up, etc. Should a failure occur on the master, Pacemaker will try to recover it by default. Should the failure be non-recoverable, PAF allows the slaves to be able to elect the best of them (the closest one to the old master) and promote it as the new master. All of this thanks to the robust, feature-full and most importantly experienced project: Pacemaker.

Fencing

Fencing is one of the mandatory piece you need when building an highly available cluster for your database.

It’s the ability to isolate a node from the cluster.

Should an issue happen where the master does not answer to the cluster, successful fencing is the only way to be sure what is its status: shutdown or not able to accept new work or touch data. It avoids countless situations where you end up with split brain scenarios or data corruption

The documentation provides best practices and examples : http://clusterlabs.github.io/PAF/fencing.html

Quick start

The documentation also provides a few quick starts : http://clusterlabs.github.io/PAF/documentation.html

We’ll here focus on the Pacemaker administration part and assume the PostgreSQL Streaming Replication is working and correctly configured.

The resource agent requires the PostgreSQL instances to be already set up, ready to start and slaves ready to replicate.

Moreover, it requires a recovery.conf template ready to use.

You can create a recovery.conf file suitable to your needs, the only requirements are:

have standby_mode = on

have recovery_target_timeline = ‘latest’

a primary_conninfo with an application_name set to the node name

In case you configure a virtual IP (called pgsql-vip in this post) on the server hosting the master PostgreSQL instance, make sure each instance will not be able to replicate with itself (in the pg_hba.conf )!

Initial steps

For this article, I created 2 VMs (CentOS 7), using qemu-kvm through the virt-manager user interface.

# virsh list --all Id Name State ---------------------------------------------------- 1 server1 running 2 server2 running

I also installed (from the PGDG repository) and set up a PostgreSQL 10 cluster with Streaming Replication between those 2 servers.

The recovery.conf.pcmk template contains:

$ cat ~postgres/recovery.conf.pcmk standby_mode = on primary_conninfo = 'host=pgsql-vip application_name=server2' recovery_target_timeline = 'latest'

Pacemaker cluster preparation

Pacemaker installation:

# yum install -y pacemaker resource-agents pcs fence-agents-all fence-agents-virsh

We’ll later create one fencing resource per node to fence. They are called fence_vm_xxx and use the fencing agent fence_virsh, allowing to power on or off a virtual machine using the virsh command through a ssh connexion to the hypervisor. You’ll need to make sure your VMs are able to connect as root (it is possible to use a normal user with some more setup though) to your hypervisor.

Install the latest PAF version, directly from the PGDG repository:

# yum install -y resource-agents-paf

It is advised to keep Pacemaker off on server boot. It helps the administrator to investigate after a node fencing before Pacemaker starts and potentially enters in a death match with the other nodes. Make sure to disable Corosync as well to avoid unexpected behaviors.

Run this on all nodes:

# systemctl disable corosync # systemctl disable pacemaker

Let’s use the cluster management tool pcsd , provided by RHEL, to ease the creation and setup of a cluster.

It allows to create the cluster from command line, without editing configuration files or XML by hands.

pcsd uses the hacluster system user to work and communicate with other members of the cluster.

# passwd hacluster # systemctl enable pcsd # systemctl start pcsd

Now, authenticate each node to the other ones using the following command:

# pcs cluster auth server1 server2 -u hacluster Password: server1: Authorized server2: Authorized

Create and start the cluster:

# pcs cluster setup --name cluster_pgsql server1 server2 Destroying cluster on nodes: server1, server2... server1: Stopping Cluster (pacemaker)... server2: Stopping Cluster (pacemaker)... server1: Successfully destroyed cluster server2: Successfully destroyed cluster Sending 'pacemaker_remote authkey' to 'server1', 'server2' server1: successful distribution of the file 'pacemaker_remote authkey' server2: successful distribution of the file 'pacemaker_remote authkey' Sending cluster config files to the nodes... server1: Succeeded server2: Succeeded Synchronizing pcsd certificates on nodes server1, server2... server1: Success server2: Success Restarting pcsd on the nodes in order to reload the certificates... server1: Success server2: Success # pcs cluster start --all server2: Starting Cluster... server1: Starting Cluster...

Check the cluster status:

# pcs status Cluster name: cluster_pgsql WARNING: no stonith devices and stonith-enabled is not false Stack: corosync Current DC: server1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum Last updated: ... Last change: ... by hacluster via crmd on server2 2 nodes configured 0 resources configured Online: [ server1 server2 ] No resources Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled

Now the cluster run, let’s start with some basic setup of the cluster.

Run the following command from one node only (the cluster takes care of broadcasting the configuration on all nodes):

# pcs resource defaults migration-threshold=3 # pcs resource defaults resource-stickiness=10

This sets two default values for resources we’ll create in the next chapter:

migration-threshold: this controls how many time the cluster tries to recover a resource on the same node before moving it on another one.

resource-stickiness: adds a sticky score for the resource on its current node. It helps avoiding a resource move back and forth between nodes where it has the same score.

Node fencing configuration

We can now create one STONITH resource for each node. Each fencing resource will not be allowed to run on the node it is supposed to fence.

Note that in the port argument of the following commands, server[1-2] are the names of the virtual machines as known by libvirtd side and 192.168.122.1 is the IP of the qemu-kvm hypervisor.

# pcs cluster cib cluster1.xml # pcs -f cluster1.xml stonith create fence_vm_server1 fence_virsh \ pcmk_host_check = "static-list" pcmk_host_list = "server1" \ ipaddr = "192.168.122.1" login = "root" port = "server1" \ action = "reboot" identity_file = "/root/.ssh/id_rsa" # pcs -f cluster1.xml stonith create fence_vm_server2 fence_virsh \ pcmk_host_check = "static-list" pcmk_host_list = "server2" \ ipaddr = "192.168.122.1" login = "root" port = "server2" \ action = "reboot" identity_file = "/root/.ssh/id_rsa" # pcs -f cluster1.xml constraint location fence_vm_server1 avoids server1=INFINITY # pcs -f cluster1.xml constraint location fence_vm_server2 avoids server2=INFINITY # pcs cluster cib-push cluster1.xml

Check the cluster status:

# pcs status Cluster name: cluster_pgsql Stack: corosync Current DC: server1 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum Last updated: ... Last change: ... by root via cibadmin on server1 2 nodes configured 2 resources configured Online: [ server1 server2 ] Full list of resources: fence_vm_server1 (stonith:fence_virsh): Started server2 fence_vm_server2 (stonith:fence_virsh): Started server1 Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled

Cluster resources creation

Now that the fencing is working, we can add all other resources and constraints all together in the same time.

Create a new offline CIB:

# pcs cluster cib cluster1.xml

We’ll create three resources: pgsqld , pgsql-ha , and pgsql-master-ip .

The pgsqld defines the properties of a PostgreSQL instance: where it is located, where are its binaries, its configuration files, how to monitor it, and so on.

# pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \ bindir = /usr/pgsql-10/bin pgdata = /var/lib/pgsql/10/data \ recovery_template = /var/lib/pgsql/recovery.conf.pcmk \ op start timeout = 60s \ op stop timeout = 60s \ op promote timeout = 30s \ op demote timeout = 120s \ op monitor interval = 15s timeout = 10s role = "Master" \ op monitor interval = 16s timeout = 10s role = "Slave" \ op notify timeout = 60s

The pgsql-ha resource controls all the pgsqld PostgreSQL instances in your cluster, decides where the primary is promoted and where the standby is started.

# pcs -f cluster1.xml resource master pgsql-ha pgsqld notify=true

The pgsql-master-ip resource controls the pgsql-vip (192.168.122.50) IP address. It is started on the node hosting the PostgreSQL master resource.

# pcs -f cluster1.xml resource create pgsql-master-ip ocf:heartbeat:IPaddr2 \ ip = 192.168.122.50 cidr_netmask = 24 op monitor interval = 10s

We now define the collocation between pgsql-ha and pgsql-master-ip . The start/stop and promote/demote order for these resources must be asymmetrical: we MUST keep the master IP on the master during its demote process so the standby receive everything during the master shutdown.

# pcs -f cluster1.xml constraint colocation add pgsql-master-ip with master pgsql-ha INFINITY # pcs -f cluster1.xml constraint order promote pgsql-ha then start pgsql-master-ip symmetrical=false kind=Mandatory # pcs -f cluster1.xml constraint order demote pgsql-ha then stop pgsql-master-ip symmetrical=false kind=Mandatory

We can now push our Cluster Information Base (aka. CIB) to the cluster, which will start all the magic stuff:

# pcs cluster cib-push cluster1.xml

Check the cluster status:

# pcs status Cluster name: cluster_pgsql Stack: corosync Current DC: server1 ( version 1.1.16-12.el7_4.5-94ff4df ) - partition with quorum Last updated: ... Last change: ... by root via crm_attribute on server1 2 nodes configured 5 resources configured Online: [ server1 server2 ] Full list of resources: fence_vm_server1 ( stonith:fence_virsh ) : Started server2 fence_vm_server2 ( stonith:fence_virsh ) : Started server1 pgsql-master-ip ( ocf::heartbeat:IPaddr2 ) : Started server1 Master/Slave Set: pgsql-ha [ pgsqld] Masters: [ server1 ] Slaves: [ server2 ] Daemon Status: corosync: active/disabled pacemaker: active/disabled pcsd: active/enabled

And finally, try to connect:

$ psql -h pgsql-vip psql (10.1) Type "help" for help. postgres=# SELECT * FROM pg_stat_replication; -[ RECORD 1 ]----+---------------------------- pid | 1830 usesysid | 10 usename | postgres application_name | server2 client_addr | 192.168.122.77 client_hostname | server2 client_port | 42380 backend_start | ... backend_xmin | 555 state | streaming sent_lsn | 0/50002C0 write_lsn | 0/50002C0 flush_lsn | 0/50002C0 replay_lsn | 0/50002C0 write_lag | flush_lag | replay_lag | sync_priority | 0 sync_state | async postgres=# SELECT pg_is_in_recovery(); pg_is_in_recovery ------------------- f (1 row)

Conclusion

While the automatic failover influence the recovery time objective (RTO), the recovery point objective (RPO) is balanced by the PostgreSQL Streaming Replication.

To make this post as simple as possible, it shows how to build a “2-nodes” cluster. For production, you should consider having at least a “3-nodes” cluster.

The quick guides provided with the PAF project are quite clear and complete.

However, if your PostgreSQL instance is managed by Pacemaker, you should proceed to administration tasks with care.

Pacemaker only uses pg_ctl , and as other tools behave differently, using them could lead to some unpredictable behavior, like an init script reporting that the instance is stopped when it is not.

We’ll see in a future article how to correctly manage the cluster.