Monitoring Postgres Logical Replication

Snoopy - Our Solution to Replication Lag Variability

Locked Out

It’s 3:00 AM. Your phone goes off, but you silence it because you need more sleep to recover from the night before. It goes off again — apparently it’s something serious. You wipe the sleep from your eyes, get out of bed, and turn on your laptop. There are 15 other people already working to get access to the database. Eventually you get in and discover it has run out of disk space (no wonder you couldn’t even log in). You think to yourself we have disk space monitoring, why didn’t this trigger alerts hours ago? No time for that now — we need to find what’s using all that space and fix it — fast.

When it comes to monitoring logical replication, the big questions are all centered around the impact of lag variability and how quickly things can go bad in our production databases. At Shipt, we designed Snoopy (named after a deterministic investigator on the show The Venture Bros. — a show particularly loved by the DBA team here) to gather as much information and intelligence as possible about these scenarios. As the data began to pile up, the answers to our questions started to materialize.

What is Logical Replication anyway and why does it matter?

Logical replication is Postgres' implementation of change data capture — specifically, replicating data objects and their changes. The subscriber pulls changes from the publisher as they are committed, which preserves transactional consistency since changes are applied in the same order they occurred. The replication configuration allows you to specify whether an initial snapshot of the data is copied over. There are many valid use cases for logical replication (for instance, in one specific implementation we are sharing a subset of data between a processing database, an application, and reporting databases). Regardless of the implementation, the monitoring needs are similar for all uses.

Write-ahead log (WAL) segments build up on the publisher until they have been acknowledged by the subscriber. If the database workload is write-heavy, WAL can accumulate quickly and create a disk-space emergency on the publisher. And if production systems rely on the data being reasonably up to date, the same lag causes a staleness problem when querying the subscriber.

Logical Replication at Shipt

Shipt’s implementation has several monitoring requirements that include after-hours alerting, the ability to page separate teams, tracking lag trends, and monitoring multiple subscriptions across several Postgres databases. Since these custom monitoring requirements aren’t met by any pre-packaged product, we developed our own monitoring solution specifically for logical replication implementations.

When I was first approached to develop this system, the goal was simple: monitor our logical replication implementations. What became apparent was how big a task that was and how little I knew about monitoring logical replication. With a quickly approaching deadline to take this system to production, I wanted to develop something that could be easily adapted and modified as I continued learning about Postgres logical replication. Project Snoopy was born and began monitoring our systems in staging almost immediately. As each day passed, it was adapted into what it is now. It currently monitors multiple logical replication subscriptions, and I know I've only scratched the surface of what's possible.

Since Snoopy is monitoring a production system, we need a way to monitor the monitoring software. We are currently using a heartbeat system to ensure that Snoopy is online and functioning. If the system misses 3 consecutive heartbeats, it triggers a P1 alert to the on-call DBA (a P1 will page 24/7 until someone acknowledges the alert). This alert is configured in an alerts.yaml file.
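The missed-heartbeat check described above can be sketched as follows. This is a minimal illustration, not Snoopy's actual code; the 60-second interval and the function names are assumptions for the example.

```python
# Minimal sketch of a "3 consecutive missed heartbeats -> page" check.
# The interval and threshold values here are illustrative assumptions.
HEARTBEAT_INTERVAL = 60.0    # seconds between expected heartbeats (assumed)
MISSED_BEATS_FOR_P1 = 3      # consecutive misses before triggering a P1

def missed_beats(last_heartbeat_ts: float, now: float,
                 interval: float = HEARTBEAT_INTERVAL) -> int:
    """Number of whole heartbeat intervals elapsed since the last posted beat."""
    return max(0, int((now - last_heartbeat_ts) // interval))

def should_page(last_heartbeat_ts: float, now: float) -> bool:
    """True once 3 consecutive heartbeats have gone missing."""
    return missed_beats(last_heartbeat_ts, now) >= MISSED_BEATS_FOR_P1
```

In a real deployment this check would itself run somewhere independent of Snoopy, since a dead monitor cannot report its own death.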

Monitoring Logical Replication

We wanted our solution to be flexible enough to handle both the built-in functionality as well as extensions of it, so we focused on the two primary views that Postgres offers with information regarding logical replication subscriptions:

pg_replication_slots

pg_stat_replication

(I will note that pglogical also includes its own tables that may offer more detail or the same information.)

While these views offer everything needed for monitoring, they only provide an instantaneous snapshot of the replication status. We want to be able to alert on historical trends in lag as well.

Monitoring Historical Trends

By polling the views and storing the results, along with identifying information, in a data store, we move beyond instantaneous snapshots and gain history. The pg_stat_replication view describes the currently running WAL sender processes, which lets us monitor replication lag from the publisher's side. The log sequence number (LSN) is a hexadecimal value identifying a position in the WAL stream. When the subscriber replays WAL, it sends statistics back to the publisher indicating what has and hasn't been replayed; the difference between positions gives us the lag in bytes.

The pg_stat_replication view contains one row per connection being used for streaming replication and includes helpful information for filtering subscriptions when monitoring. We use the application_name and usename columns to identify our subscriptions. Beyond other connection-specific details, the view provides four columns of type pg_lsn that are vital for monitoring lag. From Postgres' documentation:

sent_lsn: Last write-ahead log location sent on this connection

write_lsn: Last write-ahead log location written to disk by this standby server

flush_lsn: Last write-ahead log location flushed to disk by this standby server

replay_lsn: Last write-ahead log location replayed into the database on this standby server
These values give us WAL positions that can be compared to the publisher's current LSN to calculate lag in bytes. Glancing at the view definition in the Postgres documentation, you're probably wondering about the write_lag, flush_lag, and replay_lag columns. In our experience, these values are NULL with logical replication. The documentation provides more detail on the other columns in this view.

(Note: prior to Postgres 10, these columns and the functions below have slightly different names; for example, sent_location instead of sent_lsn and pg_xlog_location_diff instead of pg_wal_lsn_diff.)

How we handle multiple versions of Postgres

Postgres provides a function for comparing pg_lsn values to calculate the distance between two locations. The function pg_wal_lsn_diff takes two pg_lsn values and returns a value of type numeric. We use pg_current_wal_lsn() to store the publisher's current LSN with each poll of the pg_stat_replication view. Taking the difference between the current LSN and replay_lsn gives us instantaneous lag, with the added bonus of being able to run analytical queries over the stored data. That is especially helpful when alerting on lag trends. As an example, we want to alert if lag grows larger than 200GB, but also if lag keeps growing without ever dropping. The latter could indicate an issue even when the workload is low enough that total lag has not reached our initial threshold.
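To make the arithmetic concrete, here is what pg_wal_lsn_diff computes, sketched in Python. A textual pg_lsn like '16/B374D848' encodes a 64-bit byte position: the part before the slash is the high 32 bits, the part after is the low 32 bits. (This mirrors Postgres' behavior; in practice you would just call the SQL function.)

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' into a 64-bit byte position.
    The hex field before the slash is the high 32 bits; after it, the low 32."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff_bytes(current_lsn: str, replay_lsn: str) -> int:
    """Replication lag in bytes, equivalent to
    SELECT pg_wal_lsn_diff(current_lsn, replay_lsn)."""
    return parse_lsn(current_lsn) - parse_lsn(replay_lsn)
```

Polling pg_current_wal_lsn() alongside replay_lsn and storing both means this subtraction can be run later, over any historical window, not just at poll time.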

While the pg_replication_slots view offers several useful columns for monitoring, we are simply using it to monitor the active column. The documentation defines the column as true if the slot is currently being used. In our experience with database-to-database streaming replication, this value should always remain true, so we trigger a P1 alert if it changes to false. It should be mentioned that we have seen some third-party consumer processes flip this flag between true and false, which lets us know when the logs are being applied by those consumer services. We filter on slot_name to provide specialized alerting for the third-party consumers that behave this way. Polling also lets us look at the total amount of WAL generated and alert when those consumers haven't applied it after a certain amount of time, which could indicate an issue with the third party.

Observability

Polling these system views into a separate data store gives us the ability to build dashboards that are accessible and interpretable by non-technical users. These dashboards let us monitor easily from our phones and provide a nice first-glance view of all currently active subscriptions. While we have determined that a single polling interval is sufficient for our current subscriptions, the system allows for custom polling intervals.

A view we use for monitoring lag over time

Alerting

Another advantage of having historical data is that the system can analyze it and incorporate rolling thresholds for alerting. For example, if we know a large data import is running over the weekend, we may allow a larger lag threshold, knowing the system will recover during lower-activity periods. Conversely, if lag grows to 20–40GB during a normally quiet period, it might indicate a degradation somewhere in the replication process. While this is still in active development, we need a large enough data set to determine these trends. Eventually, the system will learn how each subscription's lag grows and shrinks at different times. Ideally, it will even learn how quickly a subscription can recover and compare that to its workload. This is critical for determining how much headroom a particular subscription has before it needs to be scaled.
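The two alert conditions described earlier, an absolute threshold and ever-growing lag, can be sketched over the stored samples like this. The 200GB hard limit comes from the example above; the six-sample trend window is an assumption for illustration, not Snoopy's tuned value.

```python
def lag_alert(samples, hard_limit_bytes=200 * 1024**3, trend_window=6):
    """Evaluate a chronological list of lag samples (bytes) from successive polls.

    Returns 'hard' if the latest sample exceeds the absolute threshold,
    'trend' if lag grew at every one of the last trend_window polls
    (growth without ever dropping), otherwise None.
    The 200GB limit matches the article; trend_window=6 is an assumption."""
    if samples and samples[-1] >= hard_limit_bytes:
        return "hard"
    recent = samples[-trend_window:]
    if len(recent) == trend_window and all(a < b for a, b in zip(recent, recent[1:])):
        return "trend"
    return None
```

A rolling-threshold version would additionally compare the latest sample against historical lag for the same day and hour, which is exactly what the accumulated data store enables.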

The pg_stat_replication view is great at revealing what the actual problem might be. The monitoring system uses the view to calculate lag at different steps of the replication process and route alerts to the appropriate teams. For example, if the gap between the publisher's current LSN and the sent_lsn value is growing, it might indicate a problem with the network connection.
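This per-stage breakdown can be sketched by differencing adjacent LSN columns from a single poll. The key names below are illustrative, not Postgres column names (the view's own write_lag/flush_lag/replay_lag columns are time-based and, as noted above, NULL in our logical replication experience).

```python
def stage_gaps_bytes(current_lsn, sent_lsn, write_lsn, flush_lsn, replay_lsn):
    """Break total lag into per-stage byte gaps from one poll of
    pg_stat_replication plus pg_current_wal_lsn(). Inputs are textual
    pg_lsn values like '0/3000060'. A growing send gap points at the
    network/sender; later gaps point at the subscriber's apply path."""
    def to_bytes(lsn):
        hi, lo = lsn.split("/")
        return (int(hi, 16) << 32) | int(lo, 16)

    cur, sent, wrote, flushed, replayed = map(
        to_bytes, (current_lsn, sent_lsn, write_lsn, flush_lsn, replay_lsn))
    return {
        "send_gap": cur - sent,           # generated but not yet sent
        "write_gap": sent - wrote,        # sent but not yet written
        "flush_gap": wrote - flushed,     # written but not yet flushed
        "replay_gap": flushed - replayed, # flushed but not yet applied
    }
```

Routing then becomes a matter of which gap is growing: the send gap to the network/infrastructure team, the replay gap to whoever owns the subscriber.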

What’s next?

As Snoopy's data store populates with enough data, we can develop algorithms to analyze lag over time and start implementing rolling thresholds for alerting. One of the main issues with this feature is that lag during the initial sync/load can be significant, and it shouldn't inform the monitoring thresholds going forward. The current leading idea is simply to exclude the first week of data from the analysis.

If our on-call management software changes, we can quickly update Snoopy to use a new API. If a new service is introduced, it can be monitored within seconds. We can temporarily adjust alerting thresholds. We can generate reports or send real-time lag statistics to any number of graphing or communication tools like Grafana or Slack.

Replication lag graph from Grafana

More to Learn

One of the top requests from our business users is logic to convert bytes of lag into real-world time. If anyone has a methodology for doing this, please let me know in the comments.

I hope this article helps you monitor your Postgres logical replication solutions, and I’d love to hear about solutions you’ve designed. If these solutions interest you or you’re interested in joining a great team that thinks outside of the box, we’re hiring!