Recently one of our PMs flagged to us that our product analytics data was nearly 9 days behind the production database! We quickly found that this was due to MySQL replication lag. But to fix this lag, we had to dive deep into the internals of MySQL replication! In this post, I’ll show you those internals. Along the way, I’ll point out possible causes of replication lag, such as single-threaded replication, choice of binlog format, and disk flushing policy. I’ll show you how to diagnose the problem in your MySQL setup. Finally, I’ll describe how we fixed the replication lag in our case.

At Pusher, we have a product analytics system which looks like this:

As soon as we noticed that our product analytics were stale, we checked our CloudWatch metrics, and noticed that the “lag” metric for the MySQL replica was increasing indefinitely:

We initially guessed that the replica was 9 days out-of-date because replication had stopped 9 days ago. But that wasn’t the case: the lag had been increasing for 14 days, and replication was still running. The problem was that the replica could not keep up with the changes on the master.

But why? How can we investigate it? How can we fix it? Before we can start to answer those questions, we need to understand the key concepts that make up the internals of MySQL replication. And before we go into replication, we need to understand the MySQL binlog.

The binlog

MySQL uses a binlog to implement point-in-time recovery. Let’s say you take periodic backups of your database at points in time. RDS snapshots are an example of this. If the last backup was taken at 12:00, and the database was accidentally deleted at 12:27, how do you restore the updates that happened between 12:00 and 12:27?

This is where the binlog comes in. The binlog is a series of files in the data directory containing all committed updates to the database in chronological order. When a MySQL thread is processing a transaction, it buffers all write operations until it receives a `COMMIT`, then writes the updates to the binlog before applying them to the database itself.

When performing a point-in-time restore, the database is first loaded from a snapshot, then all updates in the binlog are applied up to the time that the database state was deleted.
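For example, restoring the 12:00–12:27 window above might look roughly like this on a self-managed MySQL (the file names and stop time are placeholders; on RDS the point-in-time restore feature does this for you):

```shell
# 1. Restore the 12:00 snapshot into a fresh instance.
mysql -u root -p < backup-1200.sql

# 2. Replay binlog events up to just before the accidental deletion.
mysqlbinlog --stop-datetime="2020-03-09 12:27:00" \
    /var/lib/mysql/mysql-bin.000001 | mysql -u root -p
```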

As the name implies, the binlog is in a binary format, but you can view it as ASCII SQL statements with `mysqlbinlog`. Here’s an example of running it on a binlog on my machine with `mysqlbinlog /var/lib/mysql/mysql-bin.000001`:

```
# at 1476
# 200309 17:06:29 server id 1  end_log_pos 1583 CRC32 0xc20ce1f9  Query  thread_id=1  exec_time=0  error_code=0
SET TIMESTAMP=1583773589/*!*/;
BEGIN
/*!*/;
# at 1583
# at 1622
# 200309 17:06:29 server id 1  end_log_pos 1622 CRC32 0x068444cd  Rand
SET @@RAND_SEED1=313260617, @@RAND_SEED2=516001561/*!*/;
# 200309 17:06:29 server id 1  end_log_pos 1785 CRC32 0xb45a622d  Query  thread_id=1  exec_time=0  error_code=0
SET TIMESTAMP=1583773589/*!*/;
DELETE FROM test WHERE id BETWEEN 1 AND ROUND((RAND() * 4) + 1)
/*!*/;
# at 1785
# 200309 17:06:29 server id 1  end_log_pos 1816 CRC32 0x9c071f47  Xid = 53
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
```

This is valid SQL that you could run by piping it to the mysql client.

You can even read the binlog from a running MySQL instance with the `--read-from-remote-server` option.
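For example (the host and user here are placeholders; the connecting user needs replication privileges):

```shell
mysqlbinlog --read-from-remote-server \
    --host=db.example.com --user=repl --password \
    mysql-bin.000001
```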

We won’t go in to detail about all the metadata in it, but instead focus on the three main formats that binlogs can be written in. These will be relevant to our replication lag investigation.

STATEMENT format

The simplest format is `STATEMENT`: the binlog contains the exact SQL statements that were applied to the database. If you look closely at the above binlog output, you will see the precise statement I ran that deletes some random rows:

```sql
DELETE FROM test WHERE id BETWEEN 1 AND ROUND((RAND() * 4) + 1);
```

So why would we need any other format? The disadvantage of `STATEMENT` format is that it’s non-deterministic: a statement can apply different values depending on where or when it is run. The `DELETE` above is a good example: it calls `RAND()`, so replaying it elsewhere could delete a different set of rows. A statement might similarly depend on the current `DATE()`.

If you want the binlog to apply exactly the same values that were applied to the database, you can set the binlog to ROW format. ROW format translates transactions into a series of statements that only update a single row with concrete values.

ROW format

Now if we run mysqlbinlog on a binlog containing the above query, instead of the DELETE statement, we get this mysterious statement instead:

```
BINLOG '
aXpmXhMBAAAAPQAAAL8CAAAAAEcAAAAAAAEAEnB1c2hlcl9kZXZlbG9wbWVudAAEdGVzdAABAwAA
mTdwwA==
aXpmXiABAAAALQAAAOwCAAAAAEcAAAAAAAEAAgAB//4CAAAA/gUAAADVbbXb
';
```

To decode this, we can run mysqlbinlog with the options `--base64-output=decode-rows --verbose`, which gives us the output:

```
### DELETE FROM `pusher_development`.`test`
### WHERE
###   @1=2
### DELETE FROM `pusher_development`.`test`
### WHERE
###   @1=5
```

giving us a textual representation of the exact modifications it will make.

The disadvantage of `ROW` binlog format is that the binlogs become much larger. For example, a single query that updates one million rows results in one million entries in the binlog.
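To make this concrete (the table name here is hypothetical):

```sql
-- In STATEMENT format this query is logged once, as-is.
-- In ROW format it is logged as one row event per matching row,
-- so a million matching rows means a million binlog entries.
UPDATE events SET processed = 1 WHERE processed = 0;
```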

MIXED format

MySQL has a third format, MIXED , which helps solve this problem. It uses STATEMENT format by default, but switches to ROW for non-deterministic updates.

RDS MySQL logs in MIXED format by default. It can be configured by setting the binlog-format variable.
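On a self-managed MySQL, a sketch of checking and changing the format looks like this (on RDS you change it through a parameter group instead, and you shouldn’t change it while replicas are attached):

```sql
-- Check the current format
SHOW GLOBAL VARIABLES LIKE 'binlog_format';

-- Change it at runtime (self-managed MySQL)
SET GLOBAL binlog_format = 'ROW';
```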

MySQL replication

Why did we just spend all this time talking about point-in-time recovery and the binlog? It’s because replication works in a very similar way.

Typically when you attach a replica to an existing database, you will first restore the state from some point-in-time snapshot. To “catch up” with the state on the master, the replica will connect to the master and start applying statements from the binlog. The replica will continue to apply these updates indefinitely.

Before discussing the process in more detail, let’s extend our diagram with some new components.

When a replica connects to the master, the master starts a “binlog dump” thread (one per replica), which reads from the binlog and asynchronously sends updates to the replica. The replica starts two threads of its own: the I/O thread, which reads updates from the binlog dump thread on the master and writes them to a relay log (a series of files in the same format as the regular binlog), and the SQL thread, which reads the relay log and applies the updates to the database.

The I/O thread keeps track of its current position in the master’s binlog by storing the current file and offset in “status log” files. That way the replica can resume replication if it disconnects or if the process is restarted.

Note that replication is single-threaded: there is only a single instance of the I/O thread and of the SQL thread. The updates are not applied in parallel like they might be on the master. We suspected early on that this was our problem, and looked into fixing it with parallel SQL threads. However, this wasn’t our eventual fix. First, this feature only parallelizes across different databases, but we only use a single database. Second, we found a different cause of lag that we were able to fix: read on!
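For completeness, this is the feature we looked at. A sketch of enabling it on the replica (in the MySQL version we were running it only parallelizes across databases; later versions can parallelize within one database via `slave_parallel_type = 'LOGICAL_CLOCK'`):

```sql
-- On the replica: run multiple SQL (applier) threads
STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 4;
START SLAVE SQL_THREAD;
```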

Isolating which thread is struggling to keep up

Now we understand enough concepts to start investigating replication lag.

The first thing to do is figure out which of these threads is struggling to keep up. You can do this by working out which binlog file, and position within that file, each thread is currently processing. In a system with no lag, the binlog files and positions should be close to identical.

There are some helpful commands that tell us where in the binlog each thread is up to.

On the master run

```
mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
    File: mysql-bin-changelog.076928
Position: 27998964
1 row in set (0.08 sec)
```

This shows the current binlog file the master is writing to, and the current position within that file.

On the replica run

```
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
       Slave_IO_State: Waiting for master to send event
      Master_Log_File: mysql-bin-changelog.076928
  Read_Master_Log_Pos: 32312254
       Relay_Log_File: relaylog.053851
        Relay_Log_Pos: 14281377
Relay_Master_Log_File: mysql-bin-changelog.076928
1 row in set (0.08 sec)
```

`Master_Log_File` and `Read_Master_Log_Pos` show how far into the master’s binlog the I/O thread has read. If this is behind the file and position reported by `SHOW MASTER STATUS` on the master, the I/O thread is struggling to keep up.

`Relay_Log_File` and `Relay_Log_Pos` show the current position in the relay logs that the SQL thread is reading from. This is less interesting because it doesn’t relate to the position in the binlog. What is more interesting is `Relay_Master_Log_File`, which shows the binlog file containing the statement that the SQL thread last executed. If this is behind, the SQL thread is struggling to keep up.
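Putting these fields together, the comparison is roughly:

```sql
-- On the master:
SHOW MASTER STATUS\G   -- File / Position: where the master is writing

-- On the replica:
SHOW SLAVE STATUS\G
-- Master_Log_File / Read_Master_Log_Pos far behind the master's
--   File / Position?                 -> the I/O thread is lagging
-- Relay_Master_Log_File (and Exec_Master_Log_Pos) far behind
--   Master_Log_File?                 -> the SQL thread is lagging
```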

Putting it all together to isolate the lag

In all cases it is worth checking that there is no obvious resource saturation (CPU, memory) on the master and replica. The replica should also be a machine with the same resources as the master.

If the I/O thread is slow it likely means the network connection between the master and replica is unreliable or saturated.

In our case it was the SQL thread that was struggling. Unfortunately there can be a number of causes of this, and it can be tricky to isolate. This blog post has some good recommendations of things to look into.

During our investigation we noticed that the lag started soon after attaching the third-party MySQL instance (Fivetran) as a replica of our replica (you can daisy-chain replication like this). However, it wasn’t immediately obvious what specifically about this would cause our master-to-replica replication to run more slowly.

By default, replicas don’t write their own binlog. However, once we attached a downstream replica, our replica had to start writing one in order to serve as a source itself.

However writing to a binlog didn’t explain why it couldn’t keep up - after all, the master was writing to a binlog and was keeping up with writes with the same hardware.

After some investigation it turned out that the Fivetran MySQL instance required us to send the replication events in ROW format. The master was using MIXED format. This meant the SQL thread on the replica was performing significantly more disk writes than the master. We use AWS EBS as the disk, so these additional writes were particularly expensive as they are made over a LAN.

We were still under our provisioned EBS IOPS, but since all the writes were coming from a single threaded process, we couldn’t effectively make use of any parallelism in the OS or hardware.

So how did we solve the problem?

We realized that by default, RDS MySQL flushes the binlog updates to disk on each write. This makes a lot of sense for point-in-time recovery: if the database crashes before the binlog entry is written to disk, then that write will not make it into a new database if a point-in-time restore is performed.

However for our replicas we don’t need synchronous writes to disk. If we want to restore a replica, we will do that from a master snapshot. Luckily the MySQL sync_binlog option lets you turn off flushing on each write, and instead leaves it to the OS.

You can set this in the replica’s MySQL config file (e.g. /etc/mysql/my.cnf ) with

```
[mysqld]
sync_binlog = 0
# ...
```

Or dynamically by running the query

```sql
SET GLOBAL sync_binlog = 0;
```
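You can confirm the setting took effect with:

```sql
SHOW GLOBAL VARIABLES LIKE 'sync_binlog';
```

Note that this trade-off is only safe because the replica’s binlog is expendable; the master’s binlog is what point-in-time recovery depends on, so we left the flushing policy there alone.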

This is what we did, and our database rapidly crunched through all of the lag it had accumulated!

Thanks to Jim Fisher for reviewing this and helping with the introduction.