I talk with a lot of people who are really interested in Percona XtraDB Cluster (PXC), and mostly they are interested in PXC as a high-availability solution. But what they tend not to think much about is whether moving from asynchronous to synchronous replication is right for their application.

Facts about Galera replication

There are a lot of different facts about Galera that come into play here, and it isn’t always obvious how they will affect your database workload. For example:

Transaction commit takes approximately the worst packet round trip time (RTT) between any two nodes in your cluster.

Transaction apply on slave nodes is still asynchronous from client commit (except on the original node where the transaction is committed).

Galera prevents conflicting writes to these pending transactions while they are in flight, in the form of deadlock errors. (This is actually a form of Eventual Consistency where the client is forced to correct the problem before it can commit. It is NOT the typical form of Eventual Consistency, known as asynchronous repair, that most people think of.)

Callaghan’s Law

But what does that all actually mean? Well, at the Percona Live conference a few weeks ago I heard a great maxim that really helps encapsulate a lot of this information and puts it into context with your application workload:

[In a Galera cluster] a given row can’t be modified more than once per RTT

This was attributed to Mark Callaghan from Facebook by Alexey Yurchenko from Codership at his conference talk. Henceforth this will be known as “Callaghan’s law” in Galera circles forever, though Mark didn’t immediately recall saying it.

Applied to a standalone InnoDB instance

Let’s break it down a bit. Our unit of locking in InnoDB is a single row (well, the PRIMARY KEY index entry for that row). This means that on a single InnoDB node we can typically have all sorts of modifications in flight as long as they don’t touch the same row. Row locks are held for modifications until the transaction commits, and that takes an fsync to the redo log by default. So, applying Callaghan’s law to single-server InnoDB, we’d get:

[On a single node InnoDB server] a given row can’t be modified more than once per fsync

You can obviously relax that by simply not fsyncing every transaction (innodb_flush_log_at_trx_commit != 1), or work around it by fsyncing to memory (battery or capacitor-backed write cache), etc., but the principle is basically the same. If we want this transaction to persist after a crash, it has to get to disk.

This has no effect on standard MySQL replication from this instance, since MySQL replication is asynchronous.

What about semi-sync MySQL replication?

It’s actually much worse than Galera. As I illustrated in a blog post last year, semi-sync must serialize all transactions and wait for them one at a time. So, Callaghan’s law applied to semi-sync is:

[On a semi-sync replication master] you can’t commit (at all) more than once per RTT.

Applied to a Galera cluster

In the cluster we’re protecting the data as well, though not by ensuring it goes to disk (though you can do that). We protect the data by ensuring it gets to every node in the cluster.

But why every node and not just a quorum? Well, it turns out transaction ordering really, really matters (really!). By enforcing replication to all nodes, we can (simultaneously) establish global ordering for the transaction, so by the time the original node gets acknowledgement of the transaction back from all the other nodes, a GTID will also (by design) be established. We’ll never end up with non-deterministic ordering of transactions as a result.

So this brings us back to Callaghan’s law for Galera. We must have group communication to replicate and establish global ordering for every transaction, and the expense of doing that for Galera is approximately one RTT between the two nodes in the cluster that are furthest apart (regardless of where the commit comes from!). The least amount of data we can change in InnoDB at a time is a single row, so the most any single row can be modified cluster-wide is once per RTT.

What about WAN clusters?

Callaghan’s law applies to WAN clusters as well. LANs usually have sub-millisecond RTTs. WANs usually have anywhere from a few ms up to several hundred. This really will open a large window where rows won’t be able to be updated more than just a few times a second at best.
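To put rough numbers on that window, here is a minimal Python sketch that turns an RTT into the per-row ceiling Callaghan’s law implies. The function name and the sample RTT values are illustrative assumptions, not measurements from a real cluster.

```python
def max_row_updates_per_second(rtt_ms: float) -> float:
    """Upper bound on updates per second to ONE row, given the worst RTT in ms."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    # One commit per round trip, so the ceiling is simply 1000 ms / RTT.
    return 1000.0 / rtt_ms


# Assumed example RTTs: a LAN, a nearby WAN link and a distant WAN link.
for label, rtt_ms in [("LAN", 0.5), ("regional WAN", 20.0), ("long-haul WAN", 200.0)]:
    ceiling = max_row_updates_per_second(rtt_ms)
    print(f"{label}: RTT {rtt_ms} ms -> at most {ceiling:.0f} updates/s to a given row")
```

With a 200 ms RTT the ceiling works out to 5 updates per second for any single row, and that is the best case, before any apply lag.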

Some things the rule does not mean on Galera

It does NOT mean you can’t modify different rows simultaneously. You can.

It does NOT mean you can’t modify data on multiple cluster nodes simultaneously. You can.

It does NOT set a lower bound on performance, only an upper bound. The best performance you can expect is modifying a given row once per RTT; it could get slower if apply times start to lag.

So what about my application?

Think about your workload. How frequently do you update any given row? We call rows that are updated heavily “hotspots”.

Examples of hotspots

Example 1: Your application is an online game and you keep track of global achievement statistics in a single table with a row for each stat; there are just a few hundred rows. When a player makes an achievement, your application updates this table with a statement like this:

UPDATE achievements SET count = count + 1 WHERE achievement = 'killed_troll';

How many players might accomplish this achievement at the same time?

Example 2: You have users and groups in your application. These are maintained in separate tables and there also exists a users_groups table to define the relationship between them. When someone joins a group, you run a transaction that adds the relationship row to users_groups, but also updates groups with some metadata:

BEGIN;
INSERT INTO users_groups (user_id, group_id) VALUES (100, 1);
UPDATE groups SET last_joined=NOW(), last_user_id=100 WHERE id=1;
COMMIT;

How often might multiple users join the same group?

Results

In both of the above examples you can imagine plenty of concurrent clients attempting to modify the same record at once. But what will actually happen to the clients who try to update the same row within the same RTT? This depends on which node in the cluster the writes are coming from:

From the same node: This will behave just like standard InnoDB. The first transaction will acquire the necessary row locks while it commits (which will take one RTT). The other transactions will lock wait until the lock(s) they need are available. The application just waits in those cases.

From other nodes: First to commit wins. The others that try to commit AFTER the first and while the first is still in the local apply queue on their nodes will get a deadlock error.

So, the best case (which may not be best for your application database throughput) will be more write latency into the cluster. The worst case is that your transactions won’t even commit and you have to take some action you normally wouldn’t have had to do.
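As a back-of-the-envelope illustration of the same-node case, the sketch below estimates how long the Nth writer queued on one hot row waits, assuming each commit holds the row lock for one full RTT and nothing else adds delay. The function name and the numbers are assumptions made for illustration only.

```python
def same_node_commit_latency_ms(queue_position: int, rtt_ms: float) -> float:
    """Rough commit latency for the Nth writer (1-based) queued on one hot row.

    Simplifying assumption: writers are served strictly one at a time and
    each holds the row lock for exactly one RTT.
    """
    if queue_position < 1:
        raise ValueError("queue_position is 1-based")
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    return queue_position * rtt_ms


# Ten clients hit the same row at once on a cluster with an assumed 50 ms RTT.
last_writer_ms = same_node_commit_latency_ms(10, 50.0)
print(f"The 10th writer waits roughly {last_writer_ms:.0f} ms")
```

Under these assumptions the tenth writer waits about half a second, which is the "more write latency" best case described above.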

Workarounds

If your hotspots were really bad in standalone InnoDB, you might consider relaxing the fsync: set innodb_flush_log_at_trx_commit to something besides 1 and suddenly you can update much faster. I see this tuning very frequently for “performance” reasons when data durability isn’t as crucial. This is fine as long as you weigh both options carefully.

But in Galera you cannot relax synchronous replication. You can’t change the law; you can only adapt around it. But how might you do that?

Write to one node

If your issue is really the deadlock errors and not so much the waiting, you could simply send all your writes to one node. This should prevent the deadlock errors, but will not change the lock waiting that your application will need to do for hotspots.

wsrep_retry_autocommit

If your hotspots are all updates with autocommits, you can rely on wsrep_retry_autocommit to auto-retry the transactions for you. However, each autocommit is retried only the number of times specified by this variable (default is 1 retry). This means more waiting, and after the limit is exceeded you will still get the deadlock error.

This is not implemented for full BEGIN … COMMIT multi-statement transactions, because the application may be applying logic between the statements that is not safe to retry once the database state has changed.

Retry deadlocks

Now we start to get into (*gasp*) territory where your application needs to be modified. Generally, if you use InnoDB, you should be able to handle deadlock errors in your application. Raise your hands if your application has that logic (I usually get fewer than 5 people who do out of 100).

But what to do? Retrying automatically or giving your end user a chance to retry manually are typical answers. However, this means more latency waiting for a write to go through, and possibly a poor user experience.
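A minimal sketch of the automatic-retry option might look like the following. DeadlockError, run_with_retry and join_group are hypothetical names; in a real application you would catch whatever exception your database driver raises for MySQL error 1213 (the deadlock error code) and re-run the whole transaction from the start.

```python
import random
import time

MYSQL_ER_LOCK_DEADLOCK = 1213  # MySQL's error code for a deadlock


class DeadlockError(Exception):
    """Hypothetical stand-in for the driver exception that carries error 1213."""


def run_with_retry(transaction, max_attempts=3, base_delay_s=0.01, sleep=time.sleep):
    """Call transaction() and re-run it from the start if it raises DeadlockError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DeadlockError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller or user
            # Back off with jitter so competing clients do not retry in lockstep.
            sleep(base_delay_s * attempt * random.uniform(0.5, 1.5))


attempts = {"count": 0}


def join_group():
    """Hypothetical transaction body that hits a deadlock on its first two tries."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise DeadlockError(MYSQL_ER_LOCK_DEADLOCK)
    return "committed"


# Sleeping is stubbed out here so the example runs instantly.
result = run_with_retry(join_group, sleep=lambda seconds: None)
print(result, "after", attempts["count"], "attempts")
```

The important design point is that the entire transaction is retried, including any application logic between statements, since the data it read may have changed.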

Batch writes

Instead of updating global counters one at a time (from Example 1, above), how about maintaining the counter in memcache or redis and only flushing to the database periodically?

if ( $last_count % 100 == 0 ) {
    $db->do( "UPDATE achievements SET count = $last_count WHERE achievement = 'killed_troll'" );
}
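The same batching idea can be sketched in Python. Note the assumptions: this version keeps the pending count in process memory rather than in memcache or redis, and it flushes a delta (suitable for a statement like SET count = count + delta) instead of an absolute value. BatchedCounter and the flush callback are hypothetical names, and a plain list stands in for the database.

```python
class BatchedCounter:
    """Accumulate increments in memory and hand them to the database in batches."""

    def __init__(self, flush, batch_size=100):
        self._flush = flush            # callable that receives the pending delta
        self._batch_size = batch_size
        self._pending = 0

    def increment(self, amount=1):
        self._pending += amount
        if self._pending >= self._batch_size:
            self.flush()

    def flush(self):
        """Write out whatever is pending; also call this on shutdown."""
        if self._pending:
            delta, self._pending = self._pending, 0
            self._flush(delta)


flushed = []  # stand-in for the database: records each delta that would be written

counter = BatchedCounter(flush=flushed.append, batch_size=100)
for _ in range(250):
    counter.increment()
counter.flush()  # push the remainder that never reached a full batch

print(f"{len(flushed)} writes instead of 250, total {sum(flushed)}")
```

The tradeoff is durability: increments still pending in memory are lost if the process dies before the next flush.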

Change your schema

In Example 2, above, how about recording the join time in a ‘joined’ column on the users_groups table so we don’t need to update the parent group row so often?

INSERT INTO users_groups (user_id, group_id, joined) VALUES (100, 1, NOW());

Conclusion

Choosing a system to replicate your data to a distributed system requires tradeoffs. Most of us are used to the tradeoffs we make when deploying conventional stand-alone MySQL InnoDB with asynchronous slaves. We may not think about the tradeoffs, but we’re making them (anyone obsessively testing slave position to ensure it’s caught up with the master?).

Synchronous replication with PXC and Galera is no different in that there are tradeoffs; they just aren’t the ones we commonly expect.

If Callaghan’s law is going to cause you trouble and you are not prepared to adapt to work with it, PXC/Galera synchronous replication is probably not right for you.