maria-developers team mailing list archive

Thoughts on Global Transaction ID, parallel slave, and multi-source replication

To : maria-developers@xxxxxxxxxxxxxxxxxxx

: maria-developers@xxxxxxxxxxxxxxxxxxx From : Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx>

: Kristian Nielsen <knielsen@xxxxxxxxxxxxxxx> Date : Thu, 09 Aug 2012 11:34:44 +0200

: Thu, 09 Aug 2012 11:34:44 +0200 User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.4 (gnu/linux)

This is a writeup of some thoughts I have on getting Global Transaction ID in MariaDB, as well as some related thoughts on parallel slave and multi-source replication. MySQL 5.6 has global transaction ID now, and I looked into it in some details. Unfortunately, I _really_ did not like what I saw. There are two main problems, in my opinion: 1. The basic design is bad, leading to an excessively complicated implementation. 2. The implementation looks very incomplete/immature, to the point that I am amazed that anyone would push code like this into a main tree. Point 2 means that I really do not want to maintain this piece of **** in MariaDB, though we can hope that things will improve in subsequent releases. Point 1 leads me to suggest that instead of merging the MySQL 5.6 implementation, we could do our own implementation, using a design that is both much simpler, and at the same time more flexible that what is in MySQL 5.6. I will elaborate further on these points in the rest of the mail. ---- With respect to point 2, let me quote this little gem from the MySQL 5.6 code: /* Apparently this code is not being called. We need to investigate if this is a bug or this code is not necessary. /Alfranio */ DBUG_ASSERT(0); /NOTREACHED/ Now, certainly, there are dark corners of the MySQL replication code, and adding a comment/assert like this could be justified if a developer encounters it and finds it may be dead code. But this is _new_ code, added in a new 5.6 release! So the authors of the new feature, by their own statement, have no idea what the new code they are pushing into the main tree is actually doing. That is just insane. I have no idea what the replication team at MySQL is up to these days pulling stunts like this, I have to confess I am disappointed. I found many similar examples in the 5.6 code. I wrote up a more detailed analysis in TODO-171, where I elaborate more on the problems I see with the MySQL 5.6 Global Transaction ID code: https://mariadb.atlassian.net/browse/TODO-171 When the authors do not understand the code themselves, it really does not seem attractive for us to try to support and bugfix it. I also found a lot of what I think are left-over, unsused parts of an even more complicated design that was apparently abandoned (or postponed?), but left in the code, which does not make supporting this any easier. ---- To explain point 1, I first need to mention the MySQL 5.6 feature of parallel slave apply, as I believe that a major factor in the design of MySQL Global Transaction ID has been the need to interoperate with that feature. The MySQL parallel replication slave feature works so that transactions on tables in different databases can be run in parallel with each other (incidentally, there is some indication that the MySQL replication team considers this *the* meaning of parallel slave, which is sad, as it is just one piece of the general problem of replication slave scalability - it helps in some workloads, but it is useless for others). So with parallel slave, transactions on the master can be executed in a different order on the slave. This means that the crucial notion of a slaves current position in the replication stream becomes more complicated - it is not just a single point in the master binlog, rather every slave replication thread can be at a different point in the master binlog. MySQL 5.6 solves this by introducing "Gtid sets". A Gtid is a MySQL 5.6 global transaction ID - a unique, 24-byte identifier for every transaction. The state of a replication slave at any given moment is then the set Gtids of all transactions ever executed on that slave server This notion of having a set of all transactions ever executed really complicates the design. There is a need to represent such a set as basically intervals of "holes" in the Gtid sequence, to avoid the memory representation of such sets growing without bound over time. This in turns introduces the requirement that there can _never_ be holes in the sequence of Gtids generated on a master - if there ever is introduced a hole, slaves will wait around for ever waiting for a transaction containing the missing Gtid. I think recovering from errors with such a concept will be forever a nightmare. I think this basic fundament of the design is the main source of my dislike for the MySQL global transaction ID. It sounds nice on a first look, but once you start to think deeper on the problem it just does not seem to work well. ---- So I think I have a better idea of how to make Global Transaction ID work well with parallel slave based on executing transactions in different databases out-of-order. The MySQL Gtid is defined as a pair of server-id and sequence number. Every master server has a single server-id (despite no less than 128 bits being allocated for this - anyone planning a replication hierarchy with 2**128 servers?). The master generates a single stream of events in the binlog, though streams from other servers may be interleaved with it in case of multi-master. It is then the task of a slave to try to split up the single master stream in to multiple streams that can be executed in parallel. I think this is the root of the problem. The slave is the wrong place to do this splitting! It is on the master that we know how to interpret transactions (ie. that transactions in different databases are / are not independent), so doing this on the slave is harder. And two slaves can in general split differently, which makes it harder for a slave deeper in the replication hierarchy to switch to a new master (that master being itself a slave of a higher-up master), leading to the need for a complex set-of-all-executed-transactions concept. Suppose we instead make the decision on the _master_ on how to split up the replication stream for parallel apply. Instead of a single server-id for the master, we can have multiple - let us call them "Source ID". On the _master_, we will decide that two transactions in different databases are independent, and assign them different Source ID. The slave then receives multiple replication streams, and is free to execute each stream in parallel - or not, at the discretion of the DBA. Transactions _within_ a single stream are always executed in-order, so the slave has no need to remember any sets of all transactions ever executed - all it needs to remember is the sequence number within each stream it has seen - ie. last sequence number for each Source ID. This is a simple and obvious generalisation of the old concept of current binlog position. Downstream slaves see the exact same binlog sequences, each with transactions in the same order (though sequences may be interleaved differently) - so changing masters is simpler, all servers in the hierarchy have the same view. So to implement the same functionality as MySQL 5.6 parallel replication, all we need is on the master to generate Source ID as a hash of the database name (just as MySQL 5.6 assigns transactions to slave threads based on such hash of the database name). And implement a facility on the slave such that transactions with different Source ID can be executed in parallel by multiple slave threads. And now, since we decide on the master, we can provide much more flexibility for the application. We can have _some_ databases db1,db2 be executed in parallel on the slave - while other databases db3,db4 are considered dependent, and not modified out-of-order on the slave. Or the application can declare that two transactions are independent within the same database or even within the same table; for example log record inserts into a table can often be allowed to happen in different order on slaves. Or the DBA can assign a special Source ID to a long-runnign ALTER TABLE, and such ALTER TABLE can then execute independently on slaves to not stall the normal load. All we need is to provide a simple facility (eg. session variable) that applications/users can use to specify Source ID. And then the next step is multi-source replication, which is the much-request feature that one slave can replicate from two or more different masters at the same time. The different masters will just supply different Source IDs, so apart from the need to maintain multiple TCP connections, one to each master, from the rest of the code's point of view multi-source replication is no different, it is just more independent replication streams. So my suggestion is: rather than struggle to try and merge and support the MySQL 5.6 Global Transaction ID and parallel slave, let us instead implement something like the above idea, with multiple Source IDs on the master, and parallel execution of distinct replication streams on the slave. I believe this will be no more work, and the end result will be both simpler and more flexible. Thanks, - Kristian.

Follow ups