Statement-based vs Row-based Replication

Replication as most people know it, has mostly been SQL statement propagation from master to slave. This is known as “statement-based” replication. But there is also another kind of replication that is available, “the row-based replication” and that has quite a lot of benefits. In this post I intend on highlighting the advantages and disadvantages of both the types of replication to help you choose the best one. I also follow up with my own recommendation.

Let’s start off with discussing both the types.

Statement-based Replication

With statement-based replication, every SQL statement that could modify data is logged on the master. Then those SQL statements are replayed on the slaves against the same dataset and in the same context. The statement-based replication corresponds to the statement-based binary logging format.

Row-based Replication

With row-based replication, every “row modification” is logged on the master and is then applied on the slave. The keywords here are “row modification”, which implies that row-based replication is physical, in the sense that SQL statements that change the rows are not recorded, instead the entire updated row is written to the binary log. But there are a few exceptions, when a new table is created, dropped, or altered, the actual SQL statement is recorded. The row-based replication corresponds to row-based binary logging format.

Now that I have defined both the types of replication, let’s start having a look at the advantages and disadvantages of both the approaches.

Advantages of Statement-based Replication

Following are some of the advantages of statement-based replication. All of these advantages come down from the fact that the SQL statements are logged:

There is always less data that is to be transferred between the master and the slave.

There is less space taken up in the update logs.

There is no need to deal with the row format.

Also, auditing the database is easy, because statements that made any changes to the data are all logged in the binary log.

Disadvantages of Statement-based Replication

Following are some of the disadvantages of statement-based replication:

The single biggest disadvantage of statement-based replication is the data-inconsistency issue between the master and the slave that creeps up due to the way this kind of replication works. Because we are logging the SQL statements, it is also necessary to log context information, so that the updates produce the same results on the slave as they did originally on the master. But in some cases it is not possible to provide any such context. Any nondeterministic behavior, is not going to have any such context present and hence is difficult to replicate using statement-based replication.

Let me quote an example here from the MySQL manual:

“For example, for INSERT … SELECT with no ORDER BY, the SELECT may return rows in a different order (which results in a row having different ranks, hence getting a different number in the AUTO_INCREMENT column), depending on the choices made by the optimizers on the master and slave.”

Let me quote an example here from the MySQL manual: “For example, for INSERT … SELECT with no ORDER BY, the SELECT may return rows in a different order (which results in a row having different ranks, hence getting a different number in the AUTO_INCREMENT column), depending on the choices made by the optimizers on the master and slave.” With statement-based replication, you are bound to encounter issues with replicating stored routines or triggers, and hence this kind of replication does not always work with stored routines and triggers.

There is a performance penalty in the case of INSERT … SELECT, because in the case of statement-based replication this kind of statement requires a greater number of row-level locks as compared to row-based replication.

There is a lot of execution context information that is required in order for the updates to produce the same results on the slave as they did originally on the master.

A statement that depends on UDFs or stored routines that are nondeterministic, cannot be replicated properly, since the value returned by such a UDF or stored routine is not always the same, for similar parameters.

Advantages of Row-based Replication

Following are the major advantages of row-based replication:

With row-based replication, each and every change can be replicated and hence this is the safest form of replication.

Because every row update is physically logged, hence there no need for any context information. The only thing that is needed is to know which record is being updated and what is the update that is being written to that record.

There are fewer row locks required on the master, which thus achieves high concurrency.

The problems with auto_increment columns, timestamps, stored routines, and triggers don’t bother us with this kind of replication.

Statements that update very few rows are very fast.

Disadvantages of Row-based replication

Following are the disadvantages of row-based replication:

On a system that frequently updates large number of rows such as,

UPDATE products set status='sold' where product_id BETWEEN 30000 and 50000;

row-based replication produces very large update logs and generates a lot of network traffic between the master and the slave.

row-based replication produces very large update logs and generates a lot of network traffic between the master and the slave. This kind of replication requires a lot of awareness of the internal row format.

In cases of very large updates, the performance overhead associated with the increased I/O required to write large update logs could become unacceptable.

You cannot examine the logs to audit changes to the database, because SQL statements are not logged, instead the data is logged in binary format.

Conclusion

Although both the types have advantages and disadvantages, but for me the advantage that row-based replication offers in terms of data consistency between master and slave, far outweighs any of the disadvantages. Though you might point out the fact that large updates produce large update logs in case of row-based replication, but those cases in real-world would be far and few and not very frequent. Also, stored routines and triggers are increasingly being used after their introduction in MySQL and row-based replication allows us to use them without thinking about them being unsafe for replication. Also everyone loves high-concurrency, don’t they and that is something that row-based replication achieves. And as far as auditing of data updates is concerned you could use “mysqlbinlog” to help you decode the binary logs and figure out the changes to data.

There are also other host of optimizations that are possible due to the way how this kind of replication works, which I will be discussing in a future post.

So my recommendation is go for row-based replication, you are going to love it!