Photo by Jakub T. Jankiewicz, distributed under CC BY-SA 2.0

Incident

In a project far, far away, our Customer was using MariaDB as the RDBMS for a crucial service that all other core business services depended on. It behaved nicely on production, until one day it didn’t.

On that day we had a massive problem with the part of the application that was sending information to a Kafka topic. The batch job started slowing down significantly. Logically the job was part of the crucial service, but physically it was a separate Java application running on its own machine. Its DB queries and modifications were timing out.

To make things more complicated, we noticed that other parts of the crucial service were behaving just fine. After a couple of hours, the problems suddenly disappeared.

Investigation

We started the investigation. Fortunately we had lots of metrics in Grafana, from host CPU load to custom application metrics. After excluding all other possibilities, we discovered a spike in the InnoDB metric named “History length”. The spike’s start and end times correlated with the incident. In fact, the spike started a couple of seconds before the incident, which pointed to it as a possible cause.

Not that particular spike, but a similar one that occurred another time

After research and a code review we discovered why the batch job was special: it was using the “read committed” transaction isolation level, which was fine for that job. The catch is that the default transaction isolation level in MariaDB/MySQL is “repeatable read”. What does repeatable read mean? According to the documentation, “Consistent reads within the same transaction read the snapshot established by the first read” [1]. So InnoDB effectively remembers select results for long-lasting transactions. How is this achieved?
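To see the difference in practice, here is a minimal sketch that can be run in two MySQL/MariaDB client sessions (the accounts table and its values are made up for illustration):

-- Session A, default repeatable read; assume balance starts at 100:
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1; -- returns 100, snapshot established

-- Session B, meanwhile (autocommit):
UPDATE accounts SET balance = 200 WHERE id = 1;

-- Session A again:
SELECT balance FROM accounts WHERE id = 1; -- still 100, read from the snapshot
COMMIT;

Under read committed, the second select in session A would already return 200, because every consistent read gets a fresh snapshot.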

This is implemented by keeping old versions of rows as they are modified. They’re kept in a linked list. The most recent version points to the previous one, which points to the previous one, and so on. Each version has a transaction ID with it so when InnoDB goes looking for a row, it compares your current transaction ID to the row version and selects the right one for you. This can be done without locking. [3]

So if a long transaction holds its snapshot while the underlying tables are frequently modified, the history length grows rapidly: InnoDB has to keep every old row version that the snapshot might still need, so the purge thread cannot clean them up. The usual level of that metric for our service is between 1000 and 2000; during the incident it was rising continuously and reached over 300k. So we knew that the cause of the spike was a very long SQL transaction with the repeatable read isolation level (actually it’s a common issue, as Google search results suggest). Then we discovered that the transaction had been started by one of the developers to prepare some statistics for the board of directors. The query wasn’t heavy and it finished shortly, but the SQL transaction wasn’t closed afterwards; it was closed many hours later, probably when the SQL client was shut down.
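If you hit a similar spike, both the metric and the offending transaction can be inspected from SQL. A minimal sketch, assuming the standard information_schema tables of MySQL/MariaDB (the trx_rseg_history_len counter corresponds to the “History list length” line of SHOW ENGINE INNODB STATUS):

-- Current history list length (old row versions not yet purged):
SELECT name, count
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';

-- Oldest open transactions; a forgotten one shows up with an old
-- trx_started, and trx_mysql_thread_id identifies the client session:
SELECT trx_id, trx_started, trx_state, trx_isolation_level, trx_mysql_thread_id
FROM information_schema.INNODB_TRX
ORDER BY trx_started
LIMIT 5;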

What surprised us is the fact that the huge history length affected, in such a dramatic way, only the transactions with the read committed isolation level, not the other transactions with repeatable read. If you have a good explanation, please leave a comment. My best guess is that those transactions were affected too, just not on such a scale.

Solution

The first action point was informing that particular developer and writing a wiki page for other people working with that database: if they are working on reports or just executing selects on the production DB, they should use the read committed transaction isolation level as well as the read only flag, to minimize their performance footprint.

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
SET SESSION TRANSACTION READ ONLY;
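Depending on the server version, both characteristics can also be set in a single statement, or scoped to just one transaction:

SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED, READ ONLY;
-- or, for a single transaction only:
START TRANSACTION READ ONLY;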

The second step was adding an alert on the transaction history length metric, with the threshold set at 10k. Since that incident it has fired a few times, but we were able to react before things got serious.
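If your monitoring can poll SQL directly, the check behind such an alert can be as simple as the sketch below (the 10k threshold is just the value that worked for us):

SELECT count > 10000 AS history_length_alert
FROM information_schema.INNODB_METRICS
WHERE name = 'trx_rseg_history_len';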

References:

[1] Transaction isolation levels in the MySQL documentation: https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html

The posts that helped us understand the cause