Event sourcing. Low level approach. XA transactions

Vadim: Sounds really great. Actually, we have some time and I would like to discuss one of your recent articles about online event processing. You’re a great supporter of the idea of event sourcing, is that correct?

Martin: Yes, sure.

Vadim: Nowadays this approach is getting momentum, and in the pursuit of all the advantages of globally ordered log of operations, many engineers try to deploy it everywhere. Could you please describe some cases where event sourcing is not the best option? Just to prevent its misuse and possible disappointment with the approach itself.

Martin: There are two different layers of the stack that we need to talk about first. Event sourcing, as proposed by Greg Young and some others, is intended as a mechanism for data modeling, that is: if you have a database schema and you’re starting to lose control of it because there are so many different tables and they’re all getting modified by different transactions — then event sourcing is a way of bringing better clarity to this data model, because the events can express very directly what is happening at a business level. What is the action that the user took? And then, the consequences of that action might be updating various tables and so on.Effectively, what you’re doing with event sourcing is you’re separating out the action (the event) from its effects, which happen somewhere downstream.

I’ve come to this area from a slightly different angle, which is a lower-level point of view of using systems like Kafka for building highly scalable systems. This view is similar in the sense that if you’re using something like Kafka you are using events, but it doesn’t mean you’re necessarily using event sourcing. And conversely, you don’t need to be using Kafka in order to do event sourcing; you could do event sourcing in a regular database, or you could use a special database that was designed specifically for event sourcing. So these two ideas are similar, but neither requires the other, they just have some overlap.

The case for wanting to use a system like Kafka is mostly the scalability argument: in that case you’ve simply got so much data coming in that you cannot realistically process it on a single-node database, so you have to partition it in some way, and using an event log like Kafka gives you a good way of spreading that work over multiple machines. It provides a good, principled way for scaling systems. It’s especially useful if you want to integrate several different storage systems. So if, for example, you want to update not just your relational database but also, say, a full-text search index like Elasticsearch, or a caching system like Memcached or Redis or something like that, and you want one event to have an updating effect on all of these different systems, then something like Kafka is very useful.

In terms of the question you asked (what are the situations in which I would not use this event sourcing or event log approach) — I think it’s difficult to say precisely, but as a rule of thumb I would say: use whatever is the simplest. That is, whatever is closest to the domain that you’re trying to implement. And so, if the thing you’re trying to implement maps very nicely to a relational database, in which you just insert and update and delete some rows, then just use a relational database and insert and update and delete some rows. There’s nothing wrong with relational databases and using them as they are. They have worked fine for us for quite a long time and they continue to do so. But if you’re finding yourself in a situation where you’re really struggling to use that kind of database, for example because the complexity of the data model is getting out of hand, then it makes sense to switch to something like an event sourcing approach.

And similarly, on the lower level (scalability), if the size of your data is such that you can just put it in PostgreSQL on a single machine — that’s probably fine, just use PostgreSQL on a single machine. But if you’re at the point where there is no way that a single machine can handle your load, you have to scale across a large system, then it starts making sense to look into more distributed systems like Kafka. I think the general principle here is: use whatever is simplest for the particular task you’re trying to solve.

Vadim: It’s really good advice. As your system evolves you can’t precisely predict the direction of development, all the queries, patterns and data flows.

Martin: Exactly, and for those kinds of situations relational databases are amazing, because they are very flexible, especially if you include the JSON support that they now have. PostgreSQL now has pretty good support for JSON. You can just add a new index if you want to query in a different way. You can just change the schema and keep running with the data in a different structure. And so if the size of the data set is not too big and the complexity is not too great, relational databases work well and provide a great deal of flexibility.

Vadim: Let’s talk a little bit more about event sourcing. You mentioned an interesting example with several consumers consuming events from one queue based on Kafka or something similar. Imagine that new documents get published, and several systems are consuming events: a search system based on Elasticsearch, which makes the documents searchable, a caching system which puts them into key-value cache based on Memcached, and a relational database system which updates some tables accordingly. A document might be a car selling offer or a realty advert. All these consuming systems work simultaneously and concurrently.

Martin: So your question is how do you deal with the fact that if you have these several consumers, some of them might have been updated, but the others have not yet seen an update and are still lagging behind slightly?

Vadim: Yes, exactly. A user comes to your website, enters a search query, gets some search results and clicks a link. But she gets 404 HTTP status code because there is no such entity in the database, which hasn’t been able to consume and persist the document yet.

Martin: Yes, this is a bit of a challenge actually. Ideally, what you want is what we would call “causal consistency” across these different storage systems. If one system contains some data that you depend on, then the other systems that you look at will also contain those dependencies. Unfortunately, putting together that kind of causal consistency across different storage technologies is actually very hard, and this is not really the fault of event sourcing, because no matter what approach or what system you use to send the updates to the various different systems, you can always end up with some kind of concurrency issues.

In your example of writing data to both Memcached and Elasticsearch, even if you try to do the writes to the two systems simultaneously you might have a little bit of network delay, which means that they arrive at slightly different times on those different systems, and get processed with slightly different timing. And so somebody who’s reading across those two systems may see an inconsistent state. Now, there are some research projects that are at least working towards achieving that kind of causal consistency, but it’s still difficult if you just want to use something like Elasticsearch or Memcached or so off the shelf.

A good solution here would be that you get presented, conceptually, with a consistent point-in-time snapshot across both the search index and the cache and the database. If you’re working just within a relational database, you get something called snapshot isolation, and the point of snapshot isolation is that if you’re reading from the database, it looks as though you’ve got your own private copy of the entire database. Anything you look at in the database, any data you query will be the state as of that point in time, according to the snapshot. So even if the data has afterwards been changed by another transaction, you will actually see the older data, because that older data forms part of a consistent snapshot.

And so now, in the case where you’ve got Elasticsearch and Memcached, really what you would ideally want is a consistent snapshot across these two systems. But unfortunately, neither Memcached nor Redis nor Elasticsearch have an efficient mechanism for making those kinds of snapshots that can be coordinated with different storage systems. Each storage system just thinks for itself and typically presents you the latest value of every key, and it doesn’t have this facility for looking back and presenting a slightly older version of the data, because the most recent version of the data is not yet consistent.

I don’t really have a good answer for what the solution would look like. I fear that the solution would require code changes to any of the storage systems that participate in this kind of thing. So it will require changes to Elasticsearch and to Redis and to Memcached and any other systems. And they would have to add some kind of mechanism for point-in-time snapshots that is cheap enough that you can be using it all the time, because you might be wanting the snapshot several times per second — it’s not just a once-a-day snapshot, it’s very fine-grained. And at the moment the underlying systems are not there in terms of being able to do these kinds of snapshots across different storage systems. It’s a really interesting research topic. I’m hoping that somebody will work on it, but I haven’t seen any really convincing answers to that problem yet so far.

Vadim: Yeah, we need some kind of shared Multiversion Concurrency Control.

Martin: Exactly, like the distributed transaction systems. XA distributed transactions will get you some of the way there, but unfortunately XA, as it stands, is not really very well suited because it only works if you’re using locking-based concurrency control. This means that if you read some data, you have to take a lock on it so that nobody can modify that data while you have that lock. And that kind of locking-based concurrency control has terrible performance, so no system actually uses that in practice nowadays. But if you don’t have that locking then you don’t get the necessary isolation behavior in a system like XA distributed transactions. So maybe what we need is a new protocol for distributed transactions that allows snapshot isolation as the isolation mechanism across different systems. But I don’t think I’ve seen anything that implements that yet.

Vadim: Yes, I hope somebody is working on it.

Martin: Yes, it would be really important. Also in the context of microservices, for example: the way that people promote that you should build microservices is that each microservice has its own storage, its own database, and you don’t have one service directly accessing the database of another service, because that would break the encapsulation of the service. Therefore, each service just manages its own data.

For example, you have a service for managing users, and it has a database for the users, and everyone else who wants to find out something about users has to go through the user service. From the point of view of encapsulation that is nice: you’re hiding details of the database schema from the other services for example.

But from the point of view of consistency across different services — well, you’ve got a huge problem now, because of exactly the thing we were discussing: we might have data in two different services that depends upon each other in some way, and you could easily end up with one service being slightly ahead of or slightly behind the other in terms of timing, and then you could end up with someone who reads across different services, getting inconsistent results. And I don’t think anybody building microservices currently has an answer to that problem.

Vadim: It is somewhat similar to workflows in our society and government, which are inherently asynchronous and there are no guarantees of delivery. You can get your passport number, then you can change it, and you need to prove that you changed it, and that you are the same person.

Martin: Yes, absolutely. As humans we have ways of dealing with this, for example, we might know that oh, sometimes that database is a bit outdated, I’ll just check back tomorrow. And then tomorrow it’s fine. But if it’s software that we’re building, we have to program all that kind of handling into the software. The software can’t think for itself.

Vadim: Definitely, at least not yet. I have another question about the advantages of event sourcing. Event sourcing gives you the ability to stop processing events in case of a bug, and resume consuming events having deployed the fix, so that the system is always consistent. It’s a really strong and useful property, but it might not be acceptable in some cases like banking where you can imagine a system that continues to accept financial transactions, but the balances are stale due to suspended consumers waiting for a bugfix from developers. What might be a workaround in such cases?

Martin: I think it’s a bit unlikely to stop the consumer, deploying the fix and then restart it, because, as you say, the system has got to continue running, you can’t just stop it. I think what is more likely to happen is: if you discover a bug, you let the system continue running, but while it continues running with the buggy code, you produce another version of the code that is fixed, you deploy that fixed version separately and run the two in parallel for a while. In the fixed version of the code you might go back in history and reprocess all of the input events that have happened since the buggy code was deployed, and maybe write the results to a different database. Once you’ve caught up again you’ve got two versions of the database, which are both based on the same event inputs, but one of the two processed events with the buggy code and the other processed the events with the correct code. At that point you can do the switchover, and now everyone who reads the data is going to read the correct version instead of the buggy version, and you can shut down the buggy version. That way you never need to stop the system from running, everything keeps working all the time. And you can take the time to fix the bug, and you can recover from the bug because you can reprocess those input events again.

Vadim: Indeed, it’s a really good option if the storage systems are under your control, and we are not talking about side effects applied to external systems.

Martin: Yes, you’re right, once we send the data to external systems it gets more difficult because you might not be able to easily correct it. But this is again something you find in financial accounting, for example. In a company, you might have quarterly accounts. At the end of the quarter, everything gets frozen, and all of the revenue and profit calculations are based on the numbers for that quarter. But then it can happen that actually, some delayed transaction came in, because somebody forgot to file a receipt in time. The transaction comes in after the calculations for the quarter have been finalized, but it still belongs in that earlier quarter.

What accountants do in this case is that in the next quarter, they produce corrections to the previous quarter’s accounts. And typically those corrections will be a small number, and that’s no problem because it doesn’t change the big picture. But at the same time, everything is still accounted for correctly. At the human level of these accounting systems that has been the case ever since accounting systems were invented, centuries ago. It’s always been the case that some late transactions would come in and change the result for some number that you thought was final, but actually, it wasn’t because the correction could still come in. And so we just build the system with the mechanism to perform such corrections. I think we can learn from accounting systems and apply similar ideas to many other types of data storage systems, and just accept the fact that sometimes they are mostly correct but not 100% correct and the correction might come in later.

Vadim: It’s a different point of view to building systems.

Martin: It is a bit of a new way of thinking, yes. It can be disorienting when you come across it at first. But I don’t think there’s really a way round it, because this impreciseness is inherent in the fact that we do not know the entire state of the world — it is fundamental to the way distributed systems work. We can’t just hide it, we can’t pretend that it doesn’t happen, because that imprecision is necessarily exposed in the way we process the data.