Distributed transactions are hard and expensive. If you wonder how to handle them pragmatically in a mid-size project, this article is for you. We will discuss how to use the Sagas pattern to run a distributed transaction from Elixir, with examples that leverage the Sage package. As a bonus, you will see how to use Sagas to organize your domain contexts.

What problem are we trying to solve?

Most projects I've built are integrated with external systems. That is what modern development looks like: you implement your domain logic and outsource the rest to SaaS services. Nobody likes to reinvent the wheel.

Good examples of those services are payment processors and CRMs. Microservices are another. One can argue that every time you perform more than one state change and those changes are not covered by a single ACID database, you run a distributed transaction. And we are not talking only about large projects that are distributed because of scale, or the ones built for research; this applies to pretty much any small or mid-size project that uses Stripe (or anything else) to outsource its billing system.

Sage itself was built while I was integrating Stripe into one of our projects. With Stripe you create a customer first and then create a subscription for that customer. But when we fail to create the subscription, we should not keep that customer in Stripe; we need to delete it to get rid of the side effect.

Booking website example

To make the problem more approachable, imagine we are building a trip booking website and charge the customer only once, when the request is fulfilled.

Here is the happy-case code that leverages Elixir's with expression syntax:

Happy-case code for our trip booking website
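As a rough sketch of what such happy-case code could look like: the module name BookingSketch and the functions create_booking/1 and charge_card/2 are hypothetical stubs invented for illustration, standing in for real API clients.

```elixir
defmodule BookingSketch do
  # Hypothetical stubs; a real project would call external services here.
  def create_booking(attrs), do: {:ok, %{id: "booking-1", price: attrs.price}}
  def charge_card(_card, amount), do: {:ok, %{id: "charge-1", amount: amount}}

  # Each clause must match {:ok, _}; the first non-matching value
  # is returned as-is, because there is no else block.
  def book_trip(attrs) do
    with {:ok, booking} <- create_booking(attrs),
         {:ok, charge} <- charge_card(attrs.card, booking.price) do
      {:ok, %{booking: booking, charge: charge}}
    end
  end
end
```

Note that nothing here cancels the booking when the charge fails, which is exactly the gap the following sections deal with.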

Another requirement is that we should not hold any bookings if we fail to charge the card; otherwise it would be bad business for us, because we would still pay for those bookings. So our code should be extended to handle that failure:

Example where we use named stages to catch where we got an error

Here we used a simple trick: we wrapped the charge call in a tuple that holds the stage name and the list of side effects we must take care of if that stage fails.

So when the charge fails, the booking is cancelled. But what would it look like if we wanted to book a car, a flight and a hotel within the same trip? If one of the bookings fails, we need to cancel the others, and if we fail to charge the card, we should cancel all of them:
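A minimal sketch of the named-stage trick, under the same assumptions as before: create_booking/1, charge_card/2 and cancel_booking/1 are hypothetical stubs, and the card value "declined" is just a made-up way to force a failure.

```elixir
defmodule NamedStageSketch do
  # Hypothetical stubs standing in for real API clients.
  def create_booking(attrs), do: {:ok, %{id: "booking-1", price: attrs.price}}
  def charge_card("declined", _amount), do: {:error, :card_declined}
  def charge_card(_card, amount), do: {:ok, %{id: "charge-1", amount: amount}}

  def cancel_booking(booking) do
    # Message to self only makes the cancellation observable in this sketch.
    send(self(), {:cancelled, booking.id})
    :ok
  end

  def book_trip(attrs) do
    with {:ok, booking} <- create_booking(attrs),
         # Wrap the call in {stage_name, side_effects, result}.
         {:charge, _side_effects, {:ok, charge}} <-
           {:charge, [booking], charge_card(attrs.card, booking.price)} do
      {:ok, %{booking: booking, charge: charge}}
    else
      {:charge, side_effects, {:error, reason}} ->
        Enum.each(side_effects, &cancel_booking/1)
        {:error, reason}

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```

The else block can tell which stage failed from the tuple's first element and knows what to undo from the second.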

Now we are creating multiple bookings within the same trip
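A sketch of how the multi-booking version might look. The stubs book/2, charge_card/1 and cancel/1 are hypothetical, and the fail_at key is an invented switch for simulating a failure at a given stage.

```elixir
defmodule MultiBookingSketch do
  # Hypothetical stubs: a stage fails when attrs.fail_at names it.
  def book(kind, %{fail_at: kind}), do: {:error, :unavailable}
  def book(kind, _attrs), do: {:ok, %{kind: kind, id: "#{kind}-1"}}

  def charge_card(%{fail_at: :charge}), do: {:error, :card_declined}
  def charge_card(_attrs), do: {:ok, %{id: "charge-1"}}

  def cancel(booking) do
    # Message to self only makes the cancellation observable in this sketch.
    send(self(), {:cancelled, booking.kind})
    :ok
  end

  def book_trip(attrs) do
    # Every stage lists, by hand, the side effects created before it.
    with {:car, _, {:ok, car}} <- {:car, [], book(:car, attrs)},
         {:flight, _, {:ok, flight}} <- {:flight, [car], book(:flight, attrs)},
         {:hotel, _, {:ok, hotel}} <- {:hotel, [flight, car], book(:hotel, attrs)},
         {:charge, _, {:ok, charge}} <-
           {:charge, [hotel, flight, car], charge_card(attrs)} do
      {:ok, %{bookings: [car, flight, hotel], charge: charge}}
    else
      {stage, side_effects, {:error, reason}} ->
        Enum.each(side_effects, &cancel/1)
        {:error, stage, reason}
    end
  end
end
```

Each new stage means one more hand-maintained list of things to undo, and forgetting to update one of them goes unnoticed until a failure happens.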

So we get more named stages and we collect side effects manually, which makes error handling large and error-prone. And this example doesn't even handle scenarios where we made a successful booking request but did not receive the response (e.g. because of a timeout), so we hold the booking without knowing about it.

To handle this edge case we can't treat the bookings as a single stage; we must split them and duplicate the error handling:

A common question here is how we can delete something if we got a timeout. One of the most common ways is to search for the entity using the data we have and then delete it. This gets simpler when the service you are using allows searching on metadata: we can generate a transaction ID and attach it to every entity we create, which makes the lookup easier.

This code looks too complex, doesn't it? Now imagine how much bigger it would be if we wanted to explicitly release the authorization so that customers don't have to wait for a timeout to get their money back.

I believe the code gets worse because we are dealing with a distributed transaction in an ad-hoc fashion. And those are not all the downsides; let's name a few:
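To make the metadata approach concrete, here is a sketch against an in-memory fake. FakeBookingService is entirely hypothetical (it is not any real API), the lose_response flag is an invented way to simulate a request that succeeds remotely while the response is lost, and the transaction ID format is an assumption; a real system would more likely use a UUID.

```elixir
defmodule FakeBookingService do
  # Hypothetical in-memory stand-in for a remote service that stores metadata.
  def start_link, do: Agent.start_link(fn -> [] end, name: __MODULE__)

  # The booking is always stored; with lose_response: true the caller
  # only sees a timeout, as if the response never arrived.
  def create(attrs, metadata) do
    booking = %{attrs: attrs, metadata: metadata}
    Agent.update(__MODULE__, fn bookings -> [booking | bookings] end)
    if Map.get(attrs, :lose_response, false), do: {:error, :timeout}, else: {:ok, booking}
  end

  def search(metadata) do
    Agent.get(__MODULE__, fn bookings ->
      Enum.filter(bookings, fn booking -> booking.metadata == metadata end)
    end)
  end

  def delete(booking) do
    Agent.update(__MODULE__, fn bookings -> List.delete(bookings, booking) end)
  end

  def all, do: Agent.get(__MODULE__, fn bookings -> bookings end)
end

defmodule TimeoutCleanupSketch do
  def book_with_cleanup(attrs) do
    # Assumption: unique only within this runtime; good enough for a sketch.
    metadata = %{transaction_id: "txn-#{System.unique_integer([:positive])}"}

    case FakeBookingService.create(attrs, metadata) do
      {:ok, booking} ->
        {:ok, booking}

      {:error, :timeout} ->
        # We don't know whether the booking exists, so look it up by our own ID.
        metadata
        |> FakeBookingService.search()
        |> Enum.each(&FakeBookingService.delete/1)

        {:error, :timeout}
    end
  end
end
```

The point of the generated ID is that the cleanup does not depend on anything returned by the failed request.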

Duplication makes the code error-prone; we may refactor it, but it would still be easy to update the logic in one place and forget about the others;

We can't book concurrently, which is bad for our latency;

The syntax to track the step on which a failure occurred is ugly;

To cover this code with tests you would need a lot of stubs that inject errors based on attribute values.

One of the solutions would be to use two-phase commits, but they don't scale: in the best case O(2n) messages are sent (and up to O(n²) with retries), and they hurt availability because of the locks involved. More importantly, the vast majority of services simply don't support them.

In my opinion, a good tool should not only address those issues but also take an additional step forward by giving you a new mental model. And if it makes the code better organized, even better.

What is Saga?

Saga is a very simple failure management pattern that originates from a 1987 paper on long-running transactions for databases. Its original use case was implementing long-lived transactions without locking the database. Those transactions, as the name suggests, can take a while because they go through a large dataset, and they can make the system unavailable because of the various locks that need to be placed, which is usually not desirable.

A long lived transaction is a Saga if it can be written as a sequence of transactions that can be interleaved with other transactions. The database management system guarantees that either all the transactions in a Saga are successfully completed or compensating transactions are run to amend a partial execution.

What does that mean in practice? A saga is a distributed transaction that takes care of the overall consistency of a collection of steps, each of which internally performs an atomic transaction. Each step consists of a subtransaction and a compensation that amends its effects.

Compensations semantically undo the transaction's effects. For example, if you sent a confirmation email you cannot “unsend” it; instead, you can send a follow-up email apologizing for the error.
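To make the idea tangible, here is a deliberately tiny, hand-rolled saga runner. It is not the Sage API; MiniSaga and its step format {name, transaction, compensation} are assumptions made for this sketch, and it ignores retries, concurrency and failures inside compensations.

```elixir
defmodule MiniSaga do
  # A step is {name, transaction, compensation} where
  #   transaction  :: (effects_so_far -> {:ok, effect} | {:error, reason})
  #   compensation :: (effect -> any)
  def execute(steps), do: run(steps, [])

  # All steps succeeded: return the effects keyed by step name.
  defp run([], done), do: {:ok, effects_map(done)}

  defp run([{name, transaction, compensation} | rest], done) do
    case transaction.(effects_map(done)) do
      {:ok, effect} ->
        run(rest, [{name, effect, compensation} | done])

      {:error, reason} ->
        # `done` is newest-first, so compensations run in reverse order.
        Enum.each(done, fn {_name, effect, compensate} -> compensate.(effect) end)
        {:error, name, reason}
    end
  end

  defp effects_map(done) do
    Map.new(done, fn {name, effect, _compensation} -> {name, effect} end)
  end
end
```

With steps such as a booking, a confirmation email and a charge, a failed charge would first run the email step's compensation (say, sending an apology) and then cancel the booking, and execute/1 would return an error tuple naming the charge step.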

Getting back to our booking website, here is a visualization showing how it would work with Sagas: