Don't Let the Internet Dupe You, Event Sourcing is Hard

published 2019-02-03

I'm going to give it to you straight: event sourcing actually comes with drawbacks. If you've read anything about the topic on the internet this will surely shock you. After all, it's commonly sold as one big fat bag of sunshine and rainbows. You got some kind of a problem? Turns out it's actually solved by event sourcing. In fact, most of your life troubles up till now were probably directly caused by your lack of event sourcing.

You, having been seduced by the internet, are probably off to start your event sourcing journey and begin living the good life. Well, before you do that, I'm here to ruin it for you and tell you that event sourcing is not actually a bag filled with pure joy, but instead a bag filled with mines designed to blow your legs off and leave you to a crippled life filled with pain.

Why would I say such things? Because I'm a guy who previously drank the juice, had the power to make design calls, and took a team down the path of building an event sourced system from scratch. After an aggressive year of deploying a complex application, I've collected a lot of scars, bruises, and lessons learned. Below are my opinions, unexpected hurdles, bad assumptions, and misunderstandings from growing an Event Sourced application.

Preface

To be clear, this is not a "you should never event source" post, or an "event sourcing is the worst thing ever" post. It's just a collection of the unexpected costs and problems that popped up while putting an event sourcing powered system into production. The bulk of these probably fall under "he obviously didn't understand X," or "you should never do Y!" in which case you would be absolutely right. The point of this is that I didn't understand the drawbacks or pain points until I'd gotten past the "toy" stage.

Without further ado...

The core selling point of Event Sourcing is largely an anti-pattern

In my humble opinion, of course

The big Event Sourcing "sell" is the idea that any interested sub-systems can just subscribe to an event stream, happily listen away, and do their work. Y'know, this picture, that you'll find in pretty much any Event Sourcing Intro:

Image via: Microservices with Clojure

In practice, this manages to somehow simultaneously be both extremely coupled and yet excruciatingly opaque. The idea of keeping a central log against which multiple services can subscribe and publish is insane. You wouldn't let two separate services reach directly into each other's data storage when not event sourcing. You'd pump them through a layer of abstraction to avoid breaking every consumer of your service when it needs to change its data. However, with the event log, we pretend this isn't the case. "Reach right on in there and grab those raw data events", we say. They're immutable "facts" after all. And immutable things don't change, right? (*cough* no *cough*)

In effect, the raw event stream subscription setup kills the ability to locally reason about the boundaries of a service. Under "normal" development flows, you operate within the safe, cozy little walls which make up your service. You're free to make choices about implementation and storage and then, when you're ready, deal with how those things get exposed to the outside world. It's one of the core benefits of "services". However, when people are reaching into your data store and reading your events directly, that 'black box' property goes out the window. Coordination can't be bolted on later. You have to talk to the people who will be consuming the events you produce, up front, to ensure that the events include enough data for the consuming system to make a decision.

If you fight through the above obstacle and manage to successfully wire a fleet of services together via an event stream, you'll be rewarded with a new problem: opacity. With multiple systems just reading an event stream sans any coordination layer, how these systems actually work and connect together will eventually be completely baffling. You've basically got all the problems that come with Observer heavy code, but now on the system level. Control becomes inverted in a way that makes it difficult to reason about how data actually flows through the systems, which systems consume or produce which events, which ones care if events are added, removed, or modified, and so on.

Now, to be fair, Greg Young has a talk where he mentions these problems and advocates for solving them via Process Managers or simple Actor based setups, i.e. introducing something which can serve as a central coordination point and route events. However, I didn't see that talk until much later. I went in thinking that ledgers would rule the world, and had to slowly discover the need for this meta management layer by painfully bumping into all the bits that don't work with the Event Sourcing setup as commonly sold.
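
To make the "central coordination point" idea a bit more concrete, here is a heavily reduced sketch in Python. It is my own illustration, not the design from that talk and not what we ran in production. The names (`ProcessManager`, `route`, `dispatch`) and the event types are made up. The only point is that routing lives in one visible place instead of being implied by whoever happens to be subscribed.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    type: str
    payload: dict


@dataclass
class ProcessManager:
    """Hypothetical central router: the one place that says who sees what."""
    routes: dict = field(default_factory=dict)

    def route(self, event_type, handler):
        self.routes.setdefault(event_type, []).append(handler)

    def dispatch(self, event):
        # Events nobody registered for stop here instead of reaching every service.
        handlers = self.routes.get(event.type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


billing_seen = []
manager = ProcessManager()
manager.route("OrderPlaced", billing_seen.append)

delivered = manager.dispatch(Event("OrderPlaced", {"order_id": 1}))
ignored = manager.dispatch(Event("FieldRenamed", {"from": "x", "to": "y"}))
```

Reading the `route` calls tells you how data flows, which is exactly the thing a pile of independent subscribers won't tell you.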

The startup costs are large

Event Sourcing is not a "Move Fast and Break Things" kind of setup when you're building a greenfield application. It's more of a "Let's all Move Slow and Try Not to Die" sort of setup. For one, you're probably going to be building the core components from scratch. Frameworks in this area tend to be heavyweight, overly prescriptive, and inflexible in terms of tech stacks. If you want to get something up and running in your corporate environment with the tech available to you today, rolling your own is the way to go (and a suggested approach!).

While this path is honestly a ton of fun, it's also super time consuming. It will all be time which is not being spent making actual forward progress on your application. Entire sprints will be lost to planning out how you deploy things on the infrastructure available, how to ensure streams behave, messages get processed, how failures will be retried, and then you've got to actually go about implementing it, learning what sucks about your choices, implementing it again with your newly gained knowledge, until you end up with a solid enough foundation upon which you can actually begin to build the application in question.

And once you're into the implementation stage, you'll realize something else: the sheer volume of plumbing code involved is staggering. Instead of your friendly N-tier setup, you've now got classes for commands, command handlers, command validators, events, aggregates, AND THEN your projections, those model classes, their access classes, custom materialization code, and so on. Getting from zero to working baseline requires significant scaffolding. Now, admittedly, how much this hurts is somewhat language dependent, but if you're in an already verbose language like Java (like I was), your fingers will be tired at the end of each day.
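
For a feel of the scaffolding, here is about the smallest end-to-end slice I can sketch: one command, one event, one aggregate, one projection. It's in Python rather than Java to keep it short, every name in it is invented for illustration, and it leaves out persistence, retries, and concurrency entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RenameUser:          # command: a request to change something
    user_id: str
    new_name: str


@dataclass(frozen=True)
class UserRenamed:         # event: the fact that gets written to the ledger
    user_id: str
    new_name: str


def validate(command):     # command validator
    if not command.new_name.strip():
        raise ValueError("name must not be blank")


class UserAggregate:       # aggregate: current state rebuilt from its events
    def __init__(self, events):
        self.name = None
        for event in events:
            self.apply(event)

    def apply(self, event):
        if isinstance(event, UserRenamed):
            self.name = event.new_name


def handle(command, ledger):   # command handler
    validate(command)
    aggregate = UserAggregate(e for e in ledger if e.user_id == command.user_id)
    if aggregate.name == command.new_name:
        return []              # nothing changed, so no event is emitted
    event = UserRenamed(command.user_id, command.new_name)
    ledger.append(event)
    return [event]


def project_names(ledger):     # projection: a read model built from the ledger
    names = {}
    for event in ledger:
        if isinstance(event, UserRenamed):
            names[event.user_id] = event.new_name
    return names


ledger = []
handle(RenameUser("u1", "Bob"), ledger)
handle(RenameUser("u1", "Bob"), ledger)      # duplicate, produces no event
handle(RenameUser("u1", "Robert"), ledger)
names = project_names(ledger)
```

That is six moving parts to rename a user, in a terse language, with an in-memory list standing in for the ledger. Multiply by every operation in your application.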

As a final point on the Getting Started side of things, there's a certain human / political cost involved. Getting an entire development team onboard philosophically is non-trivial. There will be those excited by the idea who read up on it outside work and are down for riding out the growing pains involved in trying alternative development methodologies, and then there will be those who aren't into it at all. However, regardless of which "camp" a person is in, disagreements will still mount as everyone tries to figure out how best to build a maintainable system under a foreign methodology with unclear best practices.

These team problems can additionally creep outside of your immediate development group. Getting tertiary members like UX involved presents its own challenges. Which leads to the unexpected point of...

Event sourcing needs the UI side to play along

This one, while obvious in retrospect, caught me by surprise. If you have a UI, it generally needs to play along with the event driven aspect of the back end. Meaning, it should be task based. However, the bulk of common UI interactions aren't designed that way. They're static and form based. Which means you end up with a massive impedance mismatch between the back-end, which wants small semantic events, and the front-end, which is giving you fat blobs of form data.
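
One way the mismatch tends to show up in code is a translation layer that diffs the submitted form against the previous state and guesses which events happened. The sketch below is a hypothetical illustration of that idea (the field names, event names, and `events_from_form` are all made up), not a recommendation.

```python
def events_from_form(old, new):
    """Diff two form snapshots into (event_name, payload) tuples."""
    # Hypothetical mapping from a form field to the event that describes it.
    field_to_event = {
        "email": "EmailChanged",
        "address": "AddressChanged",
    }
    events = []
    for field_name, event_name in field_to_event.items():
        if old.get(field_name) != new.get(field_name):
            events.append((event_name, {"value": new.get(field_name)}))
    return events


before = {"email": "bob@example.com", "address": "1 Main St", "nickname": "bob"}
after = {"email": "bob@example.com", "address": "2 Oak Ave", "nickname": "bobby"}
events = events_from_form(before, after)
```

Notice what the diff can't tell you: whether the address changed because Bob moved or because he fixed a typo. A task based UI would know. A form blob doesn't. And the `nickname` change is silently dropped because nobody mapped it to an event.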

A common response to this would be the argument that maybe those heavy form driven parts of the application shouldn't be written to a ledger at all. Let CRUD be CRUD. That's an interesting argument, which brings me to...

You'll potentially be building two entirely different systems alongside each other

A super common piece of advice in the ES world is that you don't event source everywhere*. This is all well and good at the conceptual level, but actually figuring out where and when to draw those architectural boundaries through your system is quite tough in practice.

The core reason is that the requirements that likely led you to Event Sourcing in the first place generally don't go away just because some parts of your application are more "CRUD-y". If you still need to audit your data, do you build out a totally different audit strategy for those non-event driven parts, or just reuse the ledger setups you've already deployed and tested? What about communication with other systems? Do you build out new communication channels, or reuse the streaming architecture already in place?

There's no clear answer because no path is ideal. Each one comes with its own pain points and drawbacks.

* ...although this flies in the face of other advice like "only CRUD when you can afford it"

Past system states from the audit log will often have fidelity problems

Unless you're willing to go into crazy person territory.

Software changes, requirements change, focuses shift. Those immutable "facts," along with your ability to process them, won't last as long as you expect.

We made it about a month before a shift in focus caused us to hit our first "oh, so these events are no longer relevant, at all?" situation. Once you hit this point, you've got a decision to make: what to do with the irrelevant / wrong / outdated events.

Do you keep the now deprecated events in the ledger, but "cast" them up to new events (or no-ops) during materialization, or do you rewrite the ledger itself to remove/cast the old events? The best practices in this area are often debated.
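
For the first option, casting during materialization, a minimal sketch might look like the following. The event types, the version numbers, and the v1/v2 shapes are all invented for the example. The stored ledger is never touched; old events are translated (or dropped) every time they're read.

```python
def upcast(event):
    """Translate a stored event into zero or more current-schema events."""
    if event["type"] == "UserRegistered" and event.get("version", 1) == 1:
        # Assumed v1 shape: a single "name" field. Assumed v2 shape: split names.
        first, _, last = event["name"].partition(" ")
        return [{"type": "UserRegistered", "version": 2,
                 "first_name": first, "last_name": last}]
    if event["type"] == "NewsletterToggled":
        return []  # deprecated event type: treated as a no-op
    return [event]


def materialize(ledger):
    # Every stored event passes through upcast before anything consumes it.
    return [current for stored in ledger for current in upcast(stored)]


ledger = [
    {"type": "UserRegistered", "version": 1, "name": "Ada Lovelace"},
    {"type": "NewsletterToggled", "enabled": True},
]
current = materialize(ledger)
```

The catch is already visible: what you replay is no longer what was recorded. The newsletter toggle is still in the ledger, but no state you rebuild will ever reflect it.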

Regardless of which path you take, as soon as you take it, you've lost the ability to accurately produce the state of your system as it was before the rewrite (unless you have the deep character flaws required to do something completely psychotic, of course).

So, the often sold idea of a "100% accurate audit log" and "easy temporal queries!" ends up suffering from a case of "nope" once you get past the conceptual / toy stage and bump into the real world. If you've sold your magical log idea to stakeholders, this fidelity loss over time could pose issues depending on your domain.

The audit log is often too chatty for direct use

This one is obviously very business / use case dependent, but having a full low-level audit log of every action in the application was often more of a hindrance than a help. Meaning, most of it ends up being pure noise that needs to be filtered out, both by end users and by consuming sub-systems. All of those transient "Bob renamed field x to y" events are seldom of interest. If you're showing the audit log to an end user, more often than not, discrete logical states are of far more value than transient intermediates. So, the "free audit log" actually turns into "tedious projection writing." For downstream systems, this chattiness causes similar coordination woes. "When should I actually run?" and "should I care about event X?" were common questions during design meetings. It's all in the class of problems that require either Process Managers or the introduction of queues to solve.

The audit log as a debugging tool considered: overhyped

Minor, but worth pointing out: another touted benefit to being ledger based is that it helps with debugging. "If you find a bug in your application, you can replay the log to see how you got into that state!" I've yet to see this play out. 99% of the time "bad states" were bad events caused by your standard run-of-the-mill human error. No different than any other "how did that get in the database?" style problem. Having a ledger provided little value over your normal debugging intuition when using a standard db setup. Meaning, if an age field was corrupt, you'd probably know which code to start investigating.

Projections are not actually free

"You're no longer bound to a single table structure", says Event Sourcing. If you need a different view of your data, just materialize the event log in a new way. "It's so easy!"

In practice, this is expensive both in terms of initial development cost and ongoing maintenance. That first extra projection you add doubles the amount of code that touches your event stream. And odds are, you'll be writing more than one projection. So now you have N things processing this event stream instead of 1 thing. There's no more DRY from this point forward. If you add, modify, or remove an event type, you're on the hook for spreading knowledge of that change to N different places.
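
Here's the duplication in miniature, using a made-up shopping cart ledger and two made-up projections over it. Both functions have to know every event type that matters to them, so each one carries its own copy of that knowledge.

```python
# Hypothetical ledger of (event_type, data) tuples.
ledger = [
    ("ItemAdded", {"sku": "a", "price": 5}),
    ("ItemAdded", {"sku": "b", "price": 7}),
    ("ItemRemoved", {"sku": "a", "price": 5}),
]


def cart_contents(events):
    """Projection 1: which SKUs are currently in the cart."""
    skus = set()
    for kind, data in events:
        if kind == "ItemAdded":
            skus.add(data["sku"])
        elif kind == "ItemRemoved":
            skus.discard(data["sku"])
    return skus


def cart_total(events):
    """Projection 2: the running total, from the very same events."""
    total = 0
    for kind, data in events:
        if kind == "ItemAdded":
            total += data["price"]
        elif kind == "ItemRemoved":
            total -= data["price"]
    return total


contents = cart_contents(ledger)
total = cart_total(ledger)
```

Introduce something like a discount event and both functions need to change, along with every other projection you've written since. Forget one and it will quietly keep producing the wrong answer.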

You'll deal with materialization lag:

Once your data grows to the point where you can no longer materialize from the ledger in a reasonable amount of time, you'll be forced to offload the reads to your materialized projections. And with this step comes materialization lag and the loss of read-after-write consistency.

Information is now either outdated, missing, or just wrong. Newly created data will 404, deleted items will awkwardly stick around, duplicate items will be returned, you get the gist. Basically all the joys of the eventual part of consistency.

Individually, they're not a huge deal, but these are still things you have to spend time solving. Do you bake in a fall-back strategy for reads? Do you spend time adding smarts to the materialization itself in order to make it faster? Do you write logic to allow the caller to request the type of read they want (i.e. ledger, at the cost of latency, or projected, at the cost of consistency)?
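
The last option, letting the caller pick the kind of read, could be sketched roughly like this. The `Store` class, its `consistent` flag, and the explicit `catch_up` call are assumptions made for the example. In a real system the projection would be updated by some asynchronous process, and the ledger replay would be far more expensive than a loop over a list.

```python
class Store:
    def __init__(self):
        self.ledger = []          # source of truth
        self.projection = {}      # materialized view, updated later
        self.projected_upto = 0   # number of ledger entries already applied

    def append(self, key, value):
        self.ledger.append((key, value))

    def catch_up(self):
        # Stand-in for the asynchronous materializer.
        for key, value in self.ledger[self.projected_upto:]:
            self.projection[key] = value
        self.projected_upto = len(self.ledger)

    def read(self, key, consistent=False):
        if consistent:
            # Replay the whole ledger: always current, cost grows with history.
            state = {}
            for k, v in self.ledger:
                state[k] = v
            return state.get(key)
        return self.projection.get(key)  # fast, possibly stale


store = Store()
store.append("order-1", "created")
stale = store.read("order-1")                    # projection hasn't caught up
fresh = store.read("order-1", consistent=True)   # pays for a replay
store.catch_up()
after_catch_up = store.read("order-1")
```

The first read is the "newly created data will 404" case from above: the write succeeded, but the default read path returns nothing until the projection catches up.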

There are a ton of ways to solve it. But you having to solve it is the key thing I'm getting at here. This is time that needs to be accounted for, planned, implemented, and deployed (all at the expense of the thing you're supposed to be solving!).

Finally: You won't really know the pain points until you're past the toy level.

This is just the reality of maintaining any long-lived software. Regardless of how much you try to prepare, how much background reading you do, or how many prototypes you build, you're doing something totally new. The problems that cause the most pain won't manifest themselves in small test programs. It's only once you have a living, breathing machine, users which depend on you, consumers which you can't break, and all the other real-world complexities that plague software projects that the hard problems in event sourcing will rear their heads. And once you hit them, you're on your own.

So what now?

Event Sourcing isn't all bad. My complaint with it is just that it is wildly oversold as a cure-all, and its negative side effects are rarely talked about. I still really like the ideas from event sourcing. It's just that putting it into practice caused more pain than I would have otherwise liked.

What's the take away here? Should I event source or not!?

I think you can generally answer it with some alone time, deep introspection, and two questions:

For which core problem is event sourcing the solution? Is what you actually want just a plain old queue?

If you can't answer the first question concretely, or the justification involves vague hand-wavy ideas like "auditability", "flexibility," or something about "read separation": Don't. Those are not problems exclusively solved by event sourcing. A good ol' fashioned history table gets you 80% of the value of a ledger with essentially none of the cost. It won't have first class change semantics baked in, but those low-level details are mostly worthless anyway and can ultimately be derived at a later date if so required. Similarly, CQRS doesn't require event sourcing. You can have all the power of different projections without putting the ledger at the heart of your system.
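
In case "history table" is unfamiliar, here's a minimal sketch using SQLite through Python's built-in `sqlite3` module. The table, column, and trigger names are made up, and a real setup would also cover inserts and deletes and record who made the change. The application code stays plain CRUD; the trigger keeps the trail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One row per change, written by the database itself.
CREATE TABLE users_history (
    user_id INTEGER NOT NULL,
    old_name TEXT,
    new_name TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER users_audit AFTER UPDATE ON users
BEGIN
    INSERT INTO users_history (user_id, old_name, new_name)
    VALUES (OLD.id, OLD.name, NEW.name);
END;
""")

# Ordinary CRUD from the application's point of view.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Bob')")
conn.execute("UPDATE users SET name = 'Robert' WHERE id = 1")

history = conn.execute(
    "SELECT user_id, old_name, new_name FROM users_history"
).fetchall()
```

You get "what changed and when" without commands, aggregates, or projections. What you don't get is the why behind the change, which is the "first class change semantics" trade-off mentioned above.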

The latter question is to weed out confused people like myself who thought the Ledgers would rule the world. Look at the interaction points of your systems. If you're going full event sourcing, what events are actually going to be produced? Do those downstream systems care about those intermediate states, or will it just be noise that needs to be filtered out? If the end goal is just decoupled processes which communicate via something, event sourcing is not required. Put a queue between those two bad boys and start enjoying the good life.
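
As a toy version of "put a queue between them", here is a producer and a consumer decoupled by Python's standard `queue.Queue`, with a thread standing in for the second process. In real life this would be whatever message broker you already have. The job names and the `None` sentinel are just conventions chosen for the example.

```python
import queue
import threading

jobs = queue.Queue()
results = []


def consumer():
    while True:
        job = jobs.get()
        if job is None:          # sentinel: the producer is finished
            break
        results.append(job.upper())


worker = threading.Thread(target=consumer)
worker.start()

for name in ["invoice", "receipt"]:
    jobs.put(name)               # the producer knows nothing about the consumer

jobs.put(None)
worker.join()
```

The two sides share nothing but the queue and the shape of a job. No ledger, no replay, no projections, and the downstream system only ever sees the work it actually cares about.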

Edit: 2018-02-05:

This spawned lots of interesting discussion on Hacker News and Reddit. Check it out!