Suppose you were working on a project that started out as simple CRUD on top of a relational database. Its popularity kept growing and scalability became an issue. You therefore decided to rewrite the service as CQRS+ES using Akka Persistence. The project is ready to launch and you just need to migrate the existing data… but how?

I found a few ways to do that, each with different pros and cons. All of them require writing a custom script that scans the legacy database and converts it to events or commands. How you convert entities from the legacy database to events/commands depends on how the entities were stored. Most legacy systems store only the current state of an entity and mutate it in place. In that case you have no choice but to create commands/events from the last known state (e.g. create UserCreated events from rows in the users table).
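
For instance, the users example above might translate into something like the following sketch, where UserRow and UserCreated are hypothetical stand-ins for your actual schema and event model:

```scala
// Hypothetical shapes; adjust to your real legacy schema and event model.
final case class UserRow(id: Long, name: String, email: String)
final case class UserCreated(id: Long, name: String, email: String)

object LegacyConversion {
  // "Last state only" conversion: one legacy users row becomes a single creation event.
  def toEvents(row: UserRow): List[Any] =
    List(UserCreated(row.id, row.name, row.email))
}
```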

Convert legacy data to raw events and save them directly in the event store

It may seem that the simplest way is to convert rows from the legacy database to events and save them directly in the event store. This method, however, has many tricky parts. The biggest problem is that Akka Persistence's event stores contain a lot of technical data; it is not a case of simply populating an events table. Let's take the Cassandra journal plugin, the most popular one, as an example. There are several tables (messages, metadata, tag_scanning, tag_views, tag_write_progress). The columns in the messages table are persistence_id, partition_nr, sequence_nr, timestamp, timebucket, event, event_manifest, message, meta, meta_ser_id, meta_ser_manifest, ser_id, ser_manifest, tags, used, writer_uuid. As you can see, the event is just one of 16 columns you have to populate. You therefore need to understand the journal plugin's internals really well to know what should be stored in which column. This makes the method relatively dangerous: if you make any mistake (and it is very easy to) when populating any of the tables, some of your events might not be recovered, might be recovered out of order, or Persistence Query might stop working. Debugging such issues is really troublesome.
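
To get a sense of how much of that technical data you would have to reproduce by hand, here is a rough sketch of a manual insert into the messages table. The default akka keyspace is assumed, the comments only approximate what the plugin computes internally, and the details differ between plugin versions:

```scala
// Rough sketch only: do not treat this as the plugin's actual schema contract.
object RawJournalInsert {
  val insertMessage: String =
    """INSERT INTO akka.messages (
      |  persistence_id, partition_nr, sequence_nr, timestamp, timebucket,
      |  ser_id, ser_manifest, event_manifest, event, writer_uuid, used)
      |VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, true)""".stripMargin

  // persistence_id -> must match the persistenceId of the future production actor
  // partition_nr   -> derived from sequence_nr and the plugin's partition size
  // sequence_nr    -> strictly increasing per persistence_id, starting at 1
  // timestamp      -> typically a time-based UUID, not a plain timestamp
  // timebucket     -> derived from the timestamp, in the format the plugin expects
  // ser_id / ser_manifest -> identify the serializer your production system uses
  // event          -> the event serialized to bytes with that same serializer
  // writer_uuid    -> writer identifier that the recovery logic relies on
}
```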

PROS

Highest level of decoupling: no need to use actors. It is just a "simple" READ-CONVERT-WRITE flow. The only required dependencies are the legacy database driver and the event store database driver.

CONS

Lots of repetition (in the case of Cassandra)

Lots of technical data

Mistakes can easily lead to fatal consequences

No easy way to switch between different journal implementations

Necessity to understand journal internals really well

Breaking changes in the journal plugin may require changes to the migration script

Manually serializing events and maintaining compatibility with the production deserializer

Convert legacy data to Commands and send them to persistent actors

As you can see, the first method has many flaws. They can, however, all be overcome by using persistent actors directly. Add your project as a dependency of the migration script and instantiate an actor system. Convert the legacy data to commands and simply send them to the actors. Let your actors do the magic of handling commands, creating events and persisting them to the event store. You avoid the headache of digging into journal guts and save yourself from making mistakes. In many projects, however, command handlers perform side effects, which are not desirable during migration. A good example is a withdrawal triggered by a PerformWithdrawCommand: the actor calls an external service which performs the actual change to the user's account balance. You don't want to withdraw your users' money twice, do you? To solve this problem you would have to introduce some kind of isMigration flag to the project, which is a little nasty and also dangerous (what if you forget to check it somewhere?).
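
A sketch of the issue in classic Akka Persistence; PerformWithdrawCommand comes from the example above, while the remaining names and the isMigration constructor flag are purely illustrative:

```scala
import akka.persistence.PersistentActor

final case class PerformWithdrawCommand(amount: BigDecimal)
final case class WithdrawPerformed(amount: BigDecimal)

// Stand-in for the real external dependency.
trait PaymentService { def withdraw(amount: BigDecimal): Unit }

class UserAccount(accountId: String,
                  paymentService: PaymentService,
                  isMigration: Boolean) extends PersistentActor {

  override def persistenceId: String = s"user-account-$accountId"

  private var balance: BigDecimal = 0

  override def receiveCommand: Receive = {
    case PerformWithdrawCommand(amount) =>
      // The side effect that must not run again during migration, hence the flag.
      if (!isMigration) paymentService.withdraw(amount)
      persist(WithdrawPerformed(amount)) { evt => balance -= evt.amount }
  }

  override def receiveRecover: Receive = {
    case WithdrawPerformed(amount) => balance -= amount
  }
}
```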

PROS

All of the problems from the first method are no longer relevant

CONS

Having to handle some commands differently during migration, which means changes to production code.

Convert legacy data to Events and send them to a custom persistent actor with the same persistenceId

Another way is to convert rows to events and send them to custom migration actors that simply persist them:
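
A minimal sketch of such an actor, using classic Akka Persistence (the MigrateEvents/MigrationDone messages and the "User-" + entity name persistenceId scheme are assumptions; the scheme must match whatever your production actors use):

```scala
import akka.actor.Props
import akka.persistence.PersistentActor

// Illustrative messages; the real event types come from your project.
final case class MigrateEvents(events: List[Any])
case object MigrationDone

// Throwaway actor used only by the migration script: it persists the converted
// events and does nothing else.
class MigrationActor extends PersistentActor {

  // Must produce exactly the same id as the production actor for this entity.
  override def persistenceId: String = "User-" + self.path.name

  override def receiveCommand: Receive = {
    case MigrateEvents(events) =>
      persistAll(events) { _ => () }
      deferAsync(()) { _ => sender() ! MigrationDone } // ack once all events are written
  }

  // This actor only writes; there is no state to rebuild on recovery.
  override def receiveRecover: Receive = { case _ => }
}

object MigrationActor {
  val props: Props = Props(new MigrationActor)
}
```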

It is very important that the persistenceId and cluster sharding implementations correspond to those of your normal production actors. Only then will the events be recovered by the correct actors when you run the production system.
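
Wiring the migration actors into Cluster Sharding might then look roughly like this; the type name, the MigrationEnvelope wrapper and the extractor functions are illustrative, and in practice you would reuse the exact id extraction logic your production sharding uses:

```scala
import akka.actor.ActorSystem
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

object MigrationApp extends App {

  // Illustrative envelope carrying the entity id together with the events to persist.
  final case class MigrationEnvelope(entityId: String, payload: MigrateEvents)

  // Must map messages to the same entity ids (and thus persistenceIds) as production.
  val extractEntityId: ShardRegion.ExtractEntityId = {
    case MigrationEnvelope(id, payload) => (id, payload)
  }
  val extractShardId: ShardRegion.ExtractShardId = {
    case MigrationEnvelope(id, _) => (math.abs(id.hashCode) % 100).toString
  }

  val system = ActorSystem("migration")

  val migrationRegion = ClusterSharding(system).start(
    "User",                           // same type name as the production region
    MigrationActor.props,
    ClusterShardingSettings(system),
    extractEntityId,
    extractShardId
  )

  // For each legacy row:
  // migrationRegion ! MigrationEnvelope(row.id.toString, MigrateEvents(LegacyConversion.toEvents(row)))
}
```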

PROS

The problems from the first and second methods are gone

CONS

Having to write a custom actor and a cluster sharding implementation

Summary

As you can see, migrating data to Akka Persistence is not a trivial process. Each method has different trade-offs, but I think that in most cases the third option is the safest and easiest. Do you know another way of migrating? Share it; we would love to hear about it.