Hello hello. You know what, there’ve been times when I thought of data, data structures and databases as the most boring aspect of software development. Those were the days when I was even more foolish than I am now. Doesn’t mean that I am no longer a fool though. The thing is, although I’ve never been a huge expert on the subject of data, I’ve eventually spent enough time working on (fairly) data-intensive distributed systems to realise how important an understanding of data structures and data stores really is. One thing before we start: when I talk about a “system” or an “application” in this blog post, I mean a service-oriented distributed system. I also somehow assume that most of this system is hosted in a cloud. Sorta.

Mighty wise Kiwi is going to fix this messy storage

Prologue — Why bother?

So let me start with where I was not so long ago. I was an ignorant prick, believe it or not (better believe). Even though most of the time I worked on systems dedicated to storing, transforming and interpreting information, I was still focused on algorithms and saw data as a mere result of processing the user’s input. Then, thanks to some people around me, I learned that some problems I used to solve with my code can be avoided altogether by applying good data design. However, that wasn’t it. I dug deeper, and after a few conference talks, tech articles, YouTube videos and heated Slack conversations I realised that data is more than a collection of bytes sitting in the DB thanks to my smart algorithms. What’s important to understand about data is that it’s a traveller: it has a home that it leaves and a destination that it’s going to. Neither of those is your data storage.

This is my attempt to reach out to those people who are just like the old me and who still believe that a single SQL database is up to any sort of task.

Chapter 1 — Origins of data

So, now where does data come from, asks no-one. I’ll answer. Most often it comes from the oldest and most complex database of all time (sort of): the Human Brain. I consider the brain to be a part of a distributed system. Think about it: our brain is a complicated, yet imperfect, computer that we use every day of our lives. It’s only when we need to do something with the data we have in our brain, and/or access data from other people’s brains, that we start dealing with applications interfacing with digital data stores. After leaving someone’s brain, data doesn’t appear in our system immediately. It travels through several nodes, like our system’s UI (CLI, website, client etc.), routers and caches, before reaching the distributed application’s back-end. Throughout this journey, the data is transformed from one form into another to fit every system temporarily storing it on its way. All of these data stores (someone’s brain, the client machine’s disk and RAM, routers’ caches) exist somewhere out there. However, since we don’t have much control over them, we tend to forget that they do. They are out of our control, hence not our problem whatsoever. The tables turn when the data finally arrives in our domain.

Chapter 2 — Handling that data

Data handling is important and so on and so forth

Now that we have full control of and responsibility for some data, the first question we should ask ourselves is: what do we do with it? By the way, I am assuming we need to store it for future use, otherwise what’s the point of this blog post? Handling the data seems pretty straightforward though. We just take it, enrich or truncate it if required, and permanently(ish) store it in our system. A fun fact: although this piece of information originated inside a human’s brain and then, through a series of transformations, reached our server, it is the permanent data store where the data lands first that we will consider the ultimate source of truth. For us it is the initial state of things, the axiom.

We need to store this ultimate truth efficiently. First, we need to write fast so we can come back with an acknowledgement of the write before the client’s network connection gets cut off. Second, we need to capture the data in its entirety so we can share pieces of it with other parts of our system (microservices). That means the store we put it into must scale well and stay performant even when a huge amount of data is stored. Third, ideally, we need the ability to capture and advertise changes to the information stored in this database in near real-time.

All of this makes NoSQL data stores such as DynamoDB or MongoDB very good options. Both of them allow insanely fast reads and writes, scale well and provide access to data changes in near real-time through a change stream. Another good option might be a stream processing platform, such as Apache Kafka, which I also consider to be a data store. I’ll come back to this topic a bit later, okay? That, in my opinion, concludes the data handling part. We receive it, and we store it. Now it is time to decide what we do with our newly obtained data from now on.
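To make the three requirements a bit more tangible, here is a tiny sketch of a store that acknowledges a write, keeps the whole record and advertises every change to whoever subscribed. It is an in-memory toy, not the DynamoDB or MongoDB API: the `SourceOfTruth` and `Change` names and the customer fields are all made up for illustration.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Change(NamedTuple):
    """One captured write: its position in the log, the key and the stored value."""
    sequence: int
    key: str
    value: Dict[str, Any]


class SourceOfTruth:
    """Hypothetical in-memory stand-in for a store that exposes a change stream."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}
        self._log: List[Change] = []
        self._subscribers: List[Callable[[Change], None]] = []

    def subscribe(self, handler: Callable[[Change], None]) -> None:
        """Register a callback that is invoked for every future write."""
        self._subscribers.append(handler)

    def put(self, key: str, value: Dict[str, Any]) -> int:
        """Store the whole record, log the change and return its sequence number as the ack."""
        change = Change(sequence=len(self._log), key=key, value=dict(value))
        self._records[key] = change.value
        self._log.append(change)
        # A real change stream delivers asynchronously; calling handlers inline keeps the sketch short.
        for handler in self._subscribers:
            handler(change)
        return change.sequence

    def get(self, key: str) -> Dict[str, Any]:
        """Return a copy of the latest record stored under the key."""
        return dict(self._records[key])


store = SourceOfTruth()
received: List[Change] = []
store.subscribe(received.append)  # e.g. another microservice listening for changes

ack = store.put("customer-1", {"name": "Ada", "city": "Auckland"})
print(ack, received[0].key, store.get("customer-1")["city"])
```

The point of the sketch is the shape, not the implementation: the writer only waits for the ack, while everyone else learns about the new record from the change feed instead of querying the store directly.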

Chapter 3 — Sharing is caring

Normally we collect data for a reason. Our system is a factory, processing raw data into something of business value. The point is that we want to put the information we’ve got to good use, and for that we need to make it accessible to the various subsystems making up our distributed application (I mostly mean microservices of course). Those microservices that do not directly receive the original data going into the source of truth may still be interested in getting some bits of it, so they can incorporate it into their domain’s data (e.g. add a customer’s address to a list of locations where customers live), or build their own interpretation of it (e.g. the same person will look different from the sales and accounting perspectives). Different microservices have different requirements for data availability. To better satisfy these requirements you need to choose wisely which data store to use with which microservice. For example, a search microservice will surely benefit from using a search index such as Elasticsearch, whereas an accounting microservice is crying out for a ledger database, and so on. Using one type of database for everything surely sounds convenient and may actually make sense at very early stages. However, while it saves you some time in the beginning, it will put heavy technical debt on your shoulders later on. The migration cost may exceed the initial implementation cost, so you need to choose the right moment to evolve your monolithic database into something more agile. There’s no perfect solution out there, duh, but this is kinda yesterday’s news.
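The “same person, different perspectives” idea can be sketched in a few lines. Everything here is hypothetical (the record, its field names and the three functions); it only shows how each microservice might keep just the slice of the original record that matters to its own domain.

```python
from typing import Any, Dict, List

# Hypothetical customer record as it sits in the source of truth.
customer: Dict[str, Any] = {
    "customer_id": "customer-1",
    "name": "Ada",
    "address": "1 Queen Street, Auckland",
    "preferred_contact": "email",
    "tax_number": "NZ-000-000",
}


def sales_view(record: Dict[str, Any]) -> Dict[str, Any]:
    """Sales cares about who the person is and how to reach them."""
    return {key: record[key] for key in ("customer_id", "name", "preferred_contact")}


def accounting_view(record: Dict[str, Any]) -> Dict[str, Any]:
    """Accounting cares about who to bill and under which tax number."""
    return {key: record[key] for key in ("customer_id", "name", "address", "tax_number")}


def add_location(locations: List[str], record: Dict[str, Any]) -> List[str]:
    """A location service only adds the address to its own list of places."""
    return locations + [record["address"]]


print(sales_view(customer))
print(accounting_view(customer))
print(add_location([], customer))
```

Each of these views could then live in whatever store suits its service best, which is the whole argument for not forcing them into one database.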

Chapter 3 — The source of truth. Retrospective.

One thing about the “source of truth” storage system is how fundamental it is to your system’s operability. The data that you’ve originally captured and saved defines your system’s state. There’s no way you can recover it if it is gone (provided it is lost for good along with all the backups), therefore you need to make sure you capture it early and you capture it well. The same doesn’t apply to data derived from that original data. In the unlikely case of losing these “tier two” data stores completely, you can (ideally) still rebuild them using the “source of truth”.

This fact makes data streams an appealing candidate for “source of truth” storage. The best part of data streams such as Kafka or Kinesis is the ability to replay your events, so you can restore another microservice’s store from any backup and then bring it up to the most recent state by replaying the latest events. Yes, you still need backups.

When it comes to event retention, some may argue that you should never delete events. Not only does this sound a little bit controversial (although not unreasonable), it may also be impossible with Kinesis, for example. Kinesis caps the message retention period (at seven days at the time of writing). However, since this is “source of truth” data, we want to keep it within our system for the system’s lifetime. That is doable if we save every message coming into the stream in a separate database, so we still have access to it even after the original message has been deleted from the stream. Note that in this scenario we may consider the “source of truth” to be just a vessel for the original data that was sent to us from the outside world. We only need to read from it when this data is required by another microservice, either to create records for a new database or to restore an existing one after a crash.
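Here is a rough sketch of the replay idea, assuming a made-up `Event` shape with an offset, an account and an amount. The `archive` list stands in for the separate database that keeps every message after the stream has dropped it; none of this is the Kafka or Kinesis client API.

```python
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    """Hypothetical event: its position in the stream, an account and a balance change."""
    offset: int
    account: str
    delta: int


def replay(state: Dict[str, int], events: List[Event], after_offset: int = -1) -> Dict[str, int]:
    """Apply every event newer than after_offset on top of a copy of the given state."""
    restored = dict(state)
    for event in events:
        if event.offset > after_offset:
            restored[event.account] = restored.get(event.account, 0) + event.delta
    return restored


# Every message ever received, kept outside the stream so retention limits do not matter.
archive = [
    Event(0, "alice", 100),
    Event(1, "bob", 50),
    Event(2, "alice", -30),
    Event(3, "bob", 20),
]

# A backup of a "tier two" store, taken right after the event with offset 1 was applied.
backup_state = {"alice": 100, "bob": 50}
backup_offset = 1

from_backup = replay(backup_state, archive, after_offset=backup_offset)
from_scratch = replay({}, archive)
print(from_backup == from_scratch, from_backup)
```

Restoring from the backup and rebuilding from nothing end up in the same state; the backup just saves you from replaying the whole history.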

Chapter 4 — The never-ending journey

Information that has once left someone’s brain never stops travelling. At the end of the day, your application accepts, stores and processes data only to create even more data. This journey will never end. Information will flow like a stream branching off into smaller streams, creating lakes and glaciers, and at some point merging into a sea within the global ocean of information. This may sound overly poetic, but I couldn’t help this thought. The other thought I cannot help is that the data making up this information stream takes just too many forms and serves too many purposes. It therefore comes with so many different requirements that it would be too naive to think there could be one perfect solution for all our storage needs.

Epilogue

I was honest with all of you at the beginning of this blog post when I said that I’ve never been a huge expert. I recommend that you read and listen to experts to broaden your understanding of when, and why, and where, and what data storage to use. The must-read on this subject is Designing Data-Intensive Applications by Martin Kleppmann. Amazon has some docs describing good practices in DynamoDB. Rick Houlihan from AWS also gives awesome talks on DynamoDB design patterns at re:Invent, like this one from 2018. I am sure there are some awesome articles on the internet covering alternatives from the competing clouds.

I also encourage you to learn more about streams. The Software Engineering Daily podcast has a number of episodes where they talk about Kafka with experts. There is one episode called Kafka Data Pipelines that I found particularly interesting. Its main topic is somewhat close to what I am trying to communicate to you this time. However, some proper learning materials will help too.

Last but not least: Elasticsearch. Oh, there’s so much information you can find about it on the internet. I dunno what to recommend apart from their official guide. Elasticsearch is a very powerful technology, so if you are still unsure whether or not your project can benefit from it, I suggest trying a good course on Udemy or Pluralsight.

That’s it. I have spoken. Thank you, and till the next time. XOXO. Bye 👋.