Cassandra is a super fast and scalable database. In the right context, this statement is more or less true. Of course, our context is how we are using this database. And believe me, if you have never used distributed databases before, this would be a completely different experience. Many people argue that Cassandra is actually not that fast when it comes to reads. Well, in many cases that’s true, and in many cases this is only an effect of a wrong understanding of this tool. For sure it is hard to compete with Cassandra when you are comparing write performance, so let’s deep dive into the write path.

There is no point of describing this concept again, so if you are not familiar with Cassandra’s architecture, just check resources below, otherwise skip to the next section.

Why are writes so fast?

What if I told you that when you get the response from Cassandra, that your write operation was successful, your data had been stored only in memory even for 10 seconds (the default setting) before it was flushed to the commit log? Keep in mind, that commit log guarantees durability in case of a node restart or failure. Don’t believe me? Just read the documentation carefully.

With a quite small load for Cassandra, like 10 000 writes/s, you can end up with 100 000 business operations stored, once again, only in memory, just like in Redis :)

Before you start panicking and moving all your data to a good old RDBMS, hold on. Cassandra is a distributed DB. You should never launch it on production with a single node setup. If you have reasonable number of nodes in different racks/availability zones, you are pretty much fine, but still — you can never be 100% sure that you won’t loose any data. This is how it works, nothing comes for free, especially in terms of performance. To be honest, if you think that your RDBMS is safer to use when it comes to durability, you are just lying to yourself. Data durability and replication can be a topic of many other blog posts, but our main focus for today is the Cassandra write path.

Sequential updates

Let’s examine the simplest possible scenario — a few updates for the same primary key, one after another.

How would this be persisted to the SSTable? You can verify it thanks to the sstabledump util. Don’t forget to flush data to disk before using it.

Nothing special, last write wins, everything as expected.

Delayed updates

How would this behave if we update the order status after a week? To simulate this let’s flush after each insert statement. Also, I tweaked a little bit the default compaction strategy to be less aggressive:

ALTER TABLE orders WITH compaction = {‘class’ : ‘SizeTieredCompactionStrategy’, ‘min_threshold’ : 6 };

Now we have 4 SSTables:

and our updates for second order are spread across 3 of them:

In details:

In other words, to provide a result for such query:

Cassandra needs to merge information about the status column from 3 different files. Of course, the more SSTables Cassandra needs to reach to answer a query, the slower the overall query performance is.

Idempotent writes

Idempotent writes are one of my favourite Cassandra features. Why? Long story short, with idempotent writes you can achieve effectively-once delivery semantics and, in a scalable distributed system, this is a crucial attribute.

Are idempotent writes handled differently than any other write operation?

Of course not:

In this case, the exact same information is duplicated across 3 SSTable files. Is it a problem? In most cases not, because write attempts with the same statement occur within a short time range, so they are compensated in memory, before writing to disk. On the other hand, if you need to reprocess some old data you trade this data duplication (at least till next compaction) for writes idempotency and you don’t have to care about duplicated writes.

Tombstones

Cassandra tombstones are pretty famous, you can find a dozen of articles about their issues I won’t go into the details here, but show a short summary of 2 types of tombstones instead.

Cell level

What’s in mc-11-big-Data.db ?

After compacting:

Partition level

After compacting:

Depending on your table schema, you can also observe row level tombstones and range tombstones. I encourage you to check this by trying them on your own.

What is important to remember about tombstones is that instead of removing data, we store even more information, usually in different SSTables and this data is removed (or comes back as a zombie) after gc_grace_seconds (10 days by default).

Summary

The main takeaway after reading this post should be that your read performance problems start with your writes. Although Cassandra uses very fancy mechanisms for optimizing the read path, the rule of thumb is simple: keep your partition on a single SSTable! If that’s not possible, try to minimize the spread ratio. Disk seek is one of the slowest operations while reading. Before you start tuning Cassandra JVM configuration, check different compaction strategies. Validate your configuration and schema design with SSTable utilities. And of course, don’t ignore a good monitoring system that can signal potential problems before they hit a critical level.