Discord continues to grow faster than we expected and so does our user-generated content. With more users comes more chat messages. In July, we announced 40 million messages a day, in December we announced 100 million, and as of this blog post we are well past 120 million. We decided early on to store all chat history forever so users can come back at any time and have their data available on any device. This is a lot of data that is ever increasing in velocity, size, and must remain available. How do we do it? Cassandra!

What we were doing

The original version of Discord was built in just under two months in early 2015. Arguably, one of the best databases for iterating quickly is MongoDB. Everything on Discord was stored in a single MongoDB replica set and this was intentional, but we also planned everything for easy migration to a new database (we knew we were not going to use MongoDB sharding because it is complicated to use and not known for stability). This is actually part of our company culture: build quickly to prove out a product feature, but always with a path to a more robust solution.

The messages were stored in a MongoDB collection with a single compound index on channel_id and created_at . Around November 2015, we reached 100 million stored messages and at this time we started to see the expected issues appearing: the data and the index could no longer fit in RAM and latencies started to become unpredictable. It was time to migrate to a database more suited to the task.

Choosing the Right Database

Before choosing a new database, we had to understand our read/write patterns and why we were having problems with our current solution.

It quickly became clear that our reads were extremely random and our read/write ratio was about 50/50.

Voice chat heavy Discord servers send almost no messages. This means they send a message or two every few days. In a year, this kind of server is unlikely to reach 1,000 messages. The problem is that even though this is a small amount of messages it makes it harder to serve this data to the users. Just returning 50 messages to a user can result in many random seeks on disk causing disk cache evictions.

Private text chat heavy Discord servers send a decent number of messages, easily reaching between 100 thousand to 1 million messages a year. The data they are requesting is usually very recent only. The problem is since these servers usually have under 100 members the rate at which this data is requested is low and unlikely to be in disk cache.

Large public Discord servers send a lot of messages. They have thousands of members sending thousands of messages a day and easily rack up millions of messages a year. They almost always are requesting messages sent in the last hour and they are requesting them often. Because of that the data is usually in the disk cache.

We knew that in the coming year we would add even more ways for users to issue random reads: the ability to view your mentions for the last 30 days then jump to that point in history, viewing plus jumping to pinned messages, and full-text search. All of this spells more random reads!!

Next we defined our requirements:

Linear scalability — We do not want to reconsider the solution later or manually re-shard the data.

We do not want to reconsider the solution later or manually re-shard the data. Automatic failover — We love sleeping at night and build Discord to self heal as much as possible.

We love sleeping at night and build Discord to self heal as much as possible. Low maintenance — It should just work once we set it up. We should only have to add more nodes as data grows.

It should just work once we set it up. We should only have to add more nodes as data grows. Proven to work — We love trying out new technology, but not too new.

We love trying out new technology, but not too new. Predictable performance — We have alerts go off when our API’s response time 95th percentile goes above 80ms. We also do not want to have to cache messages in Redis or Memcached.

We have alerts go off when our API’s response time 95th percentile goes above 80ms. We also do not want to have to cache messages in Redis or Memcached. Not a blob store — Writing thousands of messages per second would not work great if we had to constantly deserialize blobs and append to them.

Writing thousands of messages per second would not work great if we had to constantly deserialize blobs and append to them. Open source — We believe in controlling our own destiny and don’t want to depend on a third party company.

Cassandra was the only database that fulfilled all of our requirements. We can just add nodes to scale it and it can tolerate a loss of nodes without any impact on the application. Large companies such as Netflix and Apple have thousands of Cassandra nodes. Related data is stored contiguously on disk providing minimum seeks and easy distribution around the cluster. It’s backed by DataStax, but still open source and community driven.

Having made the choice, we needed to prove that it would actually work.

Data Modeling

The best way to describe Cassandra to a newcomer is that it is a KKV store. The two Ks comprise the primary key. The first K is the partition key and is used to determine which node the data lives on and where it is found on disk. The partition contains multiple rows within it and a row within a partition is identified by the second K, which is the clustering key. The clustering key acts as both a primary key within the partition and how the rows are sorted. You can think of a partition as an ordered dictionary. These properties combined allow for very powerful data modeling.

Remember that messages were indexed in MongoDB using channel_id and created_at ? channel_id became the partition key since all queries operate on a channel, but created_at didn’t make a great clustering key because two messages can have the same creation time. Luckily every ID on Discord is actually a Snowflake (chronologically sortable), so we were able to use them instead. The primary key became (channel_id, message_id) , where the message_id is a Snowflake. This meant that when loading a channel we could tell Cassandra exactly where to range scan for messages.

Here is a simplified schema for our messages table (this omits about 10 columns).

While Cassandra has schemas not unlike a relational database, they are cheap to alter and do not impose any temporary performance impact. We get the best of a blob store and a relational store.

When we started importing existing messages into Cassandra we immediately began to see warnings in the logs telling us that partitions were found over 100MB in size. What gives?! Cassandra advertises that it can support 2GB partitions! Apparently, just because it can be done, it doesn’t mean it should. Large partitions put a lot of GC pressure on Cassandra during compaction, cluster expansion, and more. Having a large partition also means the data in it cannot be distributed around the cluster. It became clear we had to somehow bound the size of partitions because a single Discord channel can exist for years and perpetually grow in size.

We decided to bucket our messages by time. We looked at the largest channels on Discord and determined if we stored about 10 days of messages within a bucket that we could comfortably stay under 100MB. Buckets had to be derivable from the message_id or a timestamp.

Cassandra partition keys can be compounded, so our new primary key became ((channel_id, bucket), message_id) .

To query for recent messages in the channel we generate a bucket range from current time to channel_id (it is also a Snowflake and has to be older than the first message). We then sequentially query partitions until enough messages are collected. The downside of this method is that rarely active Discords will have to query multiple buckets to collect enough messages over time. In practice this has proved to be fine because for active Discords enough messages are usually found in the first partition and they are the majority.

Importing messages into Cassandra went without a hitch and we were ready to try in production.

Dark Launch

Introducing a new system into production is always scary so it’s a good idea to try to test it without impacting users. We setup our code to double read/write to MongoDB and Cassandra.

Immediately after launching we started getting errors in our bug tracker telling us that author_id was null. How can it be null? It is a required field!

Eventual Consistency

Cassandra is an AP database which means it trades strong consistency for availability which is something we wanted. It is an anti-pattern to read-before-write (reads are more expensive) in Cassandra and therefore everything that Cassandra does is essentially an upsert even if you provide only certain columns. You can also write to any node and it will resolve conflicts automatically using “last write wins” semantics on a per column basis. So how did this bite us?