In our original single-node Postgres architecture, the emails table lived alongside all the other tables in the same database. Syncers constantly fetched emails from Gmail and Exchange web servers and inserted them into the primary Postgres database. These writes were streamed to the read replica, from which our consumers read the emails. The consumers modified these emails to pull out the relevant information that makes most of Affinity’s features possible, and these modifications were written back to the primary database. Given the large volume of emails ingested by Affinity, our reads and writes were dominated by those to the emails table, and growing fast:

Emails processed by Affinity over time

To give some perspective, in the architecture described above, the emails table comprised approximately 75% of our entire database size on disk, while the next biggest table comprised a mere 8%.

Why shard?

Some consequences of such a read-write pattern were:

- The cache hit rate for the emails table was 79% on the read replica and slowly decreasing over time. As more users joined the platform, the table and its indexes grew, and less and less of the frequently accessed data was able to fit in memory.
- Not surprisingly, the decreasing hit rate was accompanied by increasing database swap usage.
- Both of the above led to degraded performance for queries not just to the emails table but also to other tables, since the other tables and their indexes were fighting for space in the same cache as the emails table.
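For context, a per-table cache hit rate like the 79% above can be derived from Postgres’s `pg_statio_user_tables` statistics view, which tracks buffer-cache hits and misses per table. A minimal sketch (the counter values below are illustrative, chosen only to reproduce the ratio from the text):

```python
# The ratio of heap_blks_hit to total block reads is the table's buffer-cache
# hit rate. This SQL is standard Postgres; 'emails' is the table in question.
HIT_RATE_SQL = """
SELECT heap_blks_hit::float / NULLIF(heap_blks_hit + heap_blks_read, 0)
FROM pg_statio_user_tables
WHERE relname = 'emails';
"""

def cache_hit_rate(heap_blks_hit: int, heap_blks_read: int) -> float:
    """Fraction of block reads served from shared buffers (cache)."""
    total = heap_blks_hit + heap_blks_read
    return heap_blks_hit / total if total else 0.0

# Illustrative counters that yield the ~79% hit rate mentioned above.
print(cache_hit_rate(790_000, 210_000))  # -> 0.79
```

Watching this ratio (and swap usage) over time is what surfaced the trend described above.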

For a while, we combated these problems by vertically scaling to bigger and better Postgres RDS instances. But we were soon approaching the upper limit on RDS Postgres memory size, not to mention the increasingly prohibitive cost of continuing to scale vertically. Hence, we considered scaling horizontally instead.

We began by sharding only our emails table and leaving the other tables in the Postgres RDS setup (shown above). The emails table was the best first candidate because:

- As described above, it dominated the read-write pattern and suffered the lowest cache hit rate of all our tables.
- It was involved in few joins with other tables, so moving it to a different datastore meant rewriting only a few queries.
- By migrating only our biggest table instead of all the tables, we would get most of the wins we were looking for while keeping a short feedback loop to encounter pitfalls and get familiar with the new datastore. Equipped with that experience, we can migrate the rest of our database over to the new datastore in the future.

Why Citus?

We chose Citus as the new datastore, for the following reasons:

- Citus uses Postgres under the hood and is simply an extension that plans and distributes queries. Consequently, our developers can continue to work with the same database engine and query language, allowing us to make use of and continue to build on our in-house Postgres expertise.
- Citus has great documentation and tooling to help us migrate our data from Postgres with no downtime (namely, a feature called Warp, which takes advantage of the logical decoding feature in Postgres 9.4 and up).
- Citus abstracts out the sharding so that application developers writing queries to the Citus database (mostly) don’t have to think about it. The coordinator node in the Citus cluster rewrites the queries and fetches data from the appropriate shards before aggregating and returning the results back to the client.
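The routing the coordinator performs can be pictured as hashing the distribution column of each row and mapping the hash into a fixed set of shards. This is a conceptual sketch only, not Citus’s actual hash function or shard metadata (Citus uses Postgres’s internal hash operators; the shard count and md5 stand-in here are assumptions for illustration):

```python
import hashlib

SHARD_COUNT = 32  # illustrative; the real shard count is a cluster setting

def shard_for(distribution_key: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a distribution-column value to a shard index.

    Mimics (does not reproduce) hash distribution: hash the key,
    then bucket the hash space into shard_count ranges.
    """
    h = int(hashlib.md5(str(distribution_key).encode()).hexdigest(), 16)
    return h % shard_count

# All rows with the same distribution key land on the same shard, which is
# why single-key queries can be routed to exactly one worker.
assert shard_for("user_42") == shard_for("user_42")
```

This same-key-same-shard property is what lets the coordinator turn a query filtered on the distribution column into a single-shard query, rather than fanning out to every worker.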

Results