The Art of Slicing Data

Sharding is a common way of distributing database load. Put simply: if a database holds the data of X users and is getting overloaded (due to its size), it is divided into two databases (two shards), each holding the data of X/2 users. In all probability, the load will then be distributed evenly between the two databases. Note that when the application reaches 2X users, each of these databases runs into the same problem and must be split into two again; in that sense, the scaling is linear. My intention behind this post is to ask: since we are taking the pain to divide a single convenient database into two (or more), is there a sharding strategy that performs better than this linear scaling? Of course, if such a strategy (if any) applied to all types of applications, it would already be standard practice (like caching) and we would not be discussing it here. So, while reading the post, please note the inherent assumptions.
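To make the linear scheme concrete, here is a minimal sketch of hash-based user sharding. The function name and the modulo routing rule are my own illustrative assumptions, not anything prescribed by this post:

```python
def shard_for_user(user_id: int, num_shards: int) -> int:
    """Route a user to one of num_shards databases by hashing the id.

    With 2 shards, users split roughly evenly; when the user base
    doubles, the shard count must double too -- linear scaling.
    """
    return user_id % num_shards

assert shard_for_user(7, 2) == 1   # odd ids land on shard 1
assert shard_for_user(8, 2) == 0   # even ids land on shard 0
```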

As we proceed, we will encounter nothing new, only terminology we are already familiar with: archiving, caching, and so on. The more I think about it, the more the difference between these terms blurs. Statistically, for a typical application, 5% of the data serves 95% of the requests, and the remaining 95% of the data serves only 5%. The reason is simple in the context of web applications: they typically display items in reverse chronological order, so recent data dominates. Given this, let's see if slicing the data by time can do better than linear scaling. Please keep in mind the assumptions under which we are looking for a better solution: if your application does not fit the above pattern, this might not be a good solution for you.

With the above background, let's consider dividing the data into two parts: an Active Shard with the most recent 5% of data (both entities and relationships) and an Archive Shard with the remaining 95%. Any query is expected to get its results from the Active Shard and, only if required, call the Archive Shard for further results. For example, the data for the first page of an application lives in the Active Shard, while subsequent pages live in the Archive Shard. It may sometimes be necessary to query both shards at once and merge the result sets.
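The query flow above can be sketched as follows. The `Shard` class is an in-memory stand-in for a real database shard, and all names and interfaces here are illustrative assumptions, not an API from this post:

```python
class Shard:
    """In-memory stand-in for one database shard (illustrative only)."""

    def __init__(self, rows):
        # rows are sort keys (e.g. timestamps), kept newest-first
        self.rows = sorted(rows, reverse=True)

    def query(self, limit, before=None):
        """Return up to `limit` rows, optionally only those older than `before`."""
        rows = [r for r in self.rows if before is None or r < before]
        return rows[:limit]


def fetch_page(active, archive, limit, before=None):
    """Serve from the Active Shard; top up from the Archive Shard only
    when the Active Shard cannot fill the page, then merge."""
    rows = active.query(limit, before)
    if len(rows) < limit:
        cursor = rows[-1] if rows else before
        rows += archive.query(limit - len(rows), cursor)
    return rows


active = Shard([10, 9, 8])       # recent 5% of data
archive = Shard([7, 6, 5, 4])    # remaining 95%
assert fetch_page(active, archive, 2) == [10, 9]           # Active only
assert fetch_page(active, archive, 5) == [10, 9, 8, 7, 6]  # merged
```

The key property is that the common case (the first page) never touches the Archive Shard at all.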

With the above, we are directing 95% of the queries to a shard that holds 5% (1/20th) of the data, as opposed to linear sharding (1/2 of the data). If you have agreed with me peacefully so far, maybe you are not paying attention! I have assumed that the most recent 5% of the data is exactly the 5% that 95% of the queries will hit. Even though the data is reverse chronologically sorted, that assumption may not hold in every case, so go back and think through whether splitting off the most recent 5% is a good approximation for your application. Maybe the most recent 5% of data per user is a good start! Moving on, an obvious question is: what about the 5% of queries that are fired against a shard holding 95% of the data? Will the 10th page of my application take forever to load?

The answer is that, since there are now far fewer queries (both reads and writes) hitting the large Archive Shard, a different set of database optimizations comes into play, and this becomes a separate but relatively simple design problem. Possible solutions include:

Cache – Put a caching layer (like memcached) in front of the Archive database. Since there are few selects and updates, a relatively long expiry can be set on cache entries. Even a strict-timeout cache (to handle updates) will work well, because updates are rare.
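A minimal sketch of this caching layer, assuming a memcached-style get/set interface. The dict-backed cache, the long TTL, and all names here are illustrative; in production the cache would be an actual memcached or similar:

```python
import time


class TTLCache:
    """Expiring cache in front of the Archive Shard (illustrative sketch).

    A long TTL is acceptable here because archived rows are rarely updated.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expiry_time)

    def get(self, key):
        hit = self.store.get(key)
        if hit and hit[1] > time.time():
            return hit[0]
        return None  # miss or expired

    def set(self, key, value):
        self.store[key] = (value, time.time() + self.ttl)


def cached_archive_query(cache, archive_query, key):
    """Consult the cache first; fall through to the Archive Shard on a miss."""
    value = cache.get(key)
    if value is None:
        value = archive_query(key)  # the expensive hit on the Archive Shard
        cache.set(key, value)
    return value
```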

Database optimization – The two shards should be optimized in different ways. Tune the Active Shard for small sort and read buffers and a higher thread count (for MySQL, look here); tune the Archive Shard for large sort and read buffers but fewer threads. Other features, such as the MySQL Query Cache, offer significant value on the Archive database because updates are rare, so cached query results stay valid for a long time.
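As a rough illustration, the two shards might carry my.cnf fragments along these lines. The variable names are standard MySQL settings (note that the Query Cache only exists in MySQL versions before 8.0), but every value below is an assumption to be tuned against the actual workload:

```ini
# Active shard: many short, concurrent queries
[mysqld]
sort_buffer_size = 256K
read_buffer_size = 128K
max_connections  = 500
query_cache_size = 0      # frequent writes would invalidate it constantly

# Archive shard (separate server/config): fewer, heavier queries
[mysqld]
sort_buffer_size = 8M
read_buffer_size = 2M
max_connections  = 50
query_cache_size = 64M    # stays valid for long since updates are rare
```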

Pre-calculation – Since updates are rare on the Archive Shard, many results can be pre-calculated. For example, a query that sorts a large result set by some parameter and then reads far past the first few items (LIMIT 10000, 100) causes a performance problem; in fact, it is a well-known denial-of-service (DoS) attack. In our design, this data will most likely live in the Archive Shard, and because updates are rare, it can easily be pre-paginated. I am thinking of posting a detailed design of this later. Along the same lines, more efficient methods can be devised to exploit the slow-update characteristics of the Archive Shard.
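One way to pre-paginate, sketched under my own assumptions: precompute the first sort key of every page once, so a deep page becomes an indexed range scan (e.g. `WHERE sort_key >= start_key ORDER BY sort_key LIMIT page_size`) instead of a costly OFFSET walk. The function names are hypothetical:

```python
def build_page_index(sorted_keys, page_size):
    """Map page number -> first sort key on that page.

    Rebuilt rarely, which is affordable because the Archive Shard
    receives few updates.
    """
    return {pos // page_size: sorted_keys[pos]
            for pos in range(0, len(sorted_keys), page_size)}


def page_start_key(index, page):
    """O(1) lookup of where a deep page begins, replacing LIMIT <offset>."""
    return index[page]


keys = list(range(1000))            # stand-in for 1000 sorted rows
index = build_page_index(keys, 100)
assert page_start_key(index, 0) == 0
assert page_start_key(index, 9) == 900   # page 10 found without scanning
```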

The above two-shard illustration can be extended by partitioning the Archive Shard into multiple archive shards, and readers can extend the idea to create their own hybrid sharding strategies. Several optimization directions also evolve from it. For example, if an entity is created in the ith of n time-sliced shards, then all of its relationships exist only from the ith shard to the nth, so any query for that entity's relationships need not be fired at shards 0 to i-1. There are of course pros and cons to this approach versus the traditional one; I will leave that to the readers to analyze in the context of the problem at hand.
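The pruning rule in that example can be stated in a few lines. Shard 0 is taken to be the oldest slice and shard n-1 the newest; these conventions and the function name are illustrative assumptions:

```python
def shards_to_query(created_in: int, num_shards: int) -> list:
    """Shards that can hold relationships of an entity created in shard
    `created_in`, out of `num_shards` time-sliced shards (0 = oldest).

    Relationships can only be created after the entity exists, so shards
    0..created_in-1 are safely skipped.
    """
    return list(range(created_in, num_shards))


# An entity born in shard 3 of 6 fans out to only half the shards:
assert shards_to_query(3, 6) == [3, 4, 5]
```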