To store massive amounts of data in your Azure Cosmos DB database while keeping reads and writes fast, you really must partition your data. Here are a few lessons I learned while splitting my graph collection into partitions.

Number of documents

The number of documents you will store of a certain type has an impact on what your partition key should be.

My current rule of thumb is to keep roughly up to 9,999 documents in a partition. So if I expect somewhere between 1 and 9 million documents of a given type, I will use 1,000 partitions for that document type.
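
The arithmetic behind this rule of thumb can be sketched as a small helper. This is my own illustration, not code from the original; the name `EstimatePartitionCount` is made up:

```csharp
using System;

public static class PartitionMath
{
    // Rough estimate: one partition per ~10,000 expected documents,
    // rounded up so there is always at least one partition.
    public static int EstimatePartitionCount(long expectedDocuments, int docsPerPartition = 10_000)
        => (int)Math.Max(1, (expectedDocuments + docsPerPartition - 1) / docsPerPartition);
}
```

For 9 million expected documents this gives 900 partitions, which I would round up to an even 1,000.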

If the number of documents is low (fewer than 1,000), partitioning will probably not have much impact on performance.

Distribute read/write evenly across partitions

Your logical partitions (the partitions you define with a partition key) are spread evenly across physical partitions. Cosmos DB does this automatically, so it is completely transparent to your application.

Every document with the same partition key belongs to the same logical partition and will therefore be placed in the same physical partition. To ensure high throughput, access should be distributed evenly across your logical partitions.

Identifier

I prefer not to use document ids generated by a database, but rather some real value connected to the domain I'm modelling, such as product numbers, social security numbers, email addresses, registration numbers, etc.

When you use partitioning in Cosmos DB, the document id is no longer the unique identifier on its own. The document id only has to be unique within its partition, so the real identifier is the combination of the document id and the partition key.

The consequence is that, given an id, you should immediately know which partition to look in. Otherwise you would have to run an expensive cross-partition query to look up the id.

My current solution

Since the partition key is defined at the collection level, every single document within that collection must share a common property that is used as the partition key (and yes, a collection can contain multiple document types). Because of this, I find it easiest to make sure that all my document types have a property named "PartitionKey".
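
As an illustration, here are two hypothetical document types that could live in one collection whose partition key path is "/PartitionKey". The class names and extra properties are my own examples; the only essential part is that both types expose the same PartitionKey property:

```csharp
// Both types share the PartitionKey property that the collection partitions on.
public class Product
{
    public string Id { get; set; }
    public string PartitionKey { get; set; }
    public string Name { get; set; }
}

public class Order
{
    public string Id { get; set; }
    public string PartitionKey { get; set; }
    public decimal Total { get; set; }
}
```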

After a lot of trial and error, I decided that a good way to create the partition key value is through a combination of hashing and modulo:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

namespace Demo
{
    public class PartitionKeyGenerator
    {
        private readonly MD5 _md5;

        public PartitionKeyGenerator()
        {
            _md5 = MD5.Create();
        }

        public string Create(string prefix, string id, int numberOfPartitions)
        {
            // Hash the id to get a stable, evenly distributed value.
            var hashedValue = _md5.ComputeHash(Encoding.UTF8.GetBytes(id));
            var asInt = BitConverter.ToInt32(hashedValue, 0);

            // Math.Abs(int.MinValue) would overflow, so nudge that one value.
            asInt = asInt == int.MinValue ? asInt + 1 : asInt;

            return $"{prefix}{Math.Abs(asInt) % numberOfPartitions}";
        }
    }
}
```

When creating documents, I assign my own id to the document and then generate the partition key.

```csharp
var pkg = new PartitionKeyGenerator();
product.Id = "1";
product.PartitionKey = pkg.Create("product", "1", 1000);
```

The result is that every time I add a new product, the product documents get evenly distributed across partitions. In this case I expect to store a couple of million products, so I set the desired number of partitions to 1,000. The partition keys will be values like "product3", "product738", and so on. For other document types I simply assign a different prefix and number of partitions.

When other clients request a product by its id, you can run the same method to determine which partition the document should be in, and then retrieve it directly.
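
As a sketch of that lookup, assuming the older Microsoft.Azure.Documents (DocumentDB) .NET SDK, a Product class as above, and hypothetical database and collection names. The key point is that the partition key is recomputed from the id, so the read targets a single partition:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public class ProductReader
{
    private readonly DocumentClient _client;
    private readonly PartitionKeyGenerator _pkg = new PartitionKeyGenerator();

    public ProductReader(DocumentClient client) => _client = client;

    public async Task<Product> GetProductAsync(string id)
    {
        // Recompute the partition key from the id, exactly as it was
        // computed when the document was written.
        var partitionKey = _pkg.Create("product", id, 1000);

        // Single-partition point read: cheap and fast compared to a
        // cross-partition query.
        var response = await _client.ReadDocumentAsync<Product>(
            UriFactory.CreateDocumentUri("mydb", "mycollection", id),
            new RequestOptions { PartitionKey = new PartitionKey(partitionKey) });

        return response.Document;
    }
}
```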