Open-sourced on GitHub: https://github.com/distributedio/titan

TiKV is a distributed key-value database built by PingCAP. Inspired by Google's Spanner (Google's scalable, globally-distributed database), TiKV is a transactional database with full ACID support. It provides strong consistency and high availability by using Raft to reach consensus among multiple replicas. In TiKV, data is partitioned into regions, and TiKV rebalances the regions when a node joins or leaves the cluster. TiKV is now a member of the CNCF. Another project, TiDB, runs on top of TiKV and supplies a MySQL-compatible layer, making it behave like a MySQL database equipped with full ACID distributed transactions and Raft-based consensus, thus achieving strong consistency and high availability.

Yes, as you may have guessed, TiDB is a NewSQL database.

PingCAP has already solved many of the hard problems; what we do here is tackle the lighter ones and make the TiKV ecosystem better.

Just like TiDB, Titan runs on top of TiKV but supplies a Redis-compatible layer. Titan inherits high availability and strong consistency from TiKV. It fully implements Redis transactions (e.g. WATCH…MULTI…EXEC) and is distributed by design.

Let’s step into the world of Titan.

If you are new to TiKV, don't worry; we will not dive too deep into it. Note that we have been developing Titan following two principles: service stability and Redis compatibility. One thing we do is store some necessary information about each object as metadata, such as its type and encoding. The type lets us efficiently reject commands executed against the wrong type, and the encoding information supports multiple encodings, which can be used to gain performance or reduce disk footprint. We use a garbage collector to recycle removed objects, so you can delete a large list or hash table immediately. We also actively expire keys whose TTL has elapsed, using a strategy similar to GC. To serve multiple applications, we use namespaces to isolate data (known as multi-tenancy), so each user accesses Titan as if it were a standalone Redis.

In this article, I will introduce some details about the design and implementation mentioned above.

Storing the property of a complex data structure as metadata

One great characteristic of Redis is its rich set of data structures: you can use lists to store recommendation feeds, hashes to store relationships, or a sorted set to keep a ranking. We call these complex data structures (everything except strings). In Redis, a complex data structure is described by information such as the type of the object, its encoding format, and the count of its members. We call this information metadata. Redis fetches the metadata before actually manipulating an object to check whether the operation is valid, and updates it (e.g. the length of a list) after a modification to correctly describe the object's current state. In Redis this cost is almost negligible, because the metadata lives in memory. Titan, in contrast, keeps all state including the metadata in TiKV, to avoid data loss if a Titan server crashes, so fetching the metadata is an extra overhead. We tried our best to eliminate the metadata, but finally concluded that it is necessary, mainly for two reasons: one is to describe an object, as in Redis; the other is to decouple the object from its real data, so that the object can be deleted instantly while the real data is deleted later by our GC. This scheme avoids blocking when deleting a large object.

Now let's make this more concrete. Suppose you access an object: its metadata is fetched from TiKV first, then you read or modify the object, and after that all the written data and the updated metadata are committed to TiKV. All these operations are wrapped in one TiKV transaction to avoid data corruption. The overhead of the metadata proved small (about 1 ms) in our benchmark environment.

Mapping a Redis List to flat key-values

TiKV can be thought of as a sorted map. The hardest problem is mapping a Redis list onto TiKV key-values. We use a UUID as an Object ID to uniquely identify each object and store this ID in the metadata. Each list item gets a float64 index, which is concatenated to the Object ID to form its key in TiKV. We then keep two values, left and right, in the metadata, pointing to the head and the tail of the list respectively.

A key of a list item (data key) looks like this:

D:{Object ID}:{Index}

D is a tag indicating that the key stores actual data. We also keep another key, called the meta key, which stores the metadata and looks like this:

M:{Redis Key}

LPUSH and RPUSH (two basic Redis commands) are easy to implement in this design: use "left minus one" for LPUSH and "right plus one" for RPUSH.
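The index arithmetic above can be sketched as follows. Note this is an illustration only: the struct and function names are hypothetical, and the real metadata layout in Titan may differ.

```go
package main

import "fmt"

// listMeta is a hypothetical sketch of the list metadata described above.
type listMeta struct {
	Left  float64 // index of the head item
	Right float64 // index of the tail item
}

// lpushIndex returns the index for a new head item: "left minus one".
func lpushIndex(m *listMeta) float64 {
	m.Left--
	return m.Left
}

// rpushIndex returns the index for a new tail item: "right plus one".
func rpushIndex(m *listMeta) float64 {
	m.Right++
	return m.Right
}

func main() {
	m := &listMeta{Left: 0, Right: 0}
	fmt.Println(lpushIndex(m)) // -1
	fmt.Println(rpushIndex(m)) // 1
}
```

Each pushed item's key is then built by concatenating the Object ID with the new index.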

The hard part is inserting an element; that is why we use float64 rather than int64 as the index. For integers, it is impossible to insert a value between n and n+1. With float64, we compute the median of the pivot's index and its neighbor's index, and use that median as the index of the inserted item, which preserves the order of the list.

There is a catch with this method: a float64 cannot be divided infinitely. As a temporary solution, we return an error when the precision is exhausted; in the future, we can rebalance the indexes around the insertion point to avoid precision loss.
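A minimal sketch of the median computation and the precision check, assuming a simplified overflow test (the real check in Titan may differ):

```go
package main

import (
	"errors"
	"fmt"
)

// insertIndex computes the index for an item inserted between pivot and next:
// the median of the two neighboring indexes, as described above.
func insertIndex(pivot, next float64) (float64, error) {
	mid := pivot + (next-pivot)/2
	// When float64 runs out of precision, the median collapses onto one of
	// its neighbors and the list order can no longer be preserved.
	if mid == pivot || mid == next {
		return 0, errors.New("list index precision overflow")
	}
	return mid, nil
}

func main() {
	idx, _ := insertIndex(1, 2)
	fmt.Println(idx) // 1.5
}
```

Repeatedly inserting at the same position halves the gap each time, which is why the precision eventually runs out.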

As mentioned above, TiKV is a sorted map, so the float64 index is encoded to be memcomparable when building a key.

Using GC to delete a large object

We designed lists to store large numbers of members, and deleting a very large list (e.g. one with a million items) directly online is not acceptable. To avoid blocking, our solution is to delete only the meta key and add the Object ID to a GC list. A GC worker then scans the GC list and deletes the data associated with the Object ID. When a new list is created with the same Redis key, a new Object ID is assigned, so there is no conflict between the new list and the deleted object.

A GC list is also maintained as key-values in TiKV. All the items in the GC list share a common prefix.

A GC item key looks like this:

GC:{data key prefix}

A data key prefix of a list looks like this:

D:{Object ID}

So a real GC item key can be:

GC:D:XXXX

The GC worker scans TiKV with "GC" as the prefix to obtain the data key prefixes. All the data sharing a data key prefix is then deleted in the background.
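The scan-and-delete loop can be sketched with an in-memory sorted map standing in for TiKV (the real worker runs inside TiKV transactions, and the helper names here are assumptions):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// store is an in-memory stand-in for TiKV's sorted map, for illustration only.
type store map[string]string

// scanPrefix returns all keys with the given prefix, in sorted order.
func (s store) scanPrefix(prefix string) []string {
	var keys []string
	for k := range s {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}

// gcOnce performs one GC round: scan the GC list, then delete every data key
// sharing the recorded data key prefix, and finally the GC item itself.
func gcOnce(s store) {
	for _, gcKey := range s.scanPrefix("GC:") {
		dataPrefix := strings.TrimPrefix(gcKey, "GC:") // e.g. "D:{Object ID}"
		for _, dk := range s.scanPrefix(dataPrefix) {
			delete(s, dk)
		}
		delete(s, gcKey)
	}
}

func main() {
	s := store{
		"GC:D:42":  "",     // the deleted list's Object ID is 42
		"D:42:0.0": "a",    // items of the deleted list
		"D:42:1.0": "b",
		"D:7:0.0":  "live", // belongs to another, live object
	}
	gcOnce(s)
	fmt.Println(len(s), s["D:7:0.0"]) // 1 live
}
```

Because the new list gets a fresh Object ID, the GC worker never touches live data.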

Actively expiring keys

Just like the GC strategy, we maintain an expiration list to store the objects that need to be expired.

Adding a key to the expiration list happens in the same transaction as updating the object. For example, the command "set foo bar ex 10" sets the key foo to the value bar with a TTL of 10 seconds. Setting the key-value pair and adding it to the expiration list are committed atomically in one transaction, so there is no inconsistency between the expiration list and the actual data.

The expiration time is encoded as part of a key to ensure the ascending order.

AT:{TS}:{Meta Key}

An expiration worker scans the list, stopping once "TS > now", and deletes the meta key of each expired item. The data key is added to the GC list if the object is a complex data structure.

Multi-tenancy support

Multi-tenancy is a useful feature designed to share one Titan deployment among multiple applications. Every user of Titan experiences it like a standalone Redis. Data is isolated by namespace, which is in fact a common prefix of all of a tenant's keys in TiKV.

To support the Redis select command, we also add a database ID after the namespace. The key of a list object finally looks like this:

{Namespace}:{Database ID}:M:{Redis key}

{Namespace}:{Database ID}:D:{Object ID}

The key problem is how to obtain the namespace without violating the Redis protocol or data schema. We achieve this by encoding the namespace into a token signed with an HMAC algorithm. The client issues the auth command with a token signed by the DBA; Titan then verifies the token and parses the namespace out of it. A default namespace is used when no authentication is supplied (which is permitted in Redis). The SERVER-KEY used to sign tokens is configured in Titan.

A token format looks like this:

{namespace}-{created time}-{token version}-{digest}

The digest (the 4th field above) is calculated over the first three parts using the SERVER-KEY, which prevents the token from being forged.
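A minimal sketch of signing and verifying such a token, assuming HMAC-SHA256 and a hex-encoded digest (the actual hash and field encoding in Titan may differ):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// serverKey stands in for the SERVER-KEY configured in Titan.
var serverKey = []byte("server-key")

// sign builds a "{namespace}-{created time}-{token version}-{digest}" token.
func sign(namespace, created, version string) string {
	payload := strings.Join([]string{namespace, created, version}, "-")
	mac := hmac.New(sha256.New, serverKey)
	mac.Write([]byte(payload))
	return payload + "-" + hex.EncodeToString(mac.Sum(nil))
}

// verify checks the digest and returns the namespace on success.
func verify(token string) (string, bool) {
	i := strings.LastIndex(token, "-")
	if i < 0 {
		return "", false
	}
	payload, digest := token[:i], token[i+1:]
	mac := hmac.New(sha256.New, serverKey)
	mac.Write([]byte(payload))
	if !hmac.Equal([]byte(digest), []byte(hex.EncodeToString(mac.Sum(nil)))) {
		return "", false
	}
	return strings.SplitN(payload, "-", 2)[0], true
}

func main() {
	token := sign("appA", "1546300800", "1")
	ns, ok := verify(token)
	fmt.Println(ns, ok) // appA true
}
```

Using hmac.Equal for the comparison avoids timing side channels when checking the digest.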

Benchmarks

Benchmark environments:

Titan Server: 2 Nodes (CPU: Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, Mem: 32G)

TiKV Cluster: 3 Nodes (CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, Mem: 96G, Disk: 5*480G)

500G data was filled before benchmarking.

We benchmark Titan using fperf. To install fperf with Redis benchmarking support:

go install github.com/shafreeck/fperf/bin/fperf-build

fperf-build github.com/fperf/redis

Run fperf -h to see the help information.

GET and SET perform well, with low latency and high QPS.

LPUSH and LPOP do not perform as well as GET and SET; however, there is still much room for improvement.

Benchmark results differ across environments, so the numbers here cannot be taken as an absolute standard of Titan's performance. They are still informative: for example, you cannot expect Titan to perform like Redis, because the two have completely different designs. Titan stores data on disk and uses Raft to replicate data to avoid data loss, while Redis stores data in memory, which guarantees far higher performance.

Conclusion

This article shares some of the ideas behind the design and implementation of Titan, mainly how to map Redis data structures to flat key-values, how to delete a large object, and so on. Titan is a young project under active development; many documents remain to be written and many features remain to be implemented. We welcome contributions of any form, and feel free to file an issue if you have any suggestions or requests.