TL;DR

We test Riak 2.0 to get an idea of performance characteristics under different workloads and use cases: cluster expansion, cluster contraction, node failure, memory resident, non-memory resident, rate-limited, and with strong consistency enabled. We’ll be at Ricon later this month. Reach out to @mustwin!

Why Test Distributed Databases?

Many applications today are data-centric. Typically, organizations start with a traditional, vertically scalable database on a single machine, such as PostgreSQL or MySQL. Computers today can do a lot, and vertical scaling has continued to be possible thanks to Moore's Law; these days, if your problem set fits in memory, it's trivial. Unfortunately, single-machine databases are difficult to make available (in the reliability-engineering sense). Component failure is a very real risk in these setups, and the ubiquity of commodity computation, along with common deployments on public and/or shared cloud infrastructure, makes it even more likely. Aside from SPOF issues, vertically scalable databases face very real limitations when the workload exceeds the capacity of one machine. Distributing the work of these systems through sharding is a very difficult application-level problem.

Fortunately, over the past few years newer generations of databases have become available to the FLOSS community that address these concerns. Amazon's Dynamo paper has inspired a number of NoSQL databases such as Riak, Cassandra, and Voldemort; and Google's BigTable paper led to HBase.

It hasn’t been readily discussed how these databases perform on the public cloud, what scalability characteristics they have, and how they perform under failure.

Today, we’re investigating Riak to see if we can get predictable performance out of a commodity NoSQL database.

Riak Overview

Riak is a Dynamo-style store that keeps a fixed number of vnodes in a ring, distributed across a set of physical nodes with a surjective mapping. This lets you scale out, and take advantage of heterogeneous hardware configurations, more easily. It's built using Erlang/OTP, a language and framework primarily designed for building highly resilient, distributed applications. It has pluggable backends, such as LevelDB, Bitcask, and an in-memory store. Basho, the developers of Riak, recently released version 2.0.0, which features strong consistency and significant performance optimizations.

The Benchmark

The Testbed

Basho Bench

We leveraged basho_bench, a tool built by Basho themselves for testing distributed key-value stores. It allows users to specify options such as their key distribution, value sizes, etc. All of our tests used binary representations of integers for keys, and pseudorandom, incompressible data for values. We had to modify basho_bench to run across multiple nodes, because the capacity of one node to generate traffic was smaller than the total serving capacity of our target clusters. In our testing, load generation was spread across three nodes, with one node also acting as a test coordinator. There were 256 virtual clients spread across those 3 nodes. All benchmarks were run with {mode, max}, a worst-case scenario for load generation. On connection termination, the client was configured to immediately reconnect and retry the op. The benchmark used the protocol buffers client, which maintains a long-lived connection to the Riak cluster. To let us reason about an elastic cluster, we added a new driver operation, reconnect: in every one of our tests, each client driver had a 1-in-10,000 chance per operation of reconnecting to a new node.
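The setup above can be sketched as a basho_bench configuration. This is a hypothetical reconstruction, not our actual test file: the reconnect operation is our custom driver addition, the IPs are illustrative, and the weights below are chosen so reconnect gets roughly a 1-in-10,000 share of operations.

```erlang
%% Hypothetical basho_bench config approximating the workload described above.
{mode, max}.                                  %% generate load as fast as possible
{concurrent, 256}.                            %% virtual clients (spread over 3 nodes)
{driver, basho_bench_driver_riakc_pb}.        %% protocol buffers client driver
{riakc_pb_ips, [{10,240,0,10}, {10,240,0,11}, {10,240,0,12}]}.
{key_generator, {int_to_bin_bigendian, {uniform_int, 100000}}}.
{value_generator, {fixed_bin, 1000}}.         %% 1KB incompressible values
%% 90% reads, 10% updates, ~1/10000 reconnects (reconnect is our custom op):
{operations, [{get, 9000}, {update, 1000}, {reconnect, 1}]}.
```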

Google Cloud

Many of our customers primarily deploy their applications in the public cloud, so we decided to test Riak atop Google Compute Engine. Each of our instances was of the type n1-standard-16, deployed in the us-central1-f zone. The instances themselves were running the image 'backports-debian-7-wheezy-v20140904'. All Riak nodes were deployed in the same network. Network load balancing was used to distribute traffic amongst the Riak nodes, with health checks hitting /ping on the HTTP interface at the default healthcheck intervals.
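For readers unfamiliar with GCE network load balancing, the health-check wiring might look something like the following gcloud sketch. The resource names, instance names, and exact flags here are illustrative assumptions, not our actual deployment scripts.

```shell
# Hypothetical sketch of the GCE network LB setup; names are illustrative.
gcloud compute http-health-checks create riak-ping \
    --port 8098 --request-path /ping          # Riak's HTTP /ping endpoint
gcloud compute target-pools create riak-pool \
    --region us-central1 --http-health-check riak-ping
gcloud compute target-pools add-instances riak-pool \
    --instances riak-1,riak-2,riak-3 --instances-zone us-central1-f
gcloud compute forwarding-rules create riak-lb \
    --region us-central1 --ports 8087 --target-pool riak-pool
```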

Riak Tuning

The Riak cluster itself was deployed using the LevelDB backend. LevelDB's throttled, intermediate compactions, tiered storage, and non-memory-resident keyspace are ideal for many production workloads. The cluster had 256 vnodes, as the peak size of the cluster was expected to approach 144 VCPUs. The erlang.distribution_buffer_size (+zdbbl, as Erlang operators know it) was set to 1GB. The disks were mounted with noatime,barrier=0,data=writeback, as Riak relies on multiple nodes for resiliency as opposed to relying upon the reliability of a single instance. The disks also used the deadline scheduler, with a readahead of 0; these two settings were recommended by the Google Compute Engine tuning guide. There were other tuning options available, but we chose to stay as close to an out-of-the-box Riak configuration as possible.
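Expressed as configuration, the tuning above might look like the following. This is a reconstruction under the assumptions stated in the text (the device name is illustrative), not a verbatim copy of our setup.

```
## riak.conf (Riak 2.0)
storage_backend = leveldb
ring_size = 256
erlang.distribution_buffer_size = 1GB

## /etc/fstab mount options for the data disk (illustrative device name):
##   /dev/sdb  /var/lib/riak  ext4  noatime,barrier=0,data=writeback  0 0

## Block-layer settings per the GCE tuning guide:
##   echo deadline > /sys/block/sdb/queue/scheduler
##   echo 0 > /sys/block/sdb/queue/read_ahead_kb
```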

Results

Initial Test

Our initial test was to get a baseline. It was done on 6 nodes in a stable cluster state. Each instance had 50GB of SSD space assigned to it. This storage allocation, and the I/O throughput that comes with it, later proved to be our bottleneck, but it was the upper bound of capacity we could get within our quotas on GCE. The test was done with a value size of 1KB, and the keyspace was 100,000 keys. The load distribution was 90% reads and 10% updates. One prior test had been run to populate the cluster.

This is a memory-resident workload. The median total combined throughput was 25,994 ops/sec. The standard deviation of the median get latency was 176 microseconds, and 485 microseconds for updates. The median 99.9th-percentile update and get latencies stayed below 100 milliseconds for the duration of the test. Although there was a long tail of latency, various latency-leveling techniques could be used to combat these outliers. The total cost of this cluster would be $3,349.16/mo based on the Google pricing calculator.

Cluster Expansion Test

One of the benefits of distributed Dynamo databases is that cluster expansion is relatively easy. Oftentimes, unexpected load on traditional databases leaves customers and engineers in the lurch, with the entire system unavailable. Upgrading and expanding the hardware behind such databases is typically a tenuous, multi-hour offline operation. In this test, we show the effects of taking a fully loaded cluster of 6 nodes and growing it by 3 nodes, 180 seconds after we started the test.
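The expansion itself boils down to Riak's staged clustering commands: stage a join on each new node, then review and commit the plan once. A sketch (node names are illustrative):

```shell
# On each of the 3 new nodes, stage a join to any existing member:
riak-admin cluster join riak@riak-1.internal
# Then, from any node, review and commit the staged changes:
riak-admin cluster plan
riak-admin cluster commit
# Optionally throttle rebalancing to soften the impact on live traffic:
riak-admin transfer-limit 2
```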

As you can see in this test, 5 minutes after cluster expansion was initiated, an increased throughput of 27% was realized. On the other hand, update latency took slightly longer to converge, as that depended on rebalancing the riak_api and protocol buffers coordination load across the rest of the cluster. During the entire rebalancing period, median latency slightly increased, but handoffs can be throttled, and fewer nodes can be added in a single claimant handoff quantum, to make this process have fewer side effects.

Node Failure Test

Node failure is a normal part of life for operators today. This is really where Riak shines. We started the test, and 90 seconds afterwards we killed 1 of 9 nodes by prompting a forced, immediate shutdown of Riak through a pkill. After 420 seconds, an operator-initiated force-remove was issued, and 2100 seconds later, the cluster had converged.
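The failure sequence can be reproduced with a hard kill followed by Riak's staged force-remove. A sketch (the node name is illustrative):

```shell
# Hard-kill the Erlang VM on the victim node (no graceful shutdown):
pkill -9 beam.smp
# 420 seconds later, from a surviving node, force-remove the dead member:
riak-admin cluster force-remove riak@riak-9.internal
riak-admin cluster plan
riak-admin cluster commit
```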

Only one operation actually returned an error; unfortunately, that didn't show up on the graph. The rest of the operations immediately retried and were successful. Realistically, convergence could have taken up to 2 seconds, given that's how long Google's default healthcheck window takes. Node failure is simply a non-event in a properly built and planned Riak cluster.

Cluster Contraction

Many modern applications have bimodal loads. Again, due to the elasticity offered by the modern cloud, organizations have begun to take advantage of the ability to dynamically scale workloads up and down. Although Dynamo doesn't have BigTable-like dynamically partitionable ranges, it has a fixed set of vnodes with a surjective mapping onto the nodes, and these vnodes can be shuffled dynamically as needed. In this test, we removed 2 of 7 nodes from the cluster and examined how the cluster handled the contraction. We initiated node removal as soon as we began the test, and the nodes finished leaving at 2100 seconds.
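Graceful contraction uses the same staged clustering workflow, but with leave, which hands each departing node's vnodes off before shutdown. A sketch (node names are illustrative):

```shell
# Stage a graceful departure for each leaving node:
riak-admin cluster leave riak@riak-6.internal
riak-admin cluster leave riak@riak-7.internal
riak-admin cluster plan
riak-admin cluster commit
# Watch vnode handoff progress until the nodes exit:
riak-admin transfers
```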

This test showed that Riak is dynamically resizable, but there were more errors here than in the prior failure case. We found that those were caused both by issues in Google's load balancer, which continued to direct traffic to nodes that had been gracefully shut down, and by Riak nodes returning errors during the test. We're currently working on tracking down the errors within Riak itself; they didn't present themselves when only one node left the cluster.

Non-memory Resident Workload

Although databases are great for in-memory workloads, occasionally we need to take the hit of using SSDs. This test was run on 6 nodes, with 150 GB of SSD storage each, against a keyspace of 100M keys, using a Pareto access distribution. The reason a Pareto access distribution was chosen is that we postulate that most non-memory-resident workloads will have a working set that's the size of memory or smaller; otherwise, guaranteeing any kind of performance SLA is simply out of the question. Each key was updated with a 2-kilobyte value. The entire keyspace was pre-populated, giving a corpus of 600 GB after replicating with an n_val of 3.
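To make the working-set intuition concrete, here is a small Python sketch of a Pareto-skewed key generator. This is not basho_bench's implementation; the shape and scale constants are our own illustrative choices, picked so that roughly 80% of accesses land on the hottest 20% of the keyspace.

```python
import random

MAX_KEY = 100_000_000  # keyspace size used in this test
SHAPE = 1.16           # Pareto shape; ~1.16 approximates the classic 80/20 skew

def pareto_key(max_key=MAX_KEY, shape=SHAPE):
    """Map a Pareto sample onto [0, max_key); low key numbers are 'hot'."""
    x = random.paretovariate(shape)            # x >= 1, heavy-tailed
    return int((x - 1.0) * max_key / 15.0) % max_key

random.seed(42)
samples = [pareto_key() for _ in range(200_000)]
hot_hits = sum(1 for k in samples if k < MAX_KEY // 5)
fraction = hot_hits / len(samples)
print(f"{fraction:.2%} of accesses hit the hottest 20% of keys")
```

The printed fraction is only a sanity check that the generator concentrates accesses the way the workload description assumes.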

Unfortunately, this test did not prove as fruitful as the others. When LevelDB compaction triggered, it resulted in stalls, because the IOPS and throughput available from Google persistent storage were suboptimal for an OLTP workload. Although the store provides steady read and write throughput, contention occurred within LevelDB itself during periods where it was doing compactions.

Non-memory Resident Workload, with rate limiting

So, after some thought, we decided to rerun the test with the same access pattern, dataset, and key size, but gave ourselves a little more I/O capacity. The previous test showed that I/O starvation led to problems, and although lowering the rate of the test would help, it would only delay the stall. We upgraded drive size from 150GB to 800GB, giving us ~5x more I/O capacity. We also limited the upper rate of queries to 17,900 queries per second, primarily to cap the upper bound of queries hitting Riak and keep it from entering an I/O death spiral.
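In basho_bench, the cap is expressed per worker rather than globally; with 256 concurrent clients, a per-worker rate of 70 ops/sec works out to roughly the 17,900 ops/sec cluster-wide limit above. A hypothetical fragment:

```erlang
%% Replace {mode, max} with a fixed per-worker rate (illustrative numbers):
{mode, {rate, 70}}.   %% 256 workers x 70 ops/sec ~= 17,900 ops/sec total
{concurrent, 256}.
```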

This test was far better in terms of showing the capabilities of Riak. One of the major benefits of the cloud is that you can generate system configurations that would typically be considered outliers, or heterogeneous. A small boost in I/O capacity, plus tuning our test, enabled us to realize the true potential of our cluster.

Strong Consistency

One of the new features of Riak 2.0 is strong consistency. Riak now has an extension called riak_ensemble that effectively enables per-preflist Multi-Paxos. This means you can implement data structures that require coordination atop Riak. Traditionally, Riak has veered away from these because of the complexity and performance implications involved.
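Getting to a strongly consistent bucket takes two steps: enabling the consensus subsystem, then creating a bucket type with consistent=true. A sketch (the bucket-type name is illustrative):

```shell
# In riak.conf on every node, before forming the cluster:
#   strong_consistency = on
# Then create and activate a consistent bucket type:
riak-admin bucket-type create sc '{"props":{"consistent":true}}'
riak-admin bucket-type activate sc
# Verify that the ensembles have formed and elected leaders:
riak-admin ensemble-status
```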

Unfortunately, it appears that Basho was correct: strong consistency does perform worse than eventual consistency in the 1:1 case. I/O was starved during the test due to more information (AAE trees of the strongly consistent data) being logged to disk. Additionally, because requests are serialized by the Multi-Paxos peers, there were negative performance implications. One odd and interesting discovery was that at some point during the test, several leaders were re-elected due to pings timing out. When this re-election occurred, it caused a slight hiccup because certain requests were unserviceable, but it also resulted in the cluster converging to a more performant state.

Strong Consistency, Tuned

The previous test showed strong consistency failing to deliver predictable performance. We decided to attack this by initially upping the vnode count to see if that would help performance, as it would allow greater parallelism. The problem was that this quickly resulted in I/O starvation. To combat this, we added more I/O capacity to our cluster and tuned it, upgrading from 150GB disks to 800GB disks for roughly 5x more I/O capacity:

The throughput increases significantly without much work. Additionally, the latencies become much more predictable in testing. Unfortunately, we suffer from a problem similar to the first test: when there is a re-election, latency spikes due to the leader lease, and the time it takes for the failure detectors to kick in. One interesting point is that if we look at the median fetch latency, it rivals, or is better than, the eventually consistent fetch time. This is due to the nature of the long-lived Multi-Paxos leader: reads take only 2 RTTs (client->riak_api->master), as opposed to the normal 3 RTTs (client->riak_api->coordinator->preflist). This test shows that although strong consistency comes with a cost, it's not as bad as people originally predicted, and with some adjustments, the costs can become quite acceptable.

Strong Consistency With Rate Limiting

One of the problems we noticed in the last test was that during cluster strain, riak_ensemble would go into a re-election cycle, resulting in increased latency and unpredictable performance. Additionally, update latency was incredibly unpredictable. We went ahead and limited the throughput of the testing tool, looked at whether there were any cases in the leader election code where pings were being delayed, and found that latency was much better.

This test confirmed our hypothesis that limiting the load on consistent Riak allows it to have more predictable performance. Given that Basho just released Riak 2.0, with their first version of strong consistency still a non-commercial feature, the expectation is that Basho will work on shoring up performance around this featureset.

Take Aways

We've shown that the performance of Riak heavily depends on the workload, the tuning, and the underlying hardware. The cloud makes tuning relatively simple, and resource allocation matters a lot for specific workloads. There is no single out-of-the-box cluster configuration that fits all workloads; Riak must be tuned for the specific use case. Lastly, if the application works in synergy with Riak to allow for flow control, it enables better performance for applications that are competing for cluster resources.

Riak performs great in light of failures. In our tests of various cluster expansion, contraction, and failure scenarios, the cluster performed nearly perfectly. In a modern datacenter, where components fail every day and the only realistic path forward is to build reliable systems out of unreliable hardware, Riak is ideal.

Lastly, strong consistency does work, and is practical with the right planning. The get latency is actually lower in the ideal case, and comparable to eventually consistent Riak in the steady state. Meanwhile, the update latency is within the realm of a high-speed, transactional database.