by Christos Kalantzis

In an article we posted in November 2011, Benchmarking Cassandra Scalability on AWS — Over a million writes per second, we showed how Cassandra (C*) scales linearly as you add more nodes to a cluster.

With the advent of new EC2 instance types, we decided to revisit this test. Unlike the initial post, we were not interested in proving C*’s scalability. Instead, we were looking to quantify the performance these newer instance types provide.

What follows is a detailed description of our new test, as well as the throughput and latency results of those tests.

Node Count, Software Versions & Configuration

The C* Cluster

The Cassandra cluster ran DataStax Enterprise 3.2.5, which incorporates C* 1.2.15.1. The C* cluster had 285 nodes. The instance type used was i2.xlarge. We ran JVM 1.7.40_b43 with the heap set to 12GB. The OS was Ubuntu 12.04 LTS. Data and logs were on the same EXT3 mount point.

You will notice that the previous test used m1.xlarge instances. Although we could have achieved similar write throughput with this less powerful instance type, in Production, for the majority of our clusters, we read more than we write. The choice of i2.xlarge (an SSD-backed instance type) is more realistic and better showcases read throughput and latencies.

The full schema follows:

create keyspace Keyspace1
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {us-east : 3}
  and durable_writes = true;

use Keyspace1;

create column family Standard1
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and populate_io_cache_on_flush = false
  and gc_grace = 864000
  and min_compaction_threshold = 999999
  and max_compaction_threshold = 999999
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and column_metadata = [
    {column_name : 'C4', validation_class : BytesType},
    {column_name : 'C3', validation_class : BytesType},
    {column_name : 'C2', validation_class : BytesType},
    {column_name : 'C0', validation_class : BytesType},
    {column_name : 'C1', validation_class : BytesType}]
  and compression_options = {'sstable_compression' : ''};

You will notice that min_compaction_threshold and max_compaction_threshold were set high. Although we don’t set these parameters to exactly those values in Production, the setting reflects the fact that we prefer to control when compactions take place and to initiate a full compaction on our own schedule.

The Client

The client application used was Cassandra Stress. There were 60 client nodes. The instance type used was r3.xlarge. This instance type has half the cores of the m2.4xlarge instances we used in the previous test. However, the r3.xlarge instances were still able to push the load required to reach the same throughput (while using 40% fewer threads) at almost half the price. The clients ran JVM 1.7.40_b43 on Ubuntu 12.04 LTS.

Network Topology

Netflix deploys Cassandra clusters with a Replication Factor of 3. We also spread our Cassandra rings across 3 Availability Zones. We equate a C* rack to an Amazon Availability Zone (AZ). This way, in the event of an Availability Zone outage, the Cassandra ring still has 2 copies of the data and will continue to serve requests.

In the previous post all clients were launched from the same AZ. This differs from our actual production deployment, where stateless applications are also deployed equally across three zones. Clients in one AZ always attempt to communicate with C* nodes in the same AZ. We call these zone-aware connections. This feature is built into Astyanax, Netflix’s C* Java client library. As a further speed enhancement, Astyanax also inspects the record’s key and sends requests to the nodes that actually serve the token range of the record about to be written or read. Although any C* coordinator can fulfill any request, if the node is not part of the replica set there will be an extra network hop. We call this making token-aware requests.

Since this test uses Cassandra Stress, I do not use token-aware requests. However, through some simple grep and awk-fu, this test is zone-aware. This is more representative of our actual production network topology.
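The zone-aware filtering described above can be sketched as follows. The input format (one "ip availability-zone" pair per line) and the AZ names are illustrative assumptions, not the actual tooling we used:

```python
# Hypothetical sketch: build a zone-aware node list for a stress client.
# Assumes a pre-built list of "ip az" lines (e.g. derived from EC2
# instance metadata); the names below are made up for illustration.

def same_zone_nodes(lines, client_az):
    """Return the IPs of C* nodes in the client's own AZ."""
    nodes = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[1] == client_az:
            nodes.append(parts[0])
    return nodes

lines = ["10.0.1.5 us-east-1a", "10.0.2.7 us-east-1b", "10.0.1.9 us-east-1a"]
# A comma-separated list like this can then be handed to the stress tool
# as its target-node option, so each client only talks to its own AZ.
print(",".join(same_zone_nodes(lines, "us-east-1a")))
```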

Latency & Throughput Measurements

We’ve documented our use of Priam as a sidecar to help with token assignment, backups & restores. Our internal version of Priam adds some extra functionality. We use the Priam sidecar to collect C* JMX telemetry and send it to our Insights platform, Atlas. We will be adding this functionality to the open source version of Priam in the coming weeks.

Below are the JMX properties we collect to measure latency and throughput.

Latency

AVG & 95%ile Coordinator Latencies

Read: StorageProxyMBean.getRecentReadLatencyHistogramMicros() provides an array from which the AVG & 95%ile can be calculated
Write: StorageProxyMBean.getRecentWriteLatencyHistogramMicros() provides an array from which the AVG & 95%ile can be calculated
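As a sketch of that calculation: given a latency histogram expressed as (bucket upper bound in microseconds, count) pairs, the average and 95th percentile can be approximated as below. The bucket boundaries here are illustrative, not C*'s actual EstimatedHistogram offsets, and using each bucket's upper bound is an approximation.

```python
# Approximate AVG & 95%ile from a latency histogram.
# Each entry is (bucket_upper_bound_us, count); boundaries are illustrative.

def histogram_stats(buckets):
    total = sum(count for _, count in buckets)
    # Weight each bucket by its upper bound: a slight overestimate of the mean.
    avg = sum(bound * count for bound, count in buckets) / total
    threshold = 0.95 * total
    running = 0
    for bound, count in buckets:
        running += count
        if running >= threshold:
            # First bucket whose cumulative count covers 95% of samples.
            return avg, bound
    return avg, buckets[-1][0]

buckets = [(1000, 10), (2000, 60), (5000, 25), (10000, 5)]
avg, p95 = histogram_stats(buckets)  # avg in us, p95 as a bucket bound
```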

Throughput

Coordinator Operations Count

Read: StorageProxyMBean.getReadOperations()
Write: StorageProxyMBean.getWriteOperations()
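These counters are cumulative, so throughput is derived by polling them periodically and dividing the delta between two samples by the sampling interval. A minimal sketch (the counts and interval below are made-up numbers, not measurements from this test):

```python
# Cumulative operation counters -> operations per second.
# Poll the counter twice, interval_s seconds apart, and take the delta.

def ops_per_second(prev_count, curr_count, interval_s):
    return (curr_count - prev_count) / interval_s

# Example with hypothetical values: two polls 60 s apart.
print(ops_per_second(1_200_000_000, 1_260_000_000, 60))
```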

The Test

I performed the following 4 tests:

1. A full write test at CL One
2. A full write test at CL Quorum
3. A mixed test of writes and reads at CL One
4. A mixed test of writes and reads at CL Quorum

100% Write

Unlike in the original post, this test shows a sustained rate of over 1 million writes/sec. Not many applications will only write data. However, a possible use of this type of footprint is a telemetry system or the backend of an Internet of Things (IoT) application, with the data then fed into a BI system for analysis.

CL One

Like in the original post, this test runs at CL One. The average coordinator latency is a little over 5 milliseconds, with a 95th percentile of 10 milliseconds.