Shiny and new

Materialised views, a new storage engine and more – Apache Cassandra project chair Jonathan Ellis takes us on a tour of the highlights in the latest Cassandra release. Cassandra’s user base has grown from a crowd of early adopters into the market majority.

The Apache Cassandra community has grown dramatically over my five years as project chair. This adoption has taken us from a crowd of early-adopting visionaries into the pragmatic market majority. Cassandra has needed to evolve and grow to meet the demands of this more conservative audience.

This evolution has culminated in the launch of Cassandra 3.0, the most significant release in the project’s history. Here, I go through what is new in this version and why those changes have been made.

New storage engine

One of the biggest elements of Cassandra 3.0 is the new storage engine. Previously, the storage engine grouped data into key/value cells within a partition, so if you had a partition containing 500 rows of a table with 4 columns, Cassandra would store 2 000 cells with no explicit row structure. This made inefficient use of bytes-on-disk as well as adding the cost of turning cells back into rows repeatedly for each query. This also imposed a lot of internal complexity as Cassandra increasingly needed to reason about data in groups of rows, affecting everything from simple features like LIMIT to complex ones like materialised views.

The new engine in Cassandra 3.0 is still cell based – which allows Cassandra to efficiently store sparse rows and updates – but it now stores the row structure along with the cells. We also now store recurring information like column names once per partition, rather than repeating it for each cell, giving much of the benefit of compression without the performance hit.
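To make the arithmetic concrete, here is a toy Python model of the 500-row, 4-column partition mentioned above. It counts the cells and compares how many bytes go to column names when every cell repeats its name and when names are written once per partition. It is only an illustration: the column names and helper functions are hypothetical, and it makes no attempt to reproduce Cassandra's real on-disk format.

```python
from typing import Sequence

# Toy model only: this is NOT Cassandra's actual on-disk layout.
# The column names are hypothetical, chosen purely for illustration.
COLUMNS = ["title", "album", "artist", "year"]
ROWS = 500


def cell_count(rows: int, columns: Sequence[str]) -> int:
    """Each (row, column) pair is stored as one cell."""
    return rows * len(columns)


def name_bytes_per_cell(rows: int, columns: Sequence[str]) -> int:
    """Model of the old layout: every cell carries a copy of its column name."""
    return rows * sum(len(name.encode("utf-8")) for name in columns)


def name_bytes_per_partition(columns: Sequence[str]) -> int:
    """Model of the new layout: each column name is written once per partition."""
    return sum(len(name.encode("utf-8")) for name in columns)


print(cell_count(ROWS, COLUMNS))            # 2000
print(name_bytes_per_cell(ROWS, COLUMNS))   # 10000
print(name_bytes_per_partition(COLUMNS))    # 20
```

Even in this simplified model, the bytes spent on column names shrink by a factor equal to the number of rows in the partition.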

At a high level, this change reduces the volume of storage required to store a given dataset, and sets the stage for further performance improvements now that less data needs to be read and processed from disk. It also allows building increasingly sophisticated CQL features, like 3.0’s materialised views.

Materialised views in Cassandra 3.0

A key way that Cassandra provides superior performance is by recognising that in a clustered database, stored across many machines, you need to avoid doing joins: a join would pull data from many machines in the cluster and incur a big hit to performance. Instead, Cassandra emphasises denormalisation. This involves writing an extra copy of the data, repartitioned by your query pattern, which allows Cassandra to answer your queries from a single partition on a single machine.

Since Cassandra is much faster at writes than traditional relational databases, it’s totally fine to “spend” writes to optimise reads. However, before Cassandra 3.0, this denormalisation needed to be done manually. For example, consider a table of songs:

CREATE TABLE songs ( id uuid PRIMARY KEY, title text, album text, artist text );

If we wanted to query these songs – which are partitioned by id – by the album column, we could use an index. However, that means doing a scatter/gather operation across all nodes in the cluster, which can limit performance.

In Cassandra 3.0, we can forgo this and instead repartition our songs by album with a materialised view:

CREATE MATERIALIZED VIEW songs_by_album AS SELECT * FROM songs WHERE album IS NOT NULL PRIMARY KEY (album, id);

This repartitions the songs by album, so all the songs in a given album will be in the same partition of our new materialised view, which can be queried just like a table:

SELECT * FROM songs_by_album WHERE album = 'Tres Hombres';

For developers, making use of materialised views dramatically reduces the effort required to denormalise your data to match your application’s needs.
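The idea behind songs_by_album can be sketched in plain Python. The example below is an in-memory analogy, not a description of how Cassandra maintains views internally: two dictionaries stand in for the base table and the view, and the helper names (insert_song, find_by_album_scan, find_by_album_view) are hypothetical. It shows the trade being made, which is one extra write per song in exchange for a read that touches a single album partition instead of examining every row.

```python
from collections import defaultdict
from typing import Optional
from uuid import UUID, uuid4

# In-memory stand-ins (hypothetical, for illustration only):
# `songs` plays the base table keyed by id; `songs_by_album` plays the view,
# keyed by album and then by id, mirroring PRIMARY KEY (album, id).
songs: dict = {}
songs_by_album: defaultdict = defaultdict(dict)


def insert_song(song_id: UUID, title: str, album: Optional[str], artist: str) -> None:
    """Write the base row and keep the album-keyed copy in step with it."""
    previous = songs.get(song_id)
    if previous is not None and previous["album"] is not None:
        # The row already existed: drop its stale entry from the view.
        songs_by_album[previous["album"]].pop(song_id, None)
    row = {"id": song_id, "title": title, "album": album, "artist": artist}
    songs[song_id] = row
    if album is not None:  # mirrors WHERE album IS NOT NULL
        songs_by_album[album][song_id] = row


def find_by_album_scan(album: str) -> list:
    """Without the view: examine every row (a stand-in for scatter/gather)."""
    return [row for row in songs.values() if row["album"] == album]


def find_by_album_view(album: str) -> list:
    """With the view: read one album partition directly."""
    return list(songs_by_album.get(album, {}).values())


insert_song(uuid4(), "La Grange", "Tres Hombres", "ZZ Top")
insert_song(uuid4(), "Tush", "Fandango!", "ZZ Top")
print([row["title"] for row in find_by_album_view("Tres Hombres")])  # ['La Grange']
```

Before 3.0, the application itself had to perform the second write in insert_song; with a materialised view, that bookkeeping moves into the database.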

Other additions to Cassandra

While these were the headline additions for version 3.0, there have been other important improvements and additions to Cassandra as well. These include JSON support, which makes it easier to work with data and to integrate with microservice architectures. As an example, the INSERT statement now accepts a JSON variant. Suppose we have a table defined like this:

CREATE TABLE users ( id text PRIMARY KEY, age int, state text );

With native CQL we would insert a row like this:

INSERT INTO users (id, age, state) VALUES ('user123', 42, 'TX');

The JSON version looks like this:

INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';

The JSON payload is simply a CQL string literal containing a JSON-encoded map, where keys are column names and values are column values. This means that drivers don’t need to do anything special to support INSERT JSON. Native JSON support in Cassandra is a great fit for modern microservice architectures.
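Because the payload is an ordinary string literal, an application that already holds its data as a map can produce the statement with nothing more than a JSON encoder. The following Python sketch uses only the standard library. The helper name insert_json_statement is hypothetical, and it assumes the table name comes from trusted code rather than user input; in a real application you would more likely pass the JSON string to your driver as a bound parameter than build the statement text by hand.

```python
import json


def insert_json_statement(table: str, row: dict) -> str:
    """Build an INSERT ... JSON statement for `row` (hypothetical helper).

    `table` is assumed to be a trusted identifier; it is not validated here.
    """
    payload = json.dumps(row)
    # CQL string literals are single-quoted; a literal ' is escaped as ''.
    escaped = payload.replace("'", "''")
    return f"INSERT INTO {table} JSON '{escaped}';"


statement = insert_json_statement("users", {"id": "user123", "age": 42, "state": "TX"})
print(statement)
# INSERT INTO users JSON '{"id": "user123", "age": 42, "state": "TX"}';
```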

Cassandra has also added support for user-defined functions to allow simple calculations and aggregation to be performed server-side. We have also optimised the storage of hinted handoff data to improve performance when cluster members go down temporarily.
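As a rough picture of what a server-side aggregate computes, the Python sketch below models an average as a state function applied to each value plus a final function applied once at the end, which is the general shape a user-defined aggregate takes. It is a client-side analogy only: the function names and sample values are hypothetical, and none of it is Cassandra API.

```python
from functools import reduce


def avg_state(state, value):
    """State function: fold one value into the running (count, total) pair."""
    count, total = state
    if value is None:  # skip missing values
        return state
    return (count + 1, total + value)


def avg_final(state):
    """Final function: turn the accumulated state into the result."""
    count, total = state
    return total / count if count else None


def aggregate(values, state_fn, final_fn, initial):
    """Apply state_fn to every value in turn, then final_fn once."""
    return final_fn(reduce(state_fn, values, initial))


ages = [42, 35, None, 28]  # made-up sample values
print(aggregate(ages, avg_state, avg_final, (0, 0)))  # 35.0
```

Running this logic on the server means only the final result crosses the network, rather than every row.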

Other areas that have received optimisation attention are the commitlog (where compression is now available and enabled by default) and network message coalescing, which improves performance for virtually all clusters but especially for those in “noisy” network environments like Amazon EC2.

To assist in enterprise deployments, Cassandra has added role-based access control, which offers simpler security management for large teams.

Last but not least, Cassandra now offers full support for Microsoft Windows alongside Linux. For developers working on Microsoft platforms – whether on traditional operating systems or in the Azure cloud – this opens up the same potential for their applications that other developers already have.

You can download Apache Cassandra 3.0 here. If you’re new to Cassandra, you can check out a quick demo on Planet Cassandra first.