Due to the nature of our business, high availability is extremely important to VictorOps and something we take very seriously. We know our customers rely on our service to be always up so that we can process and deliver their alerts and notifications. One of the key components that is critical to the functioning and availability of any SaaS service is the datastore.

At VictorOps we have historically used MySQL in high availability Percona XtraDB Cluster deployments for operational and analytical uses. While MySQL is a mature and reliable relational database and has performed well, we had planned from early on to move to a more horizontally scalable datastore in order to meet our scalability and high availability requirements (including multi-datacenter failover capabilities).

Last fall we began to evaluate datastore alternatives, both relational and NoSQL, that could help improve scalability. After evaluating these options we decided that Cassandra was the best choice to help deliver on these extreme high availability and reliability requirements.

Some of Cassandra’s strengths that influenced this decision include:

High Availability – Cassandra is a distributed database where all nodes are equivalent (i.e. there is no master node, so clients can connect to any available node). Data is replicated to a configurable number of nodes, so that failure of some number of nodes (depending on the replication factor) will not result in loss of data. From the CAP theorem perspective (Consistency, Availability, Partition tolerance), Cassandra’s design provides tunable consistency at the read/write request level, which allows you to increase availability at the expense of consistency where it makes sense.

Scalability – Cassandra has been shown to be linearly scalable. Since each node adds processing power as well as data capacity, it is possible to scale incrementally to very large data volumes and high throughputs by simply adding new nodes to the cluster.

“Self-healing” – Cassandra’s eventually consistent data model and node repair features ensure that the consistency of the cluster will be automatically maintained over time. This also makes it very easy to recover failed nodes, increase or decrease the size of the cluster as needed, and even do in-place version upgrades (in most cases).

Multi-datacenter replication – Cassandra’s node replication and eventual consistency features are core to the functioning of this distributed system. These features were designed from the outset and have been improved and battle tested throughout its lifetime and are now considered highly reliable. These features were therefore easily extended to clusters that contain nodes in different geographical locations, and due to the eventual consistency model this includes support for true Active-Active clusters. In fact, Cassandra has the reputation of having the most robust, reliable multi-datacenter replication of any datastore in the industry. This is an important part of our multi-datacenter failover capability at VictorOps and was one of the major factors in the decision to go with Cassandra.

Large community – Cassandra is an Apache project with a very large, active community including influential companies like Netflix. In addition, DataStax continues to drive development of the Cassandra core as well as operational components (they also provide support subscriptions).
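To make the tunable consistency trade-off from the High Availability point more concrete, here is a minimal sketch of the replica arithmetic behind it. It is not Cassandra code and does not talk to a cluster; the function names are our own, chosen for illustration. It relies on two well-known rules: a quorum is a majority of the replicas, and a read is guaranteed to see the latest acknowledged write only when the number of replicas read plus the number of replicas written exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that make up a majority (a quorum)."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every set of read replicas must overlap every set of
    written replicas, i.e. R + W > RF."""
    return read_replicas + write_replicas > replication_factor


def tolerated_failures(replication_factor: int, required_replicas: int) -> int:
    """How many replicas can be down while a request that needs
    `required_replicas` acknowledgements can still succeed."""
    return replication_factor - required_replicas


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    # Quorum reads and writes: consistent, but only one replica may be down.
    print(q, reads_see_latest_write(rf, q, q), tolerated_failures(rf, q))
    # Single-replica reads and writes: more available, but reads may be stale.
    print(1, reads_see_latest_write(rf, 1, 1), tolerated_failures(rf, 1))
```

With a replication factor of 3, requiring a quorum for both reads and writes keeps reads consistent while tolerating one unavailable replica, whereas requiring a single replica tolerates two unavailable replicas at the cost of possibly stale reads. This is the availability versus consistency choice that can be made per request.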

While Cassandra has many advantages including those described above, it is very different from most other datastores. Cassandra is not a relational database, and while the interface to retrieve data (CQL) is very similar to SQL, the underlying data storage and access model is very different. As a result, the performance and operational characteristics of Cassandra are very dependent on the application data model. Therefore, it is important to understand how data is accessed and to design the data model so that it will perform well on the common queries that the application uses.

One data model on which Cassandra performs particularly well is log-structured (or time series) data. In this type of model, the data represents a series of measurements or events that happen over time, rather than a set of updates to existing data items. Cassandra allows storing these “immutable” events contiguously on disk ordered by a clustering key (which is often insertion time). It is therefore very efficient to return the set of items based on this clustering key, using serial rather than random disk I/O.
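The idea can be illustrated with a small in-memory model. This is only a sketch of the access pattern, not of Cassandra's storage engine: the class and method names are hypothetical, a Python list stands in for the on-disk layout, and payloads are assumed to be strings. Rows that share a partition key are kept sorted by a clustering key (here a timestamp), so a time-range query becomes one contiguous slice rather than a set of scattered lookups.

```python
import bisect
from collections import defaultdict


class TimeSeriesTable:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        # Each partition is a list of (timestamp, payload) tuples in sorted order.
        self._partitions = defaultdict(list)

    def insert(self, partition_key, timestamp, payload):
        # Keep the partition ordered by timestamp (ties ordered by payload).
        bisect.insort(self._partitions[partition_key], (timestamp, payload))

    def slice(self, partition_key, start, end):
        """Return rows with start <= timestamp < end as one contiguous range."""
        rows = self._partitions.get(partition_key, [])
        # (start,) sorts before any (start, payload), so bisect_left finds the
        # first row whose timestamp is >= start; likewise for end.
        lo = bisect.bisect_left(rows, (start,))
        hi = bisect.bisect_left(rows, (end,))
        return rows[lo:hi]


if __name__ == "__main__":
    table = TimeSeriesTable()
    table.insert("sensor-1", 30, "c")
    table.insert("sensor-1", 10, "a")
    table.insert("sensor-1", 20, "b")
    print(table.slice("sensor-1", 10, 30))
```

Because the rows of a partition are already in clustering-key order, the range query only needs to locate its two endpoints and read everything in between, which is the in-memory analogue of serial disk I/O.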

There are many parts of VictorOps’ data model that naturally map to this log-structured approach. For example, an incident’s lifecycle consists of a set of events that cause the state of the incident to change (e.g. a Critical alert, Creation of an Incident, a Paging escalation, an Acknowledgement, a Recovery, etc.). VictorOps surfaces this in the notion of the main Timeline as well as an Incident Timeline.
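As a rough illustration of how an incident's state can be derived from such an append-only event log, consider the sketch below. It does not reflect the actual VictorOps schema or state machine: the event kinds are taken from the examples above, while the field names, state names, and the mapping between them are assumptions made purely for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    incident_id: str
    timestamp: int
    kind: str


# Hypothetical mapping from event kind to the incident state it produces.
# Events not listed here (e.g. a paging escalation) leave the state unchanged.
STATE_AFTER = {
    "incident_created": "triggered",
    "acknowledgement": "acknowledged",
    "recovery": "resolved",
}


def current_state(events):
    """Replay immutable events in time order to compute the latest state."""
    state = None
    for event in sorted(events, key=lambda e: e.timestamp):
        state = STATE_AFTER.get(event.kind, state)
    return state


if __name__ == "__main__":
    timeline = [
        Event("inc-1", 1, "critical_alert"),
        Event("inc-1", 2, "incident_created"),
        Event("inc-1", 3, "paging_escalation"),
        Event("inc-1", 4, "acknowledgement"),
    ]
    print(current_state(timeline))
```

The events themselves are never updated. A new state is expressed by appending another event, and a timeline view is simply the ordered list of events for one incident.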

Obviously the choice of a datastore is an important decision that has a major effect on the scalability, reliability, availability, maintainability and extensibility of a SaaS service. While Cassandra requires more awareness of the underlying data access patterns and the operational characteristics when designing the system, we feel that the benefits it provides in terms of availability, linear scalability and seamless, reliable multi-datacenter replication are a great fit for our business requirements, and will scale to meet our needs in the future.