This blog is the first part of a series “Spark – the Pragmatic bits”. Get the full overview here.

Typically, one cannot rely on Cassandra alone when performing aggregations, data analysis and the like. We can use it to capture all the data we need, but on its own Cassandra’s capabilities are insufficient to use that data to its full extent. In addition to this, moving large amounts of data in and out of Cassandra is often problematic.

In this blog post I intend to show how adopting Spark alongside Cassandra can help solving these problems. DataStax ships their solution (DataStax Enterprise) with a pre-packaged integrated Cassandra and Spark cluster, but the actual libraries (cassandra-spark-connector) to make this happen are open source. For now, I will not write about the details of deployment of a Cassandra/Spark mixed cluster (this will be covered in a follow up post), but rather focus on the functionality that it can provide.

Spark from ten thousand feet

Spark’s own documentation describes it as “a fast and general-purpose cluster computing system”. Its roots go back to Twitter who used it as their data analytics solution, but it’s been a full-blown Apache project for several years now, currently at version 2.1.0. Spark is written in Scala, but has APIs for Java, Python and R. Having used the Scala and Java interfaces, my experience shows that the Scala API provides a smoother experience over the Java one.

In some ways you can say, that Spark is the spiritual successor of Hadoop MapReduce – it’s younger, faster, nicer to use, and generally just better, but solves the same problem in a similar fashion. However, most importantly, Spark is agnostic to the underlying storage layer – you can use HDFS, the local filesystem, an RDBMS, or in our case, Cassandra.

The core abstraction of Spark is called an RDD: a Resilient Distributed Dataset. There are very few assumptions made about it – it is a set of data that :

Is distributed/partitioned across the Spark cluster (and potentially in the datastore used)

Is replicated across the Spark nodes for fault tolerance

Note that Spark’s definition of an RDD is actually true for Cassandra’s storage model, as partitions in Cassandra:

Are distributed/partitioned across the Cassandra nodes

Are replicated across the Cassandra nodes for fault tolerance

As a result, it is possible to “teach” Spark to understand and leverage Cassandra’s partitioning model – thus resulting in a very efficient combination of storage and computing.

How does it work?

The fundamental idea is quite simple: Spark and Cassandra clusters are deployed to the same set of machines. Cassandra stores the data; Spark worker nodes are co-located with Cassandra and do the data processing.

Spark is a batch-processing system, designed to deal with large amounts of data. When a job arrives, the Spark workers load data into memory, spilling to disk if necessary. The important aspect of this is that there is no network traffic. The Spark worker understands how Cassandra distributes the data and reads only from the local node.

Afterwards the “usual” map-reduce style behaviour takes place: certain operations can and will be executed on local nodes, others require data to be exchanged between the Spark workers (in Spark terminology this is called a “shuffle”).

Finally, the results are written back somewhere – again there are multiple storage options ranging from dumping a set of CSV files to local disk, HDFS, or databases such as Cassandra or an RDBMS.

How do I use it?

In the second half of this post, I’m going to show some examples of how to use Spark and Cassandra together. The use-cases will be written around a scenario of an imaginary bank that captures credit card transactions of its customers and wants to get more out of this data.

Laying the foundations

To begin with, let’s explore what is required from the Cassandra perspective to store the data. To do this on a basic level, we only need a very a simple schema. Let’s start with two tables: one to capture basic customer information and another to log card transactions:

CREATE TABLE cc.cc_customers ( id text PRIMARY KEY, county text, name text ); CREATE TABLE cc.cc_transactions ( customerid text, year int, month int, id timeuuid, amount int, card text, status text, PRIMARY KEY ((customerid, year, month), id) );

Table cc_customers is fairly simple: I’ve added the customer’s home county to allow for some rollup-processing demonstration.

The other table, cc_transactions is the primary store for transaction data. It stores transactions per customer, bucketed by time; so it’s directly suitable to query for transaction history based on time criteria.

In this example a transaction status can be PENDING , COMPLETED , FAILED or REPAID ; any charges will have a negative amount while repayments will appear as positive ones. Also, a single customer can have multiple cards.

Transactions have a unique TimeUUID identifier that is a good clustering key as Cassandra will store them in their natural “wall clock time” order.

This is a decent data model, it scales appropriately, allows capturing transaction information as well as some limited querying ability. Additionally, only a handful of mutations need to be applied to each transaction (to change transaction status), which fits Cassandra’s sweet spot well1.

However it has several shortcomings: it is quite difficult to query the transactions by any other criteria than time. Customers would rightly expect to be able to use our application to access such features, however they are difficult to support. Additionally, it would be hard for the management of our bank to analyse customer behaviour and get aggregated statistics.

Cassandra on its own does not give you the tools to deal with these problems. Let’s see how adding Spark into the mix can address some of these challenges.

Use Case 1: Calculating current account balances

A reasonable request from any customer is to understand what is their current balance on each of their cards. When asked the question: given my customer id and card, how much money do I have? In this case our application would need to answer with a single number, the sum of all completed transactions and repayments on that particular card.

Option 1: Calculate on the fly

One possible way of implementing this requirement would be to read the full transaction history of that particular user and calculate the balance on the fly. It is easy to see that this would not be a good idea – we would need to query a potentially large amount of data from multiple partitions, which could be quite slow. Even worse, the number of these partitions would keep increasing with time, resulting in even slower and slower queries.

The only advantage of this approach is that we would always be guaranteed of getting a correct balance at any given time. The read complexity and latency however, makes it a bad choice.

Option 2: Store the latest account balance

An alternative approach would be to maintain a balance for each card at the time of capturing the card transactions. This might seem a better way, but also has several serious problems:

It would be necessary to read the current balance before every transaction in order to be able to adjust it 2 .

. Given Cassandra does not support ACID transactions, it’s difficult to ensure that the balance is correct: No card transactions are missed None of them are added up twice Concurrent updates don’t corrupt the result

Status changes in transactions need to be dealt with

It impacts write latency which is likely to be important if we want to capture a large amount of transactions

Option 3: Leverage Spark for periodic rollups

The third option is that we periodically (say at the end of every day) calculate the balance based on the transaction log for each card and store it for quick reads. This is effectively Option #1 done for all customers, except that we use Spark to do the aggregations, and then simply store the result in another Cassandra table.

The following table will be used to store the balances:

CREATE TABLE cc.cc_balance ( customerid text, card text, balance int, updated_at timestamp, PRIMARY KEY ((customerid, card)) );

This table definition is rather unsurprising: we store the single aggregated balance for each customer and card, and a timestamp of when this was calculated (we’ll get back to why in a minute).

All we need now is a way to efficiently calculate this value for all customers, which can be done with the following Spark fragment3:

/* 1*/ val includedStatuses = Set("COMPLETED", "REPAID") /* 2*/ val now = new Date(); /* 3*/ sc.cassandraTable("cc", "cc_transactions") /* 4*/ .select("customerid", "amount", "card", "status", "id") /* 5*/ .where("id < minTimeuuid(?)", now) /* 6*/ .filter(includedStatuses contains _.getString("status")) /* 7*/ .keyBy(row => (row.getString("customerid"), row.getString("card"))) /* 8*/ .map { case (key, value) => (key, value.getInt("amount")) } /* 9*/ .reduceByKey(_ + _) /*10*/ .map { case ((customerid, card), balance) => (customerid, card, balance, now) } /*11*/ .saveToCassandra("cc", "cc_balance", SomeColumns("customerid", "card", "balance", "updated_at"))

In this particular case, I’m using Spark’s original RDD API, which is the most flexible way of interacting with Spark. An RDD is a high-level abstraction (in fact not too dissimilar to Java 8 Streams in the way you interact with it). It assumes nothing about the structure of the data besides that it’s distributed across the Spark cluster and we have resilience in the system.

As you can see, I’m writing reasonably simple Scala code to do the processing; which results in quite a lot of freedom in what that processing can actually be.

Let’s take a look line-by-line:

Simple helper variable definition that I’ll use later in the code – declaring the statuses that I want to include into the calculations. PENDING or FAILED transactions will be excluded. Capturing the time when this code started running, we want to store this and only count transactions that happened earlier than this time. Create an RDD based on a Cassandra table cc_transactions of keyspace cc so that we can use it in Spark. This is a method supplied by the cassandra-spark-connector. Project Cassandra columns that we are interested in. A criteria to only include rows that have an ID that was captured before the Spark script started running. This initial where selection criteria will actually be pushed down to Cassandra and executed as a CQL clause. It’s worth pointing out that unlike usual Cassandra querying, there is no specification of a partition key – it’ll be applied to all partitions across the whole cluster. Further filtering the rows we are interested in. Unlike the previous ( where ) operation this is a Spark-level filtering, hence I can use the code-level Set-contains operation. keyBy results in a dataset structured in a way that is of interest to us: tuples in the format of (K, V) where K is a composition of customerid and card . This is the first step towards achieving our goal – to get the balance per customer and card. Note that this operation results in Spark reshuffling the data; which is unavoidable in our case. The Java object that we are operating on at this point is still of type CassandraRow – which is awkward to use directly in calculations. We therefore use map to extract the interesting bit: the amount. Sum up all the amounts per key to get the results. Technical mapping, all the operations so far have led to a nested tuple structure (K, amount) where K key is a tuple of (customerid, card) . This is not something we can directly save back to Cassandra, so just flatten it and add the current timestamp. We write the results back to Cassandra into the cc_balance so that our application can query it directly.

The application serving customers their desired information can now simply query cc_balance table to give the sum of all transactions – as accurate on the given date. If required it’s trivial then to query the last set of transactions (those that are logged later than the batch rollup) and include them in an on-the-fly manner. This would be a single-partition query, therefore efficient and scalable, unlike Option #1.

Spark is reasonably quick to execute batch jobs such as this. While it’s definitely not fast enough to serve direct end-user requests; there would be no problems with running the aggregation several times a day if required.

Use Case 2: Aggregating based on a Customer property

Another use-case might be to try and gain an understanding of how much people in different counties spend per month. In a relational database this would be a trivial operation, but would more than likely require a table join. Cassandra does not support joins, and aggregations against the full dataset are not efficient.

Again, we can use Spark to generate the data we are interested in. For this particular use-case I will use a different API of Spark: SparkSQL to demonstrate.

About Spark’s APIs Spark, since version 1.6, has two main APIs to interact with. The first, and original one is the RDD abstraction that I used previously to calculate account balances. It is a very flexible model that makes little or no assumptions of the data and allows almost arbitrary code to be executed during the data processing. This is great sometimes, but in many cases that level of flexibility is not required and could be traded for a more expressive description of the data processing. The Dataframe / SparkSQL API leverages the fact that many data sources actually used are tabular in nature – data is often organised in tuples (rows) of well-defined value types (columns) addressable by position. A Dataframe is just such a tabular dataset, and it can be underpinned by RDBMS tables, Cassandra structures or simple CSV files. As we have the abstract notion of a table, we can use SQL to query and transform it; which as a language, was designed to do exactly that. Choosing the best way of interacting with Spark and Cassandra can be down to many factors, but the SparkSQL route is compelling since we already have tables and it is easier to get started with due to the familiarity of SQL.

SparkSQL, as the name suggests uses a variation of the SQL language to perform queries across multiple Dataframes. The actual requirement is to sum up all COMPLETED transactions per county, per month. To do this we are going to need to join the cc_transactions table with cc_customers and do a group by / sum operation.

To access Cassandra tables in SparkSQL, we first have to define them in Spark’s Session:

session .read.cassandraFormat("cc_transactions", "cc") .load() .createOrReplaceTempView("spark_cc_transactions") session .read.cassandraFormat("cc_customers", "cc") .load() .createOrReplaceTempView("spark_cc_customers")

These declarations define the two tables in the Spark session with alias spark_cc_transactions and spark_cc_customers based on the respective Cassandra tables4.

Having defined these initial structures, we can now execute the following SparkSQL query:

/*1*/ session.sql( """ |SELECT c.county as county, t.year as year, t.month as month, -SUM(t.amount) as total_amount |FROM spark_cc_customers c |INNER JOIN spark_cc_transactions as t ON c.id = t.customerid |WHERE t.status = 'COMPLETED' |GROUP BY c.county, t.year, t.month """.stripMargin ) /*2*/ .toDF() /*3*/ .write.mode(Overwrite) /*4*/ .cassandraFormat("cc_county_statistics", "cc") /*5*/ .save()

Note that the bulk of the logic is encapsulated in Line #1 of the code – it’s a single SQL query that could just as easily have been executed on an RDBMS. This is actually sufficient to get the answers we were interested in.

The subsequent lines (#2-#5) are there to persist the results back to Cassandra. This did not necessarily need to be in Cassandra; other options are also available. The results could be saved into a relational database, dumped as a CSV or loaded directly into some application that could do visualisation. For simplicity I chose to write it back to Cassandra as it’s already available. The target table is cc_county_statistics , and the query below shows the type of data it would contain after the aggregation:

cqlsh:cc> select * from cc_county_statistics where county = 'London'; county | year | month | total_amount --------+------+-------+-------------- London | 2016 | 10 | 449738 London | 2016 | 11 | 1091426 London | 2016 | 12 | 2229660 London | 2017 | 1 | 224584 London | 2017 | 2 | 452422 London | 2017 | 3 | 682705

Use Case 3: Flagging suspicious transactions

In the third example, I want to show the possibility of joining a Cassandra table with an additional, external data source.

Let’s suppose that our bank periodically receives reports of lost credit cards. We want to identify those transactions that happened after a card was reported lost. The information about the lost cards is supplied to us in a CSV file:

card,status,reported_at 0053-5387-8458-8091,LOST,2017-03-03 0087-6863-7819-7570,LOST,2017-03-16 0106-0321-1773-7621,LOST,2017-03-11

Rather than loading this file into it’s own Cassandra table and querying that, we can instead use the CSV file directly as a Spark data source to find the suspicious transactions.

Spark Dataframes allows us to use a more programmatic (RDD-style) approach, instead of writing SQL. This example demonstrates this particular style of code.

First, let’s define the data sources again in the Spark session:

val ccTransactions = session .read.cassandraFormat("cc_transactions", "cc") .load() val lostCards = session .read.option("header", "true") .csv("/tmp/lost_cards.csv")

Now we can do the actual processing:

/*1*/ ccTransactions /*2*/ .join(lostCards, "card") /*3*/ .filter(row => getUUIDTimestamp(row.getAs("id")) >= getMidnightTimestamp(row.getAs("reported_at"))) /*4*/ .select("customerid", "card", "id", "amount", "reported_at") /*5*/ .write.mode(Overwrite) /*6*/ .cassandraFormat("cc_suspicious", "cc") /*7*/ .save()

The code above should hopefully be straightforward to understand – we join the transactions with the lost cards based on the card’s identifier, and find those transactions that happened after 12:00am on the day they were reported lost. The two helper functions used in Line #3 generate the appropriate comparable Unix epoch timestamps.

Lines #5-#7 write back the results to Cassandra.

Conclusion

With this post I have aimed to show some of the ways you can combine Cassandra and Spark to address some common challenges, which, if you were using Cassandra alone, would more than likely prove very problematic. Cassandra and Spark are technologies that fit together extremely well, both having a similar abstract notion of how data should be distributed and replicated across a set of computers.

Cassandra lacks advanced querying and data processing capabilities, while Spark on its own does not have a persistent data store. While the examples in this post are deliberately simple, adding Spark into the mix allows for much more complex processing.

Combining these technologies can be a great match for your processing needs. You can use Cassandra as a primary data store that real-time applications interact with, whilst at the same time running analytical processing or data rollups over the same dataset using Spark.

Whilst the recommended Cassandra deployment model is generally to physically isolate the workloads into a multi-datacenter deployment, the Spark processing is still able to run over the same set of data – replicated across the analytics side of the cluster and without any need of ETLing it between an OLTP and an OLAP store.

For anybody already using Cassandra, where you have more advanced data processing requirements, we would highly recommend you consider adopting Spark too.

Appendix: Configuring a Spark driver application

So far all the code examples shown assume a SparkContext , or Spark session already exists, and set up to be connected to Cassandra.

There is a little bit more setup work needed in the Spark driver applications to make this actually happen.

First off, in Scala, any driver application needs the following sbt dependencies:

libraryDependencies ++= Seq( "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.0", "org.apache.spark" %% "spark-core" % "2.1.0", "org.apache.spark" %% "spark-sql" % "2.1.0" )

SparkContext is the entry point to using Spark and is relied upon by the RDD API. It needs to be initialised in the following manner:

val conf = new SparkConf() .setAppName("Calculate Balance") val sc = new SparkContext(conf) SparkSQL Sessions perform a similar role for the Dataframe API. They are initialised as follows: val session = SparkSession .builder() .appName("Calculate Average Spending Per County") .getOrCreate()

This is the minimum required code. Additionally, it is mandatory to supply a Spark master URI and settings for Cassandra connection, however, these can be specified at execution time. For this blog post’s examples, I used the spark-submit script.

To use spark-submit , the driver application needs to be packaged up in a jar file and submitted to Spark in the following way:

./spark-submit \ --master 'spark:// :7077' \ --conf spark.cassandra.connection.host=127.0.0.1 \ --class com.opencredo.spark.FindSuspiciousTransactions \ --packages datastax:spark-cassandra-connector:2.0.0-s_2.11 \ / /cassandra-spark-demo_2.11-1.0.jar

Please note the parameters passed in the command line: the Spark master’s URI and the Cassandra host specification. It is also required to specify the spark-cassandra-connector Spark package.

Notes

1: Many mutations on the same partitions results in fragmentation on Cassandra’s SSTable level which impacts read latency:

2: This is a well-known anti-pattern in Cassandra. Because of the lack of transactional guarantees and optimisation for write latency executing a read before mutating data may lead to incorrect results and increases latency

3: I believe that Spark’s Scala API is easier to interact with compared to the Java API, therefore I’m using Scala examples to demonstrate the solution. The same could be written in Java or Python as well.

4: I intentionally chose different names to avoid confusion, but the tables in Spark Session can be named anything – in fact calling them the same as their Cassandra counterparts is quite expressive

References