Monday, August 23, 2010 at 7:03PM

Cassandra is one of the hottest of the NoSQL databases. From a production DBAs perspective it’s not hard to see why: while some of the other NoSQLs offer more programming bells and whistles for the developer, Cassandra is built from the ground up for total and transparency redundancy and scalability, close to the heart of every DBA.

However, Cassandra involves some complex data modelling concepts – mainly around the notorious SuperColumn concept, and I don’t think I’ll ever understand it fully until I’ve played directly with some data. To that end, I thought I’d start by trying to model some familiar Oracle sample schemas in Cassandra.

Toad for Cloud Databases is releasing support for Cassandra early next month (eg September 2010), so I’ve been using that – as well as Java of course – to try to get some initial data loaded.

For other NoSQL databases, Toad for Cloud lets us create NoSQL tables from relational tables with a couple of clicks. Unfortunately, we can’t do that with Cassandra, since you can’t create a ColumnFamily on the fly. So my first Cassandra tasks was to write a simple program to take an Oracle table (or query) and create a matching column family.

Getting started

Getting started with Cassandra was surprisingly easy. I followed the instructions in http://schabby.de/cassandra-installation-configuration/ to install Cassandra on my laptop, and installed the hector Java interface from http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client/.

Terminology in NoSQL can be confusing, with each NoSQL database using terms differently from each other, and all of them using terms differently from RDBMS. In Cassandra:

A Keyspace is like a schema

is like a schema ColumnFamily is roughly like a table

Things get very funky when SuperColumns are introduced, but lets skip that for now.

To create a ColumnFamily in Cassandra 0.6, we have to add its name to the storage-conf.xml file which is in the Conf directory and then restart Cassandra. In 0.7 there’ll be a way to do this without restarting the server.

Here is where I created a keyspace called “Guy” and created some ColumnFamilies to play with:

1: "Guy"> 2: "G_Employees" CompareWith= "UTF8Type" /> 3: "G_Employees2" CompareWith= "UTF8Type" /> 4: "G_Employees3" CompareWith= "UTF8Type" /> 5: org.apache.cassandra.locator.RackUnawareStrategy 6: 1 7: org.apache.cassandra.locator.EndPointSnitch 8:

Loading data

I wrote some Java code that takes a SQL statement, and loads the result set directly into a column family. Here’s the critical method (the complete java program with command line interface is here):

1: private static void oracle2Cassandra(Connection oracleConn, 2: Keyspace keyspace, String cfName, String sqlText) 3: throws SQLException { 4: int rows = 0; 5: ColumnPath cf = new ColumnPath(cfName); 6: Statement oraQuery = oracleConn.createStatement(); 7: ResultSet result = oraQuery.executeQuery(sqlText); 8: ResultSetMetaData rsmd = result.getMetaData(); 9: while (result.next()) { // For each row in the output 10: // The first column in the result set must be the key value 11: String keyValue = result.getString(1); 12: // Iterate through the other columns in the result set 13: for ( int colId = 2; colId <= rsmd.getColumnCount(); colId++) { 14: String columnName = rsmd.getColumnName(colId); 15: String columnValue = result.getString(colId); 16: if (!result.wasNull()) { 17: cf.setColumn(StringUtils.bytes(columnName)); 18: keyspace.insert(keyValue, cf, StringUtils 19: .bytes(columnValue)); 20: } 21: } 22: rows++; 23: } 24: System.out.println(rows + " rows loaded into " + cf.getColumn_family()); 25: }

The method take s a Oracle connection and a SQL statement, and pushes the data from that SQL into the Cassandra column family and keyspace specified. The first column returned by the query is used on the key to the Cassandra data.

Lines 6-8 execute the statement and retrieve a ResultSet object – which contains the data – and a ResultSetMetaData object which contains the column names. Lines 9-21 just iterate through the rows and columns and create entries in the Column Family that match. We use the Hector setColumn methodto set the name of the column and the insert method to apply the column value. Too easy!

Of course, I’d have no idea as to whether my job had worked if I didn’t have Toad for Cloud databases available. Using TCD, I can map the Cassandra columnFamily to a TCD “table” and browse the table (eg Cassandra Column Family) to see the resulting data:

I can even use SQL to join the Cassandra data to the Oracle data to make absolutely certain that the data transfer went OK:

It’s surprisingly easy to get started with Cassandra. Installation of a test cluster is a breeze, and the Hector Java API is straight forward. Of course, direct mapping of RDBMS tables to Cassandra ColumnFamilies doesn’t involve the complexities of advanced Cassandra data models using variable columns and SuperColumns. Next, I’m going to try and map a more complex ColumnFamily which maps to multiple Oracle tables – hopefully won’t make my brain hurt too much!

Toad for Cloud Databases is introducing Cassandra support in the 1.1 release due out within the next two weeks. Its a free download from toadforcloud.com