Basho Technologies announced the open sourcing of Riak TS 1.3. Riak TS is specifically geared toward handling time series data – it supports fast writes and queries over time series data. In addition, Riak TS supports data aggregators and arithmetic operations, integration with Apache Spark via the Spark connector, client support for Java, Erlang and Python, and a standard SQL-based query system. Riak TS 1.3 EE (Enterprise Edition), built on the open source version, supports multi-cluster replication. A comprehensive list of features is available in the release notes.
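As a hedged illustration of the SQL-based query system, aggregators, and arithmetic operations mentioned above, a query against a hypothetical sensor table might look like the following (the table and column names are assumptions for illustration, not from Basho's documentation):

```python
# Illustrative only: a Riak TS-style SQL query combining aggregation
# (COUNT, AVG, MAX, MIN) with an arithmetic expression. Table and column
# names are hypothetical. Riak TS requires the WHERE clause to bound the
# time column of the primary key with upper and lower limits.
query = (
    "SELECT COUNT(*), AVG(temperature), MAX(temperature) - MIN(temperature) "
    "FROM sensor_readings "
    "WHERE region = 'us-east' AND sensor_id = 's42' "
    "AND time > 1451606400000 AND time < 1451610000000"
)
print(query)
```

A query like this would typically be submitted through one of the supported clients (Java, Erlang, or Python).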

InfoQ caught up with Dave McCrory, CTO of Basho, regarding the announcement.

InfoQ: Can you give a brief overview of the Riak suite of products and specifically about Riak TS? Will the development of the suite of products continue independently of each other?

McCrory: The Riak family of products is centered on Riak Core, an open source clustering solution that Basho has developed over the last seven years. Riak KV is built on top of Riak Core and provides a highly resilient and available key-value store. Riak KV continues to evolve, focusing on areas such as data correctness and preventing data loss and corruption. Riak TS is a purpose-built time series data store built on Riak KV. We took all of the strengths of Riak KV and applied them to solving the issues customers were having with time series data. So what exactly did we do? Here are some of the things:

Faster write path for data

Schemas for buckets

Query planner and subsystem

Parallel extract from vnodes

Flexible compound keys
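Several of the items above – schemas for buckets and flexible compound keys in particular – appear directly in Riak TS's table DDL. The following is a sketch with hypothetical names and an illustrative quantum size:

```python
# Sketch of a Riak TS table definition (names are hypothetical).
# The compound primary key has two parts: a partition key, which includes
# a QUANTUM on the time column so rows are grouped into 15-minute ranges
# for placement on the ring, and a local key that orders rows on disk
# within a partition.
create_table = """
CREATE TABLE sensor_readings (
    region      VARCHAR   NOT NULL,
    sensor_id   VARCHAR   NOT NULL,
    time        TIMESTAMP NOT NULL,
    temperature DOUBLE,
    PRIMARY KEY (
        (region, sensor_id, QUANTUM(time, 15, 'm')),
        region, sensor_id, time
    )
)
"""
print(create_table)
```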

InfoQ: Riak TS has been almost 18 months in the making. Can you talk about the journey?

McCrory: Early on we were seeing customers using Riak KV to store time series data. When we looked at what was required, we could see that a lot of customization was needed to make things run smoothly. We also looked at the market for time series data stores and saw only a few solutions, all of which lacked enterprise production workload qualities. They either lacked scalable clustering or resiliency, or they were very intensive to manage and operate, all of which made the existing time series solutions poor choices.

We then had an architecture meeting and began discussing ideas on how we could tackle this problem. Ultimately one of our engineers had an interesting idea of distributing data around the hash ring by quanta (time ranges) and built a proof of concept that seemed to work well. This began our journey of solving many of the harder time series problems.

There were several other constraints we were dealing with, such as performance requirements. We needed to support millions of data points per second on a cluster of, say, 8 machines. Achieving this requires profiling the flow of data from ingestion into the cluster to storage on disk, and then the path to read the data back. As we started to look at this, we realized that we would need a dedicated team to focus on performance, so the performance team was created. The performance team was making discoveries about some of our inefficiencies fairly early on, immediately after learning how to do end-to-end profiling. This gave us a group of things to focus on, performance-wise, that would give us the greatest payoff. One of the major things we found was that we were doing multiple encodes/decodes, which was consuming CPU cycles and adding a lot of latency, so we began to work on eliminating them. Meanwhile we were looking at how a time series store could respond quickly to queries without requiring all queries to rely on secondary indexes.
We landed on applying a schema to the data, because time series data has an inherent structure. This opened up quite a few doors, including the ability to implement a query engine and language. Ultimately we decided to use standard SQL rather than our own flavor of SQL, as other vendors had done, because we found that approach made things more complex. We also had to build a query parser and planner to request data only from the set of nodes that held it, rather than from all nodes containing the data or from every node across the entire cluster. This proved to be challenging, but the resiliency and performance numbers speak for themselves.
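The quantum idea McCrory describes can be sketched in a few lines: given a query's time range and the table's quantum size, the planner only needs to contact the vnodes owning the quanta that the range touches. This is a simplified illustration of the concept, not Basho's implementation:

```python
def quanta_for_range(start_ms, end_ms, quantum_ms=15 * 60 * 1000):
    """Return the quantum boundaries (in ms) a query's time range touches.

    Simplified illustration of distributing time series data by quanta:
    each quantum hashes to a location on the ring, so a query planner
    needs to contact only the vnodes owning these quanta rather than
    every node in the cluster.
    """
    first = start_ms - (start_ms % quantum_ms)  # floor to quantum boundary
    return list(range(first, end_ms + 1, quantum_ms))

# A 45-minute range with a 15-minute quantum touches three quanta.
print(quanta_for_range(0, 45 * 60 * 1000 - 1))  # [0, 900000, 1800000]
```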

InfoQ: Apart from IoT use cases, what are other good examples of Riak TS usage?

McCrory: Many customers and prospects have been addressing general time based data problems, whether they be information reporting and auditing, recording scores and bets, or storing metrics. It seems like almost every week I learn of some new application for Riak TS.

InfoQ: Is Basho the biggest contributor to Riak TS? Which other companies are contributing? What’s the objective behind open sourcing the code?

McCrory: Basho is indeed the biggest contributor to Riak TS, primarily because we had to make so many additions and changes to produce a purpose-built time series solution. We are currently talking to several companies about adding a series of capabilities to Riak TS to solve a large problem in the time series arena. As for the objective of open sourcing the code, we believe that we have a lot to offer the community in terms of innovative approaches, ideas, and leadership in distributed systems, and we want to collaborate to build even better solutions. That process is almost always accelerated when you leverage open source as a path. We have a long history of open sourcing our software and gaining support from the community in creating better solutions.

InfoQ: The Apache Spark connector is a critical part of using Riak TS for Big Data solutions, correct? Can you talk about the connector and how it compares with other NoSQL connectors to Apache Spark?

McCrory: Apache Spark is definitely the leader when it comes to Big Data analytics for the majority of customers we see. We do have customers that are applying Riak TS to Big Data solutions without Apache Spark; these are use cases that don't require deep analytics, machine learning, or some of the other things that Apache Spark brings. That being said, the majority of our Big Data customers are using Riak TS with our Spark Connector and Apache Spark. We have been working on the Riak Spark Connector for quite a while now and it has gone through at least three iterations to get to this point. One of the things that we have done is to let the Spark Connector leverage the parallel extract feature we've built. By doing so, we are able to pull all of the requested data from all of the nodes simultaneously. This speeds up getting data into Spark and, together with the write performance improvements we've made, makes Riak TS a great solution when leveraging Apache Spark and the Spark Connector.
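A minimal sketch of reading a Riak TS table into a Spark DataFrame through the connector follows. The format name and option key are assumptions based on the basho/spark-riak-connector project, and a running Riak TS cluster plus a configured Spark installation are required, so the helper below only wires up the call:

```python
# Sketch only: the connector format name and option key are assumptions
# from the basho/spark-riak-connector project, and `spark` is expected to
# be a pyspark SparkSession already configured for your environment.
CONNECTOR_FORMAT = "org.apache.spark.sql.riak"
RIAK_HOST_OPTION = "spark.riak.connection.host"

def read_ts_table(spark, table, host="127.0.0.1:8087"):
    """Load a Riak TS table as a Spark DataFrame. The connector uses
    Riak TS's parallel extract to pull rows from all vnodes at once."""
    return (spark.read
            .format(CONNECTOR_FORMAT)
            .option(RIAK_HOST_OPTION, host)
            .load(table))
```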

InfoQ: Can you provide the roadmap for Riak TS including supporting HTTP API security and the focus on IoT?

McCrory: We expect to have HTTP API security in the upcoming 1.4 release of Riak TS, along with quite a few other features, including additional SQL command support, initial support for streaming results, and several other capabilities. The focus of Riak TS on IoT is centered on providing a time series database that is very easy to operate and use. By focusing on IoT data, where we see demands for higher and higher throughput with consistently small chunks of data, we will continue to optimize for performance and for interfacing with other complementary technologies. One area we are looking at now is visualization of data and how we can partner with, or make Riak TS work with, visualization solutions that are popular in open source and in the industry. Usually the point of gathering and storing all of this IoT data is to identify some action to be taken.

Getting started material is available in the Basho docs.