This is the first installment in a series about using Scala for large-scale data stores and analytics.

Over the past several years, multiple projects have used Scala to build large-scale data stores and analytics platforms.

The BBC has used Scala and the Scalatra HTTP framework to build its internal RDF data store, which is used for querying linked data around BBC News and Sport articles.

Throughout 2014, there was a big upsurge of interest in Apache Spark, a data analysis toolset written in Scala. Spark has an HTTP interface in Spray and an Akka back-end for parallel processing.

McLaren Applied Technologies (MAT) has also used Scala, Scalatra and Akka as the basis of its data analytics platform.

Why are these projects choosing Scala?

To find out, we talked to Andrew Jayne, senior software engineer at MAT, about the experience of building a custom high-performance data store in Scala.

Based near London, McLaren Applied Technologies stems from parent company McLaren's decades of experience in Formula 1 racing. MAT split off to apply Formula 1 related knowledge in other industries.

Because of their F1 background, MAT have very strong skills in analysing and making quick decisions based on data, such as the time series data that stream off Formula 1 cars during a race. However, they've found that series analysis, often coupled with sensor instrumentation, can be applied to many things besides racing: they've advised companies in the manufacturing, transport, energy, and healthcare sectors.

To support this consulting work, MAT engineers have built a series data platform, which runs on AWS.

InfoQ: Can you provide an overview of what the MAT series data store is for, and how it's used?

Jayne: It's a high-volume, high-velocity ordered key-value store with read-your-writes consistency and a feature set specifically designed for real-time and historical series data (indexed by time, distance, depth, etc.). It enables us to do micro-batch and batch processing in a version-controlled way and is used for the persistence of all series inputs and outputs to and from our data analytics via a REST API. An example is some work we are doing with GSK to monitor and analyse ALS patients - instrumented with sensors - through clinical trials, enabling detection of improvements and deterioration to feed back into the trial.

InfoQ: Are you using the MAT series data store for Formula 1-related time series data, or is it more general-purpose?

Jayne: It was designed as a general solution for storing and querying series data based on our experience in the motorsport, elite sport, health and energy industries. We faced similar problems across projects and needed a storage solution that maintained data integrity whilst enabling us to solve them. For example, what do you do when data originates from a device with clock drift? How do we ensure high fidelity if data is uploaded at low resolution due to bandwidth constraints? What happens when a backlog of data becomes available after an extended period of network outage?

InfoQ: What sort of datasets and workloads are you using it for?

Jayne: It is used for n-dimensional numeric data with a numeric, strictly increasing key. Current use cases include sensor data from instrumented patients, drilling operations and data centres in addition to the results from our data analytics.
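
To make the data model concrete, here is a minimal sketch of a channel of n-dimensional samples with a strictly increasing numeric key, plus a window query and a simple aggregation. It is an illustration only, not MAT's implementation: the names SeriesSketch, Channel, append, window and mean are hypothetical, and the key is assumed to be a Long (for example a timestamp in milliseconds).

```scala
import scala.collection.immutable.TreeMap

object SeriesSketch {
  // Hypothetical model: one channel of samples keyed by a strictly increasing
  // numeric key (time, distance, depth, ...). Each sample is an n-dimensional vector.
  final case class Channel(
      samples: TreeMap[Long, Vector[Double]] = TreeMap.empty[Long, Vector[Double]]) {

    // Append a sample, rejecting any key that does not strictly increase.
    def append(key: Long, value: Vector[Double]): Either[String, Channel] =
      samples.lastOption match {
        case Some((last, _)) if key <= last =>
          Left(s"key $key is not greater than last key $last")
        case _ =>
          Right(copy(samples = samples.updated(key, value)))
      }

    // Samples with from <= key < until, returned in key order.
    def window(from: Long, until: Long): Vector[(Long, Vector[Double])] =
      samples.range(from, until).toVector

    // Mean of one dimension over a window; None if the window is empty.
    def mean(from: Long, until: Long, dim: Int): Option[Double] = {
      val xs = window(from, until).map(_._2(dim))
      if (xs.isEmpty) None else Some(xs.sum / xs.size)
    }
  }
}
```

Because the channel is an immutable value, every append returns a new Channel and leaves the previous one untouched, which is the property the later answers on revision control and consistency rely on.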

InfoQ: Why Scala? Did you have a background in Java and move over? What attracted you to the language?

Jayne: The background of the team is polyglot. Until 2012 we were using Java, C# and MATLAB. With the release of Typesafe Stack 2.0 we adopted Scala as the language of choice for the backend of our services and web applications. The main motive at the time was scalability, for example simplifying concurrency through Actors, immutability and functional programming (FP). There were other incentives, such as being able to express programs more mathematically, pattern matching, and reducing boilerplate. However, there's no way we would have gone for Scala if we had to sacrifice Java interoperability. This FP approach was fresh, invigorating and presented us with more opportunities to innovate. It is a good fit for the culture and mentality of the team.

InfoQ: Why not build it in Java?

Jayne: We already had a component developed in Java on top of Hadoop which we could have built upon but the rest of our backend components were in Scala. We used this as an opportunity for some R&D and soon had a good idea of which libraries and technologies we'd be able to use to support the architecture we'd designed. With the experience gained we realised that what was currently built could easily be simplified with FP. Whilst Java is easy to use, its verbosity can get in the way of making the intentions of code clear. We've been using Scala commercially for years and have also found that there's great value in being able to talk to the people building the language and core libraries.

InfoQ: What didn't other time series databases give you? Why not use KairosDB, Druid, InfluxDB, OpenTSDB, or one of the existing proprietary TSDBs?

Jayne: Data integrity and data management are essential for our business and we needed some of the guarantees around data that we expect from our code. It was not enough just to take backups and record logs. As a result we designed a data store with commit-based revision control and tree-based organisation influenced by Git. Our acceptance criteria presented additional challenges, for example:

Atomicity - it should not be possible to query the system in an inconsistent state.

Consistency and correctness - queries should be repeatable and return consistent, correct data.

Near real-time writes - support writing millions of measurements per second (from high and variable frequency sensors) and making the data available for querying within seconds of it being captured.

Query performance - support latency-sensitive data analytics by providing predictable and fast queries across entire channels (billions of datapoints), including: data preparation (e.g. interpolation), subsets (e.g. windows, ranges, offsets, limits), aggregations (e.g. averages, min, max), forecasting (e.g. extrapolation, regression, correlation).

Language agnostic - REST API with ETag caching.

Common data formats - data should be compatible with existing tools that use CSV, JSON, etc.

Deployment complexity - it should utilise existing cloud infrastructure (AWS) and enable new data stores to be created with minimal effort.

Many of these have been addressed by existing TSDBs and some have solved these problems better. However, there are certain queries that we require that other databases do not support and are not easy to retrofit. The majority of TSDBs are built on top of another database system (such as HBase), which is actually more complex than required for our use case. By using a simpler proprietary storage mechanism that takes advantage of distribution we can arrange data such that it is optimised for the queries we perform and keep system administration overheads and running costs at a minimum. Most importantly, none provide the revision control semantics around data that we expect in development and production.
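
The idea of commit-based revision control around data, combined with the atomicity and repeatability criteria, can be sketched in a few lines of Scala. This is a hedged illustration under stated assumptions rather than a description of MAT's storage engine: RevisionSketch, Commit, History and Store are hypothetical names, history is linear, and everything is held in memory.

```scala
import java.util.concurrent.atomic.AtomicReference

object RevisionSketch {
  type Data = Map[String, Vector[Double]] // hypothetical: channel name -> values

  // An immutable commit: a snapshot of the data plus a link to its parent, loosely like Git.
  final case class Commit(id: Int, parent: Option[Int], data: Data)

  // Immutable view of the whole history and the current head.
  final case class History(commits: Map[Int, Commit], head: Int)

  final class Store {
    private val root = Commit(0, None, Map.empty)
    private val state = new AtomicReference(History(Map(0 -> root), 0))

    // Apply a change on top of the current head and publish it in one atomic swap,
    // so readers see either the old head or the new one, never a half-applied write.
    def commit(change: Data => Data): Int =
      state.updateAndGet { h =>
        val next = Commit(h.head + 1, Some(h.head), change(h.commits(h.head).data))
        History(h.commits.updated(next.id, next), next.id)
      }.head

    def head: Int = state.get.head

    // Reading at a fixed commit id is repeatable: later commits never alter it.
    def read(commitId: Int, channel: String): Option[Vector[Double]] =
      state.get.commits.get(commitId).flatMap(_.data.get(channel))
  }
}
```

A query pinned to a commit id returns the same answer however many writes arrive afterwards, which is one simple way to obtain repeatable queries; a production system would also need persistence, branching and garbage collection, none of which are shown here.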

InfoQ: Speaking of the REST API, you're using Scalatra for the REST HTTP interface. Why?

Jayne: It is well maintained and follows a sensible release strategy; the lack of one has bitten us before when using other frameworks. The framework has a very simple DSL and allows us to just pull in the bare minimum - request handling and testing - and use our libraries of choice for things like serialisation. Being able to easily deploy into servlet containers is a big win and the performance is excellent.
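
For readers unfamiliar with the ETag caching mentioned among the acceptance criteria, the following framework-independent sketch shows the basic decision a REST endpoint makes. It deliberately avoids any Scalatra API: ETagSketch, Ok, NotModified, etagOf and respond are hypothetical names, the tag is assumed to be a hash of the response body, and If-None-Match is simplified to a single value (the real header may carry a list or "*").

```scala
import java.security.MessageDigest

object ETagSketch {
  // Hypothetical response model; a real service would use its HTTP framework's types.
  sealed trait Response
  final case class Ok(body: String, etag: String) extends Response
  case object NotModified extends Response

  // Strong ETag derived from the response body: a quoted hex SHA-256 digest.
  def etagOf(body: String): String = {
    val digest = MessageDigest.getInstance("SHA-256").digest(body.getBytes("UTF-8"))
    digest.map(b => f"${b & 0xff}%02x").mkString("\"", "", "\"")
  }

  // If the client's If-None-Match value equals the current ETag, skip the body (HTTP 304).
  def respond(body: String, ifNoneMatch: Option[String]): Response = {
    val tag = etagOf(body)
    if (ifNoneMatch.contains(tag)) NotModified else Ok(body, tag)
  }
}
```

In a store with revision control, a commit identifier could plausibly serve as the tag instead of a body hash, since data at a given commit never changes; that is an assumption, not something the interview states.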

InfoQ: You've mentioned Actors. How did the choice of Akka work out in practice? I've heard from people that love it, and others who say that their code became immensely simpler once they moved away from it. What were the challenges there?

Jayne: Akka is used in the data store to manage requests asynchronously since most of a request's lifetime will be I/O. It's not currently event-based I/O but we may consider Akka Streams for the benefit of its scheduler and overflow handling. We've used the actor model when a problem can be decomposed into a set of sub-systems each with its own state and where scalability is a consideration. Designing to be distributed by default means you can take advantage of location transparency to scale both vertically and horizontally. However, the loss of formal interfaces is unsettling since we lose those compiler checks. A compromise is formal protocols, keeping actors simple and making these explicit in our design and tests. Writing clear and concise tests can be tricky, and debugging is challenging in any concurrent system.
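
One common way to make an actor's protocol explicit, in the spirit of the "formal protocols" compromise described above, is to model its messages as sealed traits and keep the message handling in a pure function. The sketch below uses no Akka API at all; ProtocolSketch, StoreMessage, StoreReply and handle are hypothetical names chosen for illustration.

```scala
object ProtocolSketch {
  // Hypothetical message protocol for a storage actor. Sealing the trait means the
  // compiler knows every possible message and can warn about unhandled cases.
  sealed trait StoreMessage
  final case class Write(channel: String, key: Long, value: Double) extends StoreMessage
  final case class Read(channel: String, from: Long, until: Long) extends StoreMessage

  sealed trait StoreReply
  case object Written extends StoreReply
  final case class Samples(values: Vector[(Long, Double)]) extends StoreReply

  type State = Map[String, Vector[(Long, Double)]]

  // The actor's behaviour as a pure function: (state, message) => (new state, reply).
  // Keeping it pure lets it be unit-tested without starting an actor system.
  def handle(state: State, msg: StoreMessage): (State, StoreReply) = msg match {
    case Write(ch, k, v) =>
      val updated = state.getOrElse(ch, Vector.empty) :+ (k -> v)
      (state.updated(ch, updated), Written)
    case Read(ch, from, until) =>
      val hits = state.getOrElse(ch, Vector.empty).filter { case (k, _) => k >= from && k < until }
      (state, Samples(hits))
  }
}
```

An actor wrapping this function would only need to hold the current State and forward each incoming message to handle, so most of the logic can be tested as ordinary functions while the concurrent parts stay small.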

InfoQ: Can the datastore be clustered across multiple nodes? How well does it scale out? And is that related to the choice of Scala, or could it have worked in any language?

Jayne: Currently it's scalable in terms of volume and failover - API endpoints and data are co-located. The next step to improve scalability would be to remove this constraint so that we can use an elastic load balancer on the REST API and decouple it from the storage layer. The pace of development was very quick in Scala and some of the most complex processing is very clear when written in FP. Favouring immutability by default encourages cleaner programming patterns and makes system atomicity and consistency much easier to enforce. However, there's nothing that couldn't have been done in Java - it's just that this was the right tool for the team.

Dave Hrycyszyn is head of technology at Head London, a digital products and services agency in the UK. He is co-author of Scalatra in Action, and a member of the team working on the Scalatra micro-framework. He blogs on Scala, APIs, and data-related topics at constructiveproof.com.