We, at Trainline, are delighted to announce Optimus, an open-source data store that provides transaction isolation semantics and a version control mechanism on top of a traditional key-value store. Optimus is designed to be used by Data Scientists to store and retrieve reference data and model coefficients calculated across millions of keys.

Optimus is a key component of our data platform, which enables Data Scientists to build and deploy data products with highly predictive, data-driven features that save our customers time, hassle and money. It is the data store that powers Trainline features like BusyBot and Price Prediction.

Motivation

Many of the data-driven services Data Scientists develop are simple lookups of some coefficient or a number of items predicted in advance for a given user. These factors are calculated at regular intervals, typically nightly or hourly, uploaded to a database and exposed via a REST-based service. The service might have some additional logic, but typically the added value for the business is in the data itself.

During the MVP phase for our first data product, we started with a simple solution which involved a daily sync of the model data from an AWS S3 bucket to an AWS DynamoDB table. While this solution allowed us to get the MVP out of the door, we knew that we had to solve a few key issues:

Dirty reads — Loading millions of keys into a database is a long-running operation that can take hours to complete. While a new set of keys is being loaded, the service would return inconsistent/partial data, leading to a poor user experience.

Version Control — Rolling back required a ‘reload’ of the previous dataset. The service would return inconsistent/partial data during the roll-back, which could take hours to complete.

Flexibility — DynamoDB may not be the ideal backend for all datasets.

Solution

Optimus is our answer to the above issues. Optimus consists of a RESTful API to store and retrieve data. Transaction isolation semantics are also implemented in the API layer. Data is loaded into Optimus using an Apache Spark based job, which can load millions of keys. The diagram below shows a logical view of the application architecture.

Design

Here is a brief overview of the concepts/entities in Optimus:

Datasets — A dataset is a collection of tables which are semantically linked. A dataset may contain several tables which are updated (or rolled back) all together in the same version. It is NOT possible to rollback individual tables in a dataset.

Tables — A table is a collection of entries.

Entries — An entry is a key/value pair. Optimus provides an API to read individual keys as well as groups of keys.

Versions — Datasets are immutable by default and the only way to update a dataset is by creating a new version of it. A newly created version is effectively in a draft state (AWAITING_ENTRIES) and ready to accept new data. Once the entries are loaded, the version can be SAVED, at which point it stops accepting new entries. Any reviews or validations can be carried out at this stage. Once SAVED, a version can be PUBLISHED or DISCARDED. To roll back the dataset to a previous version, it is sufficient to re-publish that version.
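The version lifecycle above can be sketched as a small state machine. This is an illustrative model, not the actual Optimus API; the class and method names are assumptions:

```python
class DatasetVersion:
    """Illustrative sketch of the Optimus version lifecycle:
    AWAITING_ENTRIES -> SAVED -> PUBLISHED (or DISCARDED)."""

    def __init__(self, version_id):
        self.version_id = version_id
        self.state = "AWAITING_ENTRIES"
        self.entries = {}

    def put_entry(self, key, value):
        # Entries are only accepted while the version is a draft.
        if self.state != "AWAITING_ENTRIES":
            raise RuntimeError(f"version {self.version_id} no longer accepts entries")
        self.entries[key] = value

    def save(self):
        # Freezes the version so reviews/validations can run.
        if self.state != "AWAITING_ENTRIES":
            raise RuntimeError("only a draft version can be saved")
        self.state = "SAVED"

    def publish(self):
        # Re-publishing a previously published version is how
        # a rollback is performed.
        if self.state not in ("SAVED", "PUBLISHED"):
            raise RuntimeError("only a saved version can be published")
        self.state = "PUBLISHED"

    def discard(self):
        if self.state != "SAVED":
            raise RuntimeError("only a saved version can be discarded")
        self.state = "DISCARDED"
```

Once a version leaves the draft state, any attempt to add entries fails, which is what guarantees that a published version is immutable.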

Transaction Isolation using Data Versioning

While API versioning and deployment strategies to roll-forward and roll-back are common practice in SOA systems, we couldn’t find much support for similar strategies for data. Conventional key-value stores do not provide features which allow fine grained control over the publishing of large-scale data.

Some datasets have invariant relations between values of the same set which must be preserved even while a model is being updated. From a consumer point of view, we would like to either see all the old values or all the new values, and it should never be possible to observe two values from two different versions at the same time.

To achieve the above goal, we introduced the concept of ‘data versions’ which define the boundaries of when certain data is ready to be consumed.

The entire dataset must be updated when a new version is created; it is not possible to update individual keys. Once the dataset is loaded, the version can be SAVED and PUBLISHED, at which point no more changes can be made and the system semantically replaces all key values for the dataset at once.

When a version is being published, in-flight read requests may return inconsistent results. It is possible to achieve `repeatable reads` consistency by explicitly specifying the `version-id` during read requests.
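This isolation model can be sketched as a store that keeps every version's entries plus a single pointer to the published version: reads either follow the pointer or pin an explicit version id. A minimal in-memory sketch, with illustrative names that are not the Optimus API:

```python
class VersionedStore:
    """Sketch of version-based isolation: writes go into complete
    versions, and readers see exactly one published version."""

    def __init__(self):
        self.versions = {}     # version_id -> {key: value}
        self.published = None  # id of the live version

    def load_version(self, version_id, entries):
        # An entire dataset is loaded as a new version; individual
        # keys of an existing version are never updated in place.
        self.versions[version_id] = dict(entries)

    def publish(self, version_id):
        # A single pointer swap atomically switches all readers
        # to the new version.
        self.published = version_id

    def get(self, key, version_id=None):
        # Pinning version_id gives repeatable reads, even if a new
        # version is published between two requests.
        vid = version_id if version_id is not None else self.published
        return self.versions[vid][key]
```

Because readers only ever dereference one version, it is impossible to observe values from two different versions in a single pinned sequence of reads.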

Rollback

In some cases it might be necessary to revert to a previous set of values. For example, metrics may show that the data product has been performing poorly since the latest model values went live. Unlike the rollback feature in ACID databases, we have to be able to efficiently and rapidly revert millions of values. This is similar to reverting a change in a version control system like `git`.

Optimus provides a single switch to control a logical rollback to a previous stable version.

Correctness and data integrity

When updating millions of individual values it is hard to establish whether all keys have been written and all their values have preserved their integrity. Optimus is designed to support customised data integrity and correctness checks. In the future, Optimus will provide out-of-the-box strategies to verify data.
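One simple verification strategy of this kind (a sketch of what such a check could look like, not something the source says Optimus ships today) is to compare the key count and an order-independent content digest of the source datafile against what was actually stored:

```python
import hashlib

def dataset_digest(entries):
    """Order-independent digest of a set of key/value pairs, usable
    to compare a source datafile against a loaded version."""
    h = hashlib.sha256()
    for key in sorted(entries):
        h.update(f"{key}={entries[key]}\n".encode("utf-8"))
    return h.hexdigest()

def verify_load(source_entries, stored_entries):
    # All keys written, and every value preserved exactly.
    return (len(source_entries) == len(stored_entries)
            and dataset_digest(source_entries) == dataset_digest(stored_entries))
```

Sorting the keys before hashing makes the digest independent of write order, which matters when a parallel loader writes batches in a nondeterministic sequence.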

Storage agnostic

While our first implementation is based on DynamoDB as backend storage, it is clear that different datasets will have different access patterns and latency requirements. For this reason, the design favours a storage-agnostic approach whereby, for a specific use case, we could choose a completely different storage system such as Memcached or Redis to provide lower latency.

The choice of backend storage system should be completely transparent from the user perspective: a service optimisation decision rather than a Data Scientist concern. By keeping the storage technology abstracted within the Optimus service, we could implement a hierarchical storage system in which datasets are dynamically spread across layers with different performance characteristics, with the implementation completely transparent to service users.

Model data upload service

Optimus includes an Apache Spark based ‘Loader’, which uploads all key-value pairs in a given datafile using the Optimus REST API.
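The core of such an upload loop is splitting the datafile's key-value pairs into batches and writing each batch to the draft version's endpoint. A sketch under stated assumptions: the endpoint path and the `post` callable are hypothetical, not the actual Optimus REST API:

```python
def batches(entries, size):
    """Split a dict of key/value pairs into fixed-size batches,
    suitable for writing to a bulk-entries endpoint."""
    items = list(entries.items())
    return [dict(items[i:i + size]) for i in range(0, len(items), size)]

def upload(post, version_url, entries, batch_size=25):
    """Sketch of a loader upload loop. `post` is any callable taking
    (url, payload), e.g. a thin wrapper over an HTTP client. Because
    a draft version only goes live when explicitly published, a
    failed batch can simply be retried: re-writing the same keys
    into the draft is idempotent."""
    for batch in batches(entries, batch_size):
        post(f"{version_url}/entries", batch)
```

Retry-safety is the point of loading into a draft version: partial failures never become visible to readers, because nothing is observable until the version is published.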

Technology

Under the hood, Optimus is built using the following tools and technologies:

The REST API and the loader are developed using Clojure.

The loader uses Apache Spark to upload large-scale datasets while maintaining data integrity in the face of partial failures and retries.

DynamoDB is used as the backend storage technology for the key-value store, the metadata store, and a basic queue used internally by Optimus to perform background tasks.

The diagram below shows a high-level overview of our current implementation of Optimus.

Getting Started

The code is available now on the Trainline public GitHub repository: https://github.com/trainline/optimus. All code is under the Apache 2 license.

There is also a website with further information and installation instructions. More content will appear here over the next few weeks.

We wanted to open source this tool as early as possible and then continue to improve it. We plan to add useful features in the near future, like data verification strategies and garbage collection. We are now using this open source version ourselves internally, not a separate fork. This means we will continue to add features and improve the code base — including improvements for the wider community such as reducing dependencies, simplifying setup and removing Trainline specific assumptions.

We genuinely believe this tool has great potential to help the wider community.

We would love to hear what you think. For feedback, help or suggestions, please contact: optimus@thetrainline.com