Optimizing some of our automated strategies requires us to develop a rich understanding of market dynamics on very short timescales. Price data as granular as one sample per minute is available through free tools like Enigma’s Catalyst, but given that variations on the order of milliseconds can meaningfully impact the profitability of our strategies, such data is too coarse-grained for the type of analysis we’d like to conduct. In theory, the highest resolution data we could hope to obtain from an exchange for a given trading pair would include every trade as well as every update to the order book, effectively allowing us to recreate the state of a market at any given point in time¹. Fortunately, this data is freely available, as many exchanges offer real-time feeds of market data via WebSocket APIs².

The WebSocket streams we use for data ingestion are often the very same streams responsible for providing real-time updates to the exchange’s trading dashboard.

This is the Binance trading dashboard for the BTC-USDT pair. We can see the order book updating on the left and recent trades appearing on the right.

WebSocket connections transmit data as discrete messages. We can inspect the content of individual messages as they stream into the browser. Here we see a message that contains a batch of order book updates:

The updates are expressed as arrays of bids and asks that were either added to the book or removed. Code running in your browser processes these updates, resulting in a real-time rendering of the market’s order book.

This Python script simply demonstrates that we can programmatically access the same stream of order book updates that we saw in the browser:

In practice, our system does not read a single stream, but rather thousands of different streams, covering various data feed types for all trading pairs across multiple exchanges³. Each stream is managed by a dedicated “worker.” Each worker builds up a buffer of incoming messages, then, after a fixed time interval, flushes the buffered messages into a file (thus emptying the buffer) and finally writes the file to Amazon’s S3 file storage. The buffer-flush cycle repeats.

Files in S3 containing trade data from Binance for the BTC-USDT pair from August 13th, 2018

Here’s what the content of one of these files looks like:

This file contains trade data that one of the workers ingested from a WebSocket stream, batched, and flushed to S3. Each row contains data for a single trade. You can see that the “trade_id” values increment by 1 — a quick sanity check that there are no gaps in this data.

Lots of data for lots of pairs

We can search for specific records within the millions of files we’ve written to S3 using another tool from Amazon called Athena, a serverless query service that spares us the setup and maintenance required for a traditional database. After a bit of basic configuration that essentially tells Athena what to look for inside these files and where to look for the files themselves, we can now search through our data using standard SQL syntax.

Using Athena to fetch all 239,945 trades on Binance for the BTC-USDT pair on August 13th, 2018, ordered by the trade timestamp (ascending)

Note the “directory” structure (S3 directories are actually just name prefixes) we use to write our files to S3:

This structure functions as an important optimization for when Athena retrieves the data. Athena has a concept of “partitions.” Partitions are simply mappings between virtual columns (in our case pair, year, month, and day) and the S3 directories where the corresponding data is stored. By pointing Athena directly to a particular subset of data, a well-defined partitioning scheme can drastically reduce query run times and costs. Our code automatically adds partitions when we start writing data to new S3 directories.

It’s worth taking a moment to appreciate the value proposition of S3 + Athena. These services are far cheaper and far more scalable than a bona fide SQL database, yet we can search for data using the same familiar query language. Granted, Athena can sometimes be slower than a proper SQL database and the complexity of queries is limited, but given that we mostly use it for ad-hoc investigations and fetching data sets for further analysis in Python, our workflow has been unencumbered by these minor shortcomings.

Speaking of Python, here’s an example of an interface we built to fetch records from Athena. It’s a thin wrapper around Amazon’s boto3 library with a few useful steps baked in.

This interface for retrieving Bittrex trade data takes a ticker pair, start date, and end date as arguments then returns the query results from Athena as a Pandas DataFrame. There’s also an optional argument to write the data to disk as a CSV. This query returned 126,434 trades from Bittrex for the BTC-USDT pair from August 1st, 2018 to August 3rd, 2018.

I’m glad I could provide a brief glimpse into a project I’ve enjoyed working on yet have had few opportunities to discuss (turns out real-time data ingestion doesn’t make for great small talk). In future posts, I’ll cover the motivation behind some of our design decisions, examples of specific insights we’ve been able to glean from these data sets, planned improvements, and more.