This post is based on the experience I have got while taking part in a very interesting forecasting competition hosted by XTX. Participants were challenged by the task to forecast the future return of a (presumably) Forex asset based on the limit order book (LOB) data. No details of the asset or limit order book dates were disclosed as part of the competition.

As part of the competition, XTX provided their proprietary model development data. The data included 3 million tick records, with each record containing multiple levels of the bid and ask prices. The competition was designed in such a way that the full previous history of accumulated tick data was available at every point in time, but no historical returns can be used for the future forecast. This eliminated classical time series models and required participants to fully utilise predictive patterns of the limit order book.

In this post, I will focus on the application of Kalman filter to derive the implicit state of the LOB. A detailed description of the theory behind Kalman filters can be found in a broad range of academic resources (example).

Model framework

I followed a relatively standard approach to building a regression model:

XGBoost regression model as a core

Grid search of parameters based on walk forward cross-validation

Generation of additional dynamic factors based on historical tick data

I will omit details of XGBoost and cross-validation, all details can be found on my github. Instead, I will focus on the feature engineering technique based on Kalman filter.

Using Kalman filter to infer implicit flows of LOB

The LOB data comes in a form of arrays of bid/ask size and price, split into 15 buckets: bidSize[0..14], askSize[0..14], bidPrice[0..14], askPrice[0..14]. Refer to the plot below for cumulative Bid and Ask volumes in the LOB over a short period of time:

To track the dynamics of LOB over time, implied hidden cash flows between the LOB buckets are introduced. To simplify, all buckets [1..14] are merged into a single one. Hence the observations are described as vector [bidSize1_14, bidSize0, askSize0, askSize1_14]

The hidden states of the system are defined based on the following eleven parameters:

Four parameters defining the current volume in each of the buckets bidSize1_14, bidSize0, askSize0, askSize1_14; Even though these parameters are directly observed in the data, Kalman filter treat them as being affected by measurement and process error;

Four parameters defining an external cash flow in/out of the corresponding buckets;

Three parameters defining the following cash-flows between the buckets:

ask1_14 ⇆ ask_0 bid1_14 ⇆ bid_0 ask_0 ⇆ bid_0

The hidden states assume the existence of an implicit flow between the buckets, in addition to the flow from the external world. As we see further, the estimates of these flows are used as factors in the regression model.

Based on the states above, the 11-by-11 Kalman state-transition matrix is defined as follows:

In the state-transition matrix, rows and columns are ordered in accordance with the list of hidden states above. From row 5 onwards it is an identity matrix, representing that no changes in flows are expected. Rows 1 to 4 represent the transition of the current bid/ask volume. For example, the transition of BidSize for the buckets 1-14 over time t is as follows:

with ω​ representing the white noise process.

The Kalman filter algorithm is implemented in the code below:

The code returns the values of implicit flows that are further used as predictive factors in the core regression model. As an example, the first 20000 observations of the flow (ask_0 ⇆ bid_0) is represented on the time series below:

The new predictive factors are used in the XGBoost model and their relative importance is estimated based on the number of times they are used in the XGBoost decision trees.

The extract of the variable importance statistics is represented on the chart below, with the first feature representing the flow ask_0 ⇆ bid_0:

Conclusion

We observed an example of a factor that is generated based on the implicit dynamic cash flow of the asset’s LOB. Even though the flow is implicit and unobserved in the data, the use of the Kalman filter algorithm allowed to generate the factor with significant predictive power when used as an input into the core XGBoost algorithm.