This is the second in a new advanced series of posts written by Imanol Pérez, a PhD researcher in Mathematics at Oxford University, and a new expert guest contributor to QuantStart.

In this post Imanol continues the theoretical discussion of Rough Paths and Signatures and begins applying them within a machine learning framework, utilising scikit-learn.

- Mike.

In the last article the signature of a path was introduced. As we saw, the signature is an infinite sequence defined for a continuous path. Although the signature is a rather abstract object in general, we will only work with paths $X:[0,T]\rightarrow \mathbb{R}^d$ of bounded variation, that is, paths for which the sums $\sum_{i} |X_{t_{i+1}}-X_{t_i}|$ are uniformly bounded over all partitions $0\leq t_0\leq t_1\leq \ldots \leq t_N\leq T$. For such paths the signature is simply a sequence of iterated integrals, understood in the usual Riemann-Stieltjes sense. In the previous article we also gave meaning to the signature of a stream of data. We will now exploit all those tools in order to build a very general machine learning model that will then be applied in a number of very different situations.
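To make the bounded-variation condition concrete, here is a minimal sketch in Python. It assumes the stream has been embedded as the piecewise-linear path through its sample points (the embedding used in the previous article may differ). Under that assumption the supremum over partitions is attained at the sample points themselves, so the total variation is just the sum of the segment lengths. The function name total_variation is illustrative and does not come from any package.

```python
import math


def total_variation(points):
    """Total variation of the piecewise-linear path through `points`.

    points: list of d-dimensional tuples, in the order they are visited.
    Uses the Euclidean norm for each increment.
    """
    # Sum the lengths of consecutive segments of the path.
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


# A small 2-dimensional stream: right 1, up 2, left 1.
stream = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (0.0, 2.0)]
tv = total_variation(stream)
```

Any finite stream of data gives a finite value here, which is why the paths obtained from data streams in this way always have bounded variation.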

Supervised learning

A very broad class of problems in machine learning is supervised learning. Essentially, the objective of supervised learning is to learn a function that maps an input to an output. This is done using a training dataset of known input-output pairs. It is crucial to make a good selection of inputs, known as features, in order to construct a model from these features. These features should describe the object that is being studied accurately, but at the same time very high-dimensional feature sets can cause computational and overfitting problems.

Suppose, for instance, that we want to build a classifier that predicts if a person is male or female. Height may be a good feature we could use, since men are generally taller than women, but nationality is not, since all countries have, roughly speaking, the same number of men as women. Therefore, nationality is not a feature that accurately describes the gender of an individual.

Signatures as features

In this section we will show that signatures can be effective features that can be used to create a machine learning model. We will show that, as in the previous example, the signature of a path is a good description of the path itself. Ideally, we would like to have a one-to-one correspondence between paths and their signatures. That is, we know that given a path of bounded variation we can construct the sequence of iterated integrals that define the signature. We would like to show that given one of these sequences of iterated integrals there exists a unique path with bounded variation whose signature is that sequence.

This is not true in general, but we do have a result that provides some notion of uniqueness:

Theorem 2.1. A path of bounded variation is uniquely determined by its signature, up to tree-like equivalence.

Although we will not define tree-like equivalence here (we refer to [2] for a deeper discussion of the topic), this result assures us that the signature determines the path uniquely, in a suitable sense. Therefore, the signature of a path is a good description of the path itself.

However, we cannot work with the full signature in practice. Computers can only store a finite number of terms of the sequence, so we can only work with the truncated signature defined in the previous article. Therefore, although the signature of a path describes the path well, the truncated signature may not, as the terms that are dropped could store a lot of information. The following result, however, guarantees that this is not the case.

Theorem 2.2. Let $X$ be a $d$-dimensional path of bounded variation. Then, given $1\leq i_1,\ldots, i_n\leq d$, $$\left\lVert \underset{\substack{0 < u_1 < \ldots < u_n < T}}{\int\ldots\int} dX_{u_1}^{i_1}\ldots dX_{u_n}^{i_n}\right\rVert \leq \dfrac{\lVert X \rVert_1^n}{n!}$$ with $$\lVert X \rVert_1 := \sup_{\{t_i\}\subset [0,T]} \sum_{i} |X_{t_{i+1}}-X_{t_i}|,$$ where the supremum is taken over all partitions of $[0,T]$.

The theorem shows that the high-order terms of the signature decay factorially. That is, if we only store the first few terms of the signature and drop the higher-order terms, we will not lose a lot of information, since the terms that were dropped are bounded by a quantity with factorial decay. Hence, the truncated signature of a path does describe the path well, and we can therefore use it as a feature of the path.
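The following short sketch evaluates the bound of Theorem 2.2 for a hypothetical path with $\lVert X \rVert_1 = 2$, and counts how many terms a truncated signature contains. The function names are illustrative. The count assumes the convention that the truncated signature of order $n$ includes the constant level-0 term; a given signature package may or may not follow that convention.

```python
import math


def signature_term_bound(total_variation, n):
    """Upper bound ||X||_1^n / n! on any level-n signature term (Theorem 2.2)."""
    return total_variation ** n / math.factorial(n)


def truncated_signature_size(d, n):
    """Number of terms in levels 0..n for a d-dimensional path: 1 + d + ... + d^n."""
    return sum(d ** k for k in range(n + 1))


# Bounds for levels 1..10 of a path with total variation 2.
bounds = [signature_term_bound(2.0, n) for n in range(1, 11)]

for n, bound in enumerate(bounds, start=1):
    print(f"level {n:2d}: bound = {bound:.6f}, "
          f"terms up to this level (d=2) = {truncated_signature_size(2, n)}")
```

Two things are worth noticing. The bound only starts to shrink once $n$ exceeds $\lVert X \rVert_1$, so longer paths need a higher truncation order. Meanwhile the number of terms grows like $d^n$, so the truncation order is a trade-off between information kept and feature dimension.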

A machine learning model based on signatures

Suppose we have a training set of input-output pairs $\{(X_i, Y_i)\}_{i=0}^N$, with $X_i$ a path of bounded variation and $Y_i\in \mathbb{R}$. We want to learn the function that maps each input to the corresponding output. Therefore, we will assume that inputs and outputs are related via an unknown function $f$ as

$$Y_i=f(X_i)+\epsilon_i$$ where $\epsilon_i$ is white noise. Since, as we have seen, the signature determines the path uniquely (in some sense), we can equivalently regard the output as a function of the signature. We will therefore look for a continuous function $f$ that satisfies

$$Y_i=f(S(X_i))+\epsilon_i.$$

Needless to say, finding such a function $f$ is not a trivial task. The following theorem guarantees that it suffices to look for linear functions $f$.

Theorem 3.1. Let $\mathcal{V}^1([0,T], \mathbb{R}^d)$ denote the space of $d$-dimensional paths with bounded variation defined on $[0,T]$. Let $S(\mathcal{V}^1([0,T], \mathbb{R}^d)):=\{S(X):X\in \mathcal{V}^1([0,T],\mathbb{R}^d)\}$ and $S_1\subset S(\mathcal{V}^1([0,T], \mathbb{R}^d))$ a compact set. Then, given $\epsilon > 0$ and a continuous function $g:S_1\rightarrow \mathbb{R}$, there exists a continuous linear function $L$ such that $$\lVert g(x)-L(x)\rVert<\epsilon\quad\forall x\in S_1.$$
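A small worked example may help build intuition for why linear functions of the signature are so expressive. It is the standard integration by parts identity and is not taken from the proof in [1]. Write $S^{(i_1,\ldots,i_n)}(X)$ for the iterated integral appearing in Theorem 2.2 (this shorthand is introduced only for this example). The level-1 terms are the increments, $S^{(i)}(X)=X_T^i-X_0^i$, and their product satisfies

$$S^{(1)}(X)\,S^{(2)}(X)=S^{(1,2)}(X)+S^{(2,1)}(X).$$

The left-hand side is a nonlinear (quadratic) function of the level-1 terms, while the right-hand side is a linear function of the level-2 terms. In other words, a nonlinearity in the low-order terms can be absorbed into a linear combination of higher-order terms of the signature.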

See [1] for a proof of the theorem. This result implies that the function $f$ we are looking for can be approximated arbitrarily well by linear functions. Using all these results, we can create the following model:

Given a training set $\{(R_i, Y_i)\}_{i=0}^N$ of input-output pairs, where $R_i=\{(t_{ij}, r_{ij})\}_j$ is a stream of data, we construct a new training set $\{(X_i, Y_i)\}_{i=0}^N$ by embedding each stream $R_i$ into a continuous path $X_i$ as discussed in the previous article. Using the new training set $\{(X_i, Y_i)\}_{i=0}^N$ we compute the corresponding truncated signatures of order $n\in \mathbb{N}$, $\{(S^n(X_i), Y_i)\}_{i=0}^N$. Notice that the truncated signature can be seen as a vector. Therefore, we can apply linear regression against the truncated signature using an appropriate linear regression algorithm.
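The steps above can be sketched end to end on toy data. This is a self-contained illustration under stated assumptions, not the implementation developed in the rest of the article. Each stream is embedded by linear interpolation. The signature is computed by hand up to order 2 only, and excludes the constant level-0 term. Plain least squares from NumPy stands in for the regression algorithm so that the sketch has no other dependencies. The function names truncated_signature_order2, shoelace_area, fit_linear and predict_linear are hypothetical. The toy target is the signed area enclosed by the path and its chord, which equals half the antisymmetric part of the level-2 terms, so a linear model on the signature should recover it.

```python
import numpy as np


def truncated_signature_order2(stream):
    """Levels 1 and 2 of the signature of the piecewise-linear path through `stream`.

    stream: sequence of d-dimensional points.
    Returns a vector of length d + d*d (level 1, then level 2 row by row).
    """
    path = np.asarray(stream, dtype=float)
    increments = np.diff(path, axis=0)              # one row per segment
    level1 = increments.sum(axis=0)                 # total increment X_T - X_0
    before = path[:-1] - path[0]                    # displacement before each segment
    # Cross terms between earlier and later segments, plus the
    # contribution of each straight segment with itself.
    level2 = before.T @ increments + 0.5 * increments.T @ increments
    return np.concatenate([level1, level2.ravel()])


def shoelace_area(stream):
    """Signed area of the polygon obtained by closing a 2-d path with its chord."""
    p = np.asarray(stream, dtype=float)
    q = np.roll(p, -1, axis=0)                      # next vertex, wrapping around
    return 0.5 * np.sum(p[:, 0] * q[:, 1] - q[:, 0] * p[:, 1])


def fit_linear(features, targets):
    """Ordinary least squares with an intercept (stand-in for the regression step)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, targets, rcond=None)
    return coef


def predict_linear(coef, features):
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


# Toy data: 60 random 2-dimensional streams of 10 points each.
rng = np.random.default_rng(0)
streams = [rng.normal(size=(10, 2)).cumsum(axis=0) for _ in range(60)]
features = np.array([truncated_signature_order2(s) for s in streams])
targets = np.array([shoelace_area(s) for s in streams])

# Train on the first 40 streams, evaluate on the remaining 20.
coef = fit_linear(features[:40], targets[:40])
predictions = predict_linear(coef, features[40:])
max_error = np.max(np.abs(predictions - targets[40:]))
print("largest out-of-sample error:", max_error)
```

The target here is deliberately one that is exactly linear in the order-2 signature, so the fit is essentially perfect. For real data one would expect to need a higher truncation order and some regularisation.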

Implementing the model in Python

We will now see how one can implement the model in Python. First, we will import all the libraries that will be used:

    # sig_learn.py

    import numpy as np
    from signature import *
    from sklearn import linear_model
    import numbers

The package signature computes the truncated signature of a stream of data. An open-source Python package can be found in this GitHub repository. The package scikit-learn will be used to apply regression algorithms.

We will now create a class called SigLearn, which will be the core of the model. It will be constructed as follows:

    # sig_learn.py

    class SigLearn:
        def __init__(self, order=2, alpha=0.1):
            if not isinstance(order, numbers.Integral) or order<1:
                raise NameError('The order must be a positive integer.')
            if not isinstance(alpha, numbers.Real) or alpha<=0.0:
                raise NameError('Alpha must be a positive real.')

            self.order=int(order)
            self.reg=None
            self.alpha=alpha

The constructor will take two parameters: order, which will determine the order of the signature that the model will consider, and alpha, a parameter that will be used in the regression algorithm.

The training function can now be defined as follows:

    # sig_learn.py

        def train(self, x, y):
            '''
            Trains the model using signatures.

            x: list of inputs, where each element of
               the list is a list of tuples.
            y: list of outputs.
            '''

            # We check that x and y have appropriate types
            if x is None or y is None:
                return
            if not (type(x) is list or type(x) is tuple) or not (type(y) is list or type(y) is tuple):
                raise NameError('Input and output must be lists or tuples.')
            if len(x)!=len(y):
                raise NameError('The number of inputs and the number of outputs must coincide.')
            ###

            X=[list(sig(np.array(stream), self.order)) for stream in x]
            self.reg = linear_model.Lasso(alpha = self.alpha)
            self.reg.fit(X, y)

As we see, the training function takes two arguments. The argument x is a list of inputs, where the inputs are streams of data (in the form of lists of tuples). The argument y, on the other hand, is a list of outputs that correspond to each input. After checking the type of x and y, the function builds a list of truncated signatures in order to apply Lasso (a regression algorithm) with parameter alpha.

Finally, we may build a predictor function that predicts, given a new list of inputs, the corresponding list of outputs:

    # sig_learn.py

        def predict(self, x):
            '''
            Predicts the outputs of the inputs x using the
            pre-trained model.

            x: list of inputs, where each element of
               the list is a list of tuples.

            Returns:

            list of predicted outputs.
            '''

            if self.reg is None:
                raise NameError('The model is not trained.')

            X=[list(sig(np.array(stream), self.order)) for stream in x]
            return self.reg.predict(X)

The complete code will then look as follows:

    # sig_learn.py

    import numpy as np
    from signature import *
    from sklearn import linear_model
    import numbers


    class SigLearn:
        def __init__(self, order=2, alpha=0.1):
            if not isinstance(order, numbers.Integral) or order<1:
                raise NameError('The order must be a positive integer.')
            if not isinstance(alpha, numbers.Real) or alpha<=0.0:
                raise NameError('Alpha must be a positive real.')

            self.order=int(order)
            self.reg=None
            self.alpha=alpha

        def train(self, x, y):
            '''
            Trains the model using signatures.

            x: list of inputs, where each element of
               the list is a list of tuples.
            y: list of outputs.
            '''

            # We check that x and y have appropriate types
            if x is None or y is None:
                return
            if not (type(x) is list or type(x) is tuple) or not (type(y) is list or type(y) is tuple):
                raise NameError('Input and output must be lists or tuples.')
            if len(x)!=len(y):
                raise NameError('The number of inputs and the number of outputs must coincide.')
            ###

            X=[list(sig(np.array(stream), self.order)) for stream in x]
            self.reg = linear_model.Lasso(alpha = self.alpha)
            self.reg.fit(X, y)

        def predict(self, x):
            '''
            Predicts the outputs of the inputs x using the
            pre-trained model.

            x: list of inputs, where each element of
               the list is a list of tuples.

            Returns:

            list of predicted outputs.
            '''

            if self.reg is None:
                raise NameError('The model is not trained.')

            X=[list(sig(np.array(stream), self.order)) for stream in x]
            return self.reg.predict(X)

In the next article we will make use of this package in concrete practical applications to quantitative finance, to check the power of the model.

References