If Data is the new Oil, Fake Data is Renewable Energy

Good financial data is expensive, and there isn’t that much of it. For the purpose of satiating my curiosity, generating data is fine. Also, it gives me more control over how hard the problem is, so I can slowly escalate.

I made a magic function that generates a market. Here’s what the market looks like with 10 stocks:

A make believe market of ten stocks

I think this market is easier than the real market by design: some stocks are really obviously correlated, and there is a strong autoregressive component. I’ll get to hard markets eventually, but it’s better to see if the algo can learn something basic first. Just so you believe I have a magic function, here is another market I made.

The code that makes these nice markets is in the repo on GitHub.
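
A minimal sketch of the recipe (the function make_market, its defaults, and everything else here are illustrative stand-ins, not the real thing): a random covariance matrix supplies the obvious correlations, and an AR(1) term supplies the autoregression.

```python
import numpy as np

def make_market(n_stocks=10, n_steps=500, ar=0.9, noise=0.01, seed=None):
    rng = np.random.default_rng(seed)
    # A random covariance matrix, so some stocks end up obviously correlated.
    a = rng.normal(size=(n_stocks, n_stocks))
    cov = (a @ a.T) * noise ** 2
    returns = np.zeros((n_steps, n_stocks))
    for t in range(1, n_steps):
        shock = rng.multivariate_normal(np.zeros(n_stocks), cov)
        # AR(1): today's return leans on yesterday's, plus correlated noise.
        returns[t] = ar * returns[t - 1] + shock
    # Compound the log-returns into price paths starting at 100.
    return 100 * np.exp(np.cumsum(returns, axis=0))

prices = make_market(seed=42)  # shape (500, 10)
```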

I actually started with something even simpler: I just used a sine wave, to see if I could get the model to learn at all. Once I got that running, I injected a bit of noise. This is invaluable; it’s much easier to find bugs in a dead simple, deterministic environment than in one generated by weird random math and code that is probably wrong.
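
For reference, that first environment was nothing fancier than this sketch:

```python
import numpy as np

# The dead simple starting point: one deterministic sine wave "stock".
# Bugs have nowhere to hide when you know exactly what the agent should do.
t = np.linspace(0, 20 * np.pi, 1000)
prices = 100 + 10 * np.sin(t)

# Step two: inject a bit of noise and see if learning survives.
noisy_prices = prices + np.random.normal(0, 0.5, size=prices.shape)
```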

How I Satisfied My God Complex (by Building an RL Environment)

Reinforcement learning is all about an agent interacting with an environment, as if James Bond joined Greenpeace. That’s a fancy way of saying that the environment tells the agent what the state of the world is, the agent says what to do, and then it gets a reward and uses it to improve.

Building the environment lets you play god, and playing god helps one realize why the world is so messed up. It’s really hard to get everything right, and it’s really easy to make small mistakes in the environment that ruin everything and are hard to find.

Anyway, in my role as god, I modeled the world as follows: the environment tells the agent the current price of each stock, some price history, and the current portfolio. The agent gives back the new portfolio it would like to be in.

Then the environment takes a step forward in time and moves into the agent’s desired portfolio, accounting for transaction costs. It’s a bit contrived, because the agent always deals in percentages of the account, while the environment has to translate that into dollars and shares, adjust for rounding, take transaction costs, and return the money left over from rounding (otherwise we’d get negative reward for no reason).

Here is some code for an environment that does this.
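
The real thing is in the repo; below is a stripped-down sketch of the same loop. It works directly in weights and skips the dollars-shares-rounding dance described above, and PortfolioEnv and its defaults are illustrative:

```python
import numpy as np

class PortfolioEnv:
    """Stripped-down sketch of the environment: state is (price history,
    current weights), action is the desired weights over n stocks + cash."""

    def __init__(self, prices, window=10, cost=0.001):
        self.prices = prices  # (T, n_stocks) array of prices
        self.window = window  # length of the price history in the state
        self.cost = cost      # proportional transaction cost
        self.reset()

    def reset(self):
        self.t = self.window
        self.value = 1.0
        self.weights = np.zeros(self.prices.shape[1] + 1)
        self.weights[-1] = 1.0  # start fully in cash
        return self._state()

    def _state(self):
        return self.prices[self.t - self.window:self.t], self.weights.copy()

    def step(self, target_weights):
        target_weights = np.asarray(target_weights, dtype=float)
        old_value = self.value
        # Pay proportional costs on the turnover needed to rebalance.
        turnover = np.abs(target_weights - self.weights).sum() / 2
        self.value *= 1 - self.cost * turnover
        self.weights = target_weights
        # Step forward in time: stock legs move with prices, cash stays flat.
        growth = np.concatenate(
            [self.prices[self.t] / self.prices[self.t - 1], [1.0]])
        portfolio_growth = float(self.weights @ growth)
        self.value *= portfolio_growth
        # Let the weights drift with prices so the state stays consistent.
        self.weights = self.weights * growth / portfolio_growth
        self.t += 1
        reward = np.log(self.value / old_value)  # log return, net of costs
        done = self.t >= len(self.prices)
        return self._state(), reward, done
```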

Yo, did you find a bug in that? Let me know, leave a comment or a pull request.

The Actual Learning Stuff

So RL breaks down into two flavors, policy methods and value methods. Policy methods directly learn what to do next; value methods learn how good each state or action is, and act by picking the best one. What’s generally popular, and tends to work quite well, is to mix those two, the modern variants of that mix being forms of actor-critic methods.
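
To make the mix concrete, here is what one actor-critic update looks like, sketched in modern TensorFlow (the actor and critic are assumed to be Keras models returning an action distribution and a scalar value; none of this is the repo’s actual code):

```python
import tensorflow as tf

def actor_critic_update(actor, critic, optimizer,
                        state, action, reward, next_state, gamma=0.99):
    with tf.GradientTape() as tape:
        # Value side: how much better did things go than the critic expected?
        advantage = (reward + gamma * tf.stop_gradient(critic(next_state))
                     - critic(state))
        # Policy side: push up the log-prob of actions that beat the estimate.
        policy_loss = (-actor(state).log_prob(action)
                       * tf.stop_gradient(advantage))
        # The squared advantage doubles as the critic's regression loss.
        loss = tf.reduce_mean(policy_loss + 0.5 * tf.square(advantage))
    variables = actor.trainable_variables + critic.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
```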

I think value estimation is suspicious in finance, unless you do a really clever job of defining what the returns are. If you are half-assing it and using returns or risk-adjusted returns as the reward, then it’s a super noisy signal, and that’s why I am suspicious of value functions in this context. (The astute reader will read the QLBS paper, which does beautiful things with a value function, because the authors didn’t half-ass the reward function.)

Anecdotally, when I worked with prices generated by a sine wave, actor-critic methods (which approximate the value of a state) worked much better than pure policy methods (e.g. REINFORCE), but once I moved to noisier data they seemed to make things worse, not better.

The internet is filled with endless troves of information about reinforcement learning algorithms, and GitHub has endless troves of implementations. In fact, the code I talk about here is on GitHub, and I actually copied it from Denny Britz’s repo.

Also, while the ideas behind RL are really deep, applying them is very much about good engineering and about adapting the reality of your problem to them. So for those reasons, this section stays short.

Portfolios as Actions

One thing I did do that was actually interesting, at least in the sense that I thought of it myself, was have the agent output an entire portfolio as an action. This needs a bit of motivation.

I wanted my agent’s actions to be the weights of a portfolio over n stocks and cash (so n+1 weights).

Most of the RL wisdom on the internet says to sample from a Gaussian when you want continuous actions. If you want multiple continuous actions (the weights of a portfolio), then you need to sample from a multivariate Gaussian.

Gaussians have fancy names but all sorts of unsatisfying properties; they are like the Kardashians of probability distributions. I want each stock to be assigned a weight between 0 and 1, such that all the weights sum to exactly 1. A multivariate Gaussian gives me numbers between negative infinity and infinity, with no constraints on their sum. This is not satisfying. In fact, it’s worse than not satisfying: trying to overcome it by, say, softmaxing (that’s a verb now) the output makes the agent virtually untrainable.

Besides that, the multivariate Gaussian is parameterized by the means of each variable and their covariance matrix. That’s a lot of parameters to calculate, and some fancy programming to get the output of a deep layer into a positive semi-definite symmetric matrix.
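
For the record, the fancy programming usually means predicting a Cholesky factor. Something like this sketch with TensorFlow Probability (the shapes and names here are illustrative):

```python
import tensorflow as tf
import tensorflow_probability as tfp

# Emit n means plus n * (n + 1) / 2 extra numbers, fold the extras into a
# lower-triangular matrix, force the diagonal positive, and L @ L.T is a
# valid covariance.
n = 4
raw = tf.random.normal([n + n * (n + 1) // 2])  # stand-in for a deep layer
mean, tril_flat = raw[:n], raw[n:]
scale_tril = tfp.math.fill_triangular(tril_flat)
scale_tril = tf.linalg.set_diag(
    scale_tril, tf.nn.softplus(tf.linalg.diag_part(scale_tril)))
dist = tfp.distributions.MultivariateNormalTriL(loc=mean,
                                                scale_tril=scale_tril)
sample = dist.sample()  # still unbounded, and still doesn't sum to 1
```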

So between those two problems, I’d really rather use something more fitting to the problem.

I Hate the Dirichlet Function and Love the Dirichlet Distribution

I failed calc 1 because of a question about the Dirichlet function.

Luckily, there is a much more fitting distribution to sample from. The Dirichlet distribution over K components is a distribution over the simplex: K non-negative numbers that sum to exactly 1. A portfolio of n stocks and cash is exactly such a point with K = n+1, so sampling from a Dirichlet with n+1 components is sufficient to get a portfolio.

While this isn’t built into TensorFlow itself, it is built into TensorFlow Probability, so the whole operation is as simple as something like this sketch:
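
```python
import tensorflow as tf
import tensorflow_probability as tfp

# A sketch of the operation: the policy network emits one number per asset
# (n stocks + cash), softplus turns those into positive concentration
# parameters, and the Dirichlet turns those into a point on the simplex.
network_output = tf.random.normal([5])         # stand-in for a deep layer
alpha = tf.nn.softplus(network_output) + 1e-4  # strictly positive
dist = tfp.distributions.Dirichlet(concentration=alpha)
portfolio = dist.sample()            # non-negative, sums to exactly 1
log_prob = dist.log_prob(portfolio)  # what the policy gradient needs
```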

It’s that easy.

The Problem With My Idea

My idea of sampling from the Dirichlet distribution was awesome in theory and has so far proven to be utter crap in practice. There are two phenomena I am not happy with.

First, at every step we draw a fresh sample from the distribution, which means we change positions at every step. This is inefficient, but unavoidable when you sample from a distribution. It’s compounded by the fact that we get negative reward for switching portfolios (transaction costs) but have no mechanism for staying in the same one.

Second, it seems dumb to sample a new portfolio at every step anyway. I think we should be sampling a decision to change portfolios, and then sample a portfolio only if necessary. I have a sense that we can use the variance of the distribution (or the norm of the variance vector) to make that decision, along the lines of the sketch below.
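
An entirely speculative sketch of one way that could look (the decision rule, the threshold, all of it is made up):

```python
import numpy as np

def maybe_rebalance(alpha, current_portfolio, threshold, rng):
    # Per-component variance of a Dirichlet(alpha) distribution.
    alpha0 = alpha.sum()
    variance = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1))
    if np.linalg.norm(variance) < threshold:
        # The policy is confident: a fresh sample would land near the mean
        # anyway, so stay put and save the transaction costs.
        return current_portfolio
    return rng.dirichlet(alpha)
```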

I don’t really know; this is my most interesting problem right now. To recap: it mathematically does exactly what I want, and practically it produces absolute rubbish.

Other Problems

My other problem is the reward I’m giving the model. For now it’s just the log return of the account (change in portfolio value minus transaction costs). I think in the multi-stock world it’s too noisy, but even in the single-stock sine wave world it was a bad reward. The agent learns that the easiest way to make a buck is to buy the maximum and YOLO that the market goes up (did you know that YOLO is a verb?).

This is pretty easy to fix, but I am lazy. High five to whoever adds differential Sharpe ratio rewards to the environment (Section 2.C here).
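
For the would-be high-fiver, my reading of that update is something like this sketch (A and B are exponentially-weighted estimates of the first and second moments of the per-step return; double-check against the paper):

```python
class DifferentialSharpe:
    def __init__(self, eta=0.01):
        self.eta = eta  # adaptation rate of the moving moment estimates
        self.A = 0.0    # running estimate of E[R]
        self.B = 0.0    # running estimate of E[R^2]

    def reward(self, r):
        dA = r - self.A
        dB = r ** 2 - self.B
        denom = (self.B - self.A ** 2) ** 1.5
        # Guard against the degenerate start, when variance is still zero.
        d = 0.0 if denom <= 0 else (self.B * dA - 0.5 * self.A * dB) / denom
        self.A += self.eta * dA
        self.B += self.eta * dB
        return d
```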

Again, I probably have loads of bugs. I wrote this code for fun and didn’t write tests, and honestly, there is no domain I’ve worked in that needs tests more than RL. Every bug is as bad as a possible race condition, and worse, because often you don’t know if it’s a bug or you’re just asking more of the system than you can reasonably expect. Getting this stuff to work, and trusting its results, are exercises in humility that expose every engineering bad habit and incorrect assumption you have.