At this point it was clear that we’d have to get really good at thinking about risk if we were to have any chance at this. Almost all of this was new to us: we’d had coursework in computer security, and I was tangential to the security team while I was at Facebook, but none of us had ever thought about price risk, handling real money with algorithms, or operating in an environment in which you are constantly under attack.

Around this time, a friend of mine recommended I read The Black Swan, a book by mathematician-trader Nassim Nicholas Taleb. The Black Swan examines the good and bad ways that human beings think about extreme events: low-probability, high-impact occurrences such as market crashes, bankruptcies, natural disasters, and engineering failures. The book formalized many of the intuitions around risk that we used early on.

Taleb’s work is expansive, but if we take only the most immediately relevant bits we might come up with a quick summary:

In any domain that contains extreme events, it is a mistake to bank your survival on your ability to precisely predict and avoid those events. Our knowledge of the world is imperfect, and there will always be an extreme case that is not in the historical data. Rather, we should focus our attention on the impact such events would have were they to happen, and structure our systems such that they can survive, absorb, and even grow in their presence.

Applying this to our situation, we decided that the only way we could do this was if we were willing to get hit, possibly hard, possibly often. We’d be smart and try to cut down the number of extreme events we were exposed to, but we wouldn’t put any money on our ability to avoid them. In fact, the opposite: we would expect the bad thing to happen, and for each threat under consideration, we’d focus on making sure we could survive the blow when it came. If we’d had a mantra it would have been this:

Expect the worst to happen. Plan how to survive it.

From this idea flowed hundreds of decisions about how to run the business, from how we chose which strategies to run, to how we built our monitoring systems, to how we handled our information security. The starting point for how to approach any risk in our company was the assumption that our worst fears would come true.

So here’s the punch line: we were right. Many of the threats in that model really did materialize, including four exchange failures and one critical trading bug (none of our machines were ever compromised). But none of those moments was fatal; in fact, the worst caused a loss of just 4%. We made it through the minefield bruised but undefeated.

Here are two stories about specific threats we faced and how we survived when disaster struck.

Case Study I: The day a bug almost gave away all our money

One night in 2015, I was winding down watching Buffy the Vampire Slayer at about 2am. My phone buzzed and I glanced over to see the home screen covered in PagerDuty alerts — I was on call that week. I got over to my desk quickly and saw that we were bleeding money at a disturbing rate: about 0.1% per minute. We’d be down 10% by the end of the hour. It was instantly clear that this was the worst trading error we had ever encountered, and if we didn’t stop it the company would be dead by the morning.

Of course, that didn’t happen. As soon as I understood the effect, I shut down all trading, closed our open positions, and began to survey the damage and diagnose the issue. We had lost a total of 0.3%. The next day, we discovered that the cause of the issue was a bizarre interaction between one of our redundancy systems and unexpected time-out behaviour from one particular exchange. It was a complex set of interactions that could not have been predicted ahead of time. We fixed the issue and were trading full-bore again by the end of the day.

There are two reasons why we survived that day and did not become a mini-Knight Capital. The proximate cause was that we had obsessive, redundant, default-fail monitoring of every aspect of our trading systems. The issue that occurred was detected by three different monitoring processes on two different machines. If my phone had been dead or I otherwise wasn’t able to catch the issue, within five minutes one of my colleagues’ phones would have been jumping off the nightstand.

But the ultimate cause that we woke up that day down 0.3% and not 20% is that we expected that one day we would push a critical trading bug into production. Expect the worst to happen. Plan how to survive it. We knew that we were fallible, and that engineers with far more experience than us had blown up their companies with such errors. We didn’t know how or when, but we knew there were things we could do to minimize the damage when it happened: catch it early, have redundancies, have simple kill-switches ready. In the end, this was a scary moment but it wasn’t a near miss. We were just prepared.
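The shape of that kind of default-fail monitoring can be sketched in a few lines. This is a hypothetical illustration, not our actual system: the names, thresholds, and inputs are all invented. The key property is that every unknown state — missing data, a stale heartbeat, garbage input — resolves to "halt trading", never to "carry on".

```python
# Hypothetical sketch of a default-fail kill switch. Trading is allowed only
# while a fresh heartbeat AND an acceptable loss rate are both confirmed;
# anything unknown or stale defaults to halting.

LOSS_RATE_LIMIT = 0.001   # halt if losing more than 0.1% of NAV per minute
HEARTBEAT_TIMEOUT = 30.0  # halt if the monitor itself goes silent (seconds)

def should_halt(nav_history, last_heartbeat, now):
    """nav_history: list of (timestamp, nav) samples, oldest first."""
    # Default-fail: with too little data or a stale heartbeat, we halt.
    if len(nav_history) < 2 or now - last_heartbeat > HEARTBEAT_TIMEOUT:
        return True
    (t0, nav0), (t1, nav1) = nav_history[-2], nav_history[-1]
    minutes = (t1 - t0) / 60.0
    if minutes <= 0 or nav0 <= 0:
        return True  # garbage input also defaults to halt
    loss_rate = (nav0 - nav1) / nav0 / minutes
    return loss_rate > LOSS_RATE_LIMIT
```

Running checks like this as several independent processes on separate machines means no single dead phone, dead process, or dead box can leave a runaway bug unsupervised.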

Case Study II: Trusting people you don’t trust


Of all the items in that threat model, the one that kept us awake at night was the fear that an exchange would blow up while we had a pile of cash sitting with them. Exchanges were dropping like flies at the time. Mt. Gox exploded in 2014. BTC-E, one of the biggest exchanges of the day, was likely run by the Russian underworld. In order to do the trading that we wanted to do, we had to hand off almost all of our funds to organizations like this.

So we went back to first principles and figured out how to survive. There were two approaches to the problem of exchange failures, whatever their cause: try to avoid them, and try to survive them.

Avoiding failures meant predicting them, and we already knew we couldn’t do this precisely (which exchanges will fail at what times). What we could do was make some negative bets: “this exchange is likely to fail sometime in some way” and avoid those exchanges altogether. To do this we came up with a checklist of positive conditions that any exchange would have to meet in order for us to trust them with any amount of money. Most of these rules were simple and easy to check, like:

Must have a strong relationship with a credible bank
Must have fiat currency deposits and withdrawals
Must have a human contact we can get on the phone within a day
…

Some of these might seem trivial on their own, but the advantage of checklists is that they can build a high-information picture of a system with low cognitive load. Consider the number of plane crashes or operating-room mistakes that have been avoided using checklists. For our purposes this meant that whenever we started thinking about a new exchange, the question was simple: does it pass the checklist or not? You’re much less likely to make a dumb mistake answering a simple question than a complicated one. Similarly, we could regularly audit our exchange roster and boot any one which fell below the waterline.
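That binary quality — pass or fail, nothing in between — is simple enough to sketch in code. The conditions below are invented examples standing in for the real list, and the field names are hypothetical:

```python
# Hypothetical sketch of a binary trust checklist: an exchange either passes
# every condition or it is not trusted with funds. Conditions and field
# names are invented for illustration.
CHECKLIST = [
    ("credible banking relationship", lambda ex: ex.get("has_bank", False)),
    ("fiat deposits and withdrawals", lambda ex: ex.get("fiat_rails", False)),
    ("human contact reachable within a day",
     lambda ex: ex.get("contact_sla_days", 99) <= 1),
]

def passes_checklist(exchange):
    """Return (ok, failed_conditions); ok is True only if nothing failed."""
    failed = [name for name, check in CHECKLIST if not check(exchange)]
    return (len(failed) == 0, failed)
```

Note that missing information counts as a failure — an exchange you can't evaluate is an exchange you don't trust. Re-running the same function over the whole roster is also what makes the periodic audit cheap.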

But we weren’t going to bank our survival on avoiding exchange failures. Expect the worst to happen. Plan how to survive it. We still had to be prepared to survive if and when an exchange got hit while we had funds at risk.

We decided to do this by setting strict limits on how much money could be kept in each account. Working backwards from the assumption that we would get hit, the first conclusion was that we could never have a single point of failure on our exchange roster. Even if the most trusted exchange failed with a complete loss of funds, we should be able to continue operating. So we set a single global restriction, that we’d never have more than x% of our assets in any account, and set x such that the loss would hurt badly, but we could survive.

Then we went through each exchange individually and set an even more restrictive limit based on any extra information that wasn’t already captured in our trust checklist. Sometimes this was something specific, sometimes it was semi-inside information (the crypto world was small at the time), and other times it was just our own intuition.
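The two layers of limits compose naturally: the binding cap on any exchange is the tighter of the global ceiling and its individual limit. Here is a minimal sketch, with all numbers and exchange names invented for illustration:

```python
# Hypothetical sketch of layered account caps: a global ceiling plus tighter
# per-exchange limits. All figures are invented for illustration.
GLOBAL_CAP = 0.15  # never more than 15% of total assets on any one exchange
PER_EXCHANGE_CAP = {
    "exchange_a": 0.15,  # full trust, up to the global cap
    "exchange_b": 0.05,  # tighter limit based on extra information
}

def max_deposit(exchange, total_assets, current_balance):
    """Largest additional deposit allowed without breaching the cap."""
    # Default-fail: an exchange with no assigned limit gets a cap of zero.
    cap = min(GLOBAL_CAP, PER_EXCHANGE_CAP.get(exchange, 0.0))
    return max(0.0, cap * total_assets - current_balance)
```

An exchange absent from the list gets a cap of zero, so nothing can flow to an account that was never explicitly approved — the same expect-the-worst default as everywhere else.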

It was always tempting to break these limits. Every day there was some profit somewhere that we weren’t taking, only because we refused to risk more assets on that exchange. But we stuck to our account limits religiously, because that was the only way this worked. Fraud often looks like opportunity, and many of those temptations were traps of one kind or another. If we compromised enough times based on short-term emotions, we knew we’d eventually find ourselves caught out with paper profits on a dead exchange.