The hot topic in the Ethereum space this week is bugs. Or rather, one bug. An attacker stole 3.6M Ether from TheDAO, using a variation on the “race-to-empty” bug that was recently publicized. While this is not the first bug discovered in a smart contract, it is so far easily the most costly.

In his response to the bug, Slock’s COO expressed shock, referring to it as “unthinkable”, and pointing to the “thousands of pairs of eyes” that somehow missed this. It’s certainly hard to blame anyone for being shaken by the sudden disappearance of tens of millions of dollars. However, this natural reaction hides the simple truth that anyone who has dabbled in programming knows: bugs in programs are far from unthinkable — they are inevitable.

Bugs are everywhere

Programmer lore holds that the average number of bugs in a program is somewhere in the range of 10–50 per 1000 lines of code. While I wouldn’t bet my life on those exact numbers, they capture a crucial truth, which is that bugs are the rule, not the exception. I haven’t tried to count the lines of code in Slock’s “Standard DAO Framework”, but it is at least several hundred, meaning that our programmer lore suggests we shouldn’t be surprised to find double-digit numbers of bugs in it.

And that’s just the application itself. The worst bugs often come from the tools you are using. The DAO code was written in Solidity, and the Solidity Compiler is itself thousands of lines long. If we apply our rule of thumb again, that suggests dozens or hundreds of bugs may be lurking there, any of which could have affected the DAO’s behavior.

All of this to say: it was inevitable that the DAO would have bugs. The only question was how long they would take to be discovered, and who would get there first.

So what do we do?

In light of this fact, some have decided that Ethereum is doomed. Others believe that while Ethereum is not quite doomed, the Solidity language is too unsafe to use. I am not quite so pessimistic. Yes, the DAO had a bug. Yes, the next DAO will too. But while bugs are a fact of life, the disasters they cause need not be.

In the rest of this post, I’ll be looking at some strategies for writing fault-tolerant smart contracts. We cannot avoid bugs, but we can find strategies to mitigate them. If we can find ways to defend our contracts from their inevitable bugs, Ethereum — and even Solidity — may yet be saved.

Strategies for fault-tolerant smart contracts

All of the code in this post can be found in this GitHub repo, along with some tests verifying the basic functionality. I’m offering a 0.5 ETH bounty for each significant bug found in any of the code. I’m sure there will be some!

The following is the (buggy!) contract we’ll be using as the basis of our examples. It implements a simple Ether-backed token. You can deposit Ether in exchange for BrokenTokens, transfer them to other users, and cash them back out for Ether. This pattern is very similar to Maker’s Ether wrapper which was recently found to be buggy, and has some features in common with the DAO as well.

There are two vulnerabilities (that I’m aware of!) in these 26 lines of code. One is in the deposit function, which lets you specify how many tokens you will receive, and doesn’t check that you deposited the correct amount of Ether. (Instead, it should just give you tokens based on the actual amount you sent.) The other one is in withdraw. It has the same race-to-empty vulnerability found in the Maker wrapper and the DAO, meaning that a malicious user can drain all of its Ether with only a small number of tokens.
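To make the two vulnerabilities concrete, here is a minimal sketch in Python rather than Solidity. It is a toy model, not the contract from the repo: the class name, the `on_receive` callback (standing in for the recipient's code running when Ether is sent to it), and the explicit `value_sent` argument are all assumptions made for illustration.

```python
class BrokenToken:
    """Toy Python model of a buggy Ether-backed token (hypothetical, not the original code)."""

    def __init__(self):
        self.ether = 0          # Ether held by the contract
        self.total_supply = 0   # tokens issued so far
        self.balance_of = {}    # holder -> token balance

    def deposit(self, sender, value_sent, amount):
        # BUG 1: `amount` is chosen by the caller and never compared to value_sent.
        self.ether += value_sent
        self.balance_of[sender] = self.balance_of.get(sender, 0) + amount
        self.total_supply += amount

    def withdraw(self, sender, amount, on_receive=None):
        if self.balance_of.get(sender, 0) < amount or self.ether < amount:
            return False
        # BUG 2: the Ether is sent, and the recipient's code runs, before the
        # sender's token balance is reduced, so the recipient can re-enter.
        self.ether -= amount
        if on_receive is not None:
            on_receive()
        self.balance_of[sender] -= amount
        self.total_supply -= amount
        return True


def demo_race_to_empty():
    """An attacker holding 1 token drains all 10 Ether by re-entering withdraw."""
    token = BrokenToken()
    token.deposit("victim", value_sent=9, amount=9)
    token.deposit("attacker", value_sent=1, amount=1)
    payouts = []

    def reenter():
        payouts.append(1)
        if token.ether > 0:
            token.withdraw("attacker", 1, on_receive=reenter)

    token.withdraw("attacker", 1, on_receive=reenter)
    return token.ether, len(payouts)


print(demo_race_to_empty())  # (0, 10)
```

The deposit bug needs no demo function: calling `deposit("mallory", value_sent=0, amount=100)` credits 100 tokens for nothing. In this model the attacker's balance ends up negative after the nested calls unwind, whereas an unsigned integer in a real contract would behave differently, so treat the sketch as an illustration of the ordering problem only.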

If we’re the unfortunate authors of this contract, we would like a way to salvage the situation. If possible, we’d like to find the bugs before attackers do. If not, then we’d like to minimize the damage the attackers can do when they get there first. And either way, we’d like to be able to react to the bug effectively, before or after it’s exploited. Let’s take these three goals in reverse order.

Reacting to bugs

In some cases, the good guys get lucky, and find the bug first. This happened to Maker, who found their bug before any bad guys got to it. Even if not, the attack may happen slowly enough for a response, as happened with the DAO.

When we get a chance to respond, we need to be able to act quickly to preempt the danger. Unfortunately, in both of the cases I mentioned, the contract had no built-in way to do so. Instead, the defenders had to launch their own attack, and steal the Ether before anyone else could. While this worked, we can’t rely on being able to white-hat our way out of all bugs. Doing so may not always be possible, whether due to time pressure or special requirements for exploiting the bug.

Another obvious response would be to upgrade the contract and fix the bug. It’s out of the scope of this article to discuss how to write upgradeable contracts, but Elena Dimitrova has written an excellent post on one method for doing so, and I intend to write another in the coming weeks about another method which allows for upgradeable contracts with a constant address. If we want to be able to fix bugs, some form of upgradeability is necessary.

However, in many cases upgrading the contract would take too long to serve as an emergency response. The DAO had an upgrade path, but it required a minimum of 14 days of public debate. Plenty of time for an attacker to notice the upgrade, figure out what bugs it fixed, and exploit them. And certainly not good enough if the attack has already started.

Instead, a good option may be to have an emergency stop. Much like an emergency stop on a piece of mechanical equipment, this brings everything to a halt, preventing further damage from being done. Here’s an example based on our buggy token:

The crucial lines are 7 and 8, which define two types of function for this contract. The first are functions which are shut down if the emergency stop is pulled, limiting potential damage. The buggy deposit function is one of these. If it starts to be exploited, or a white hat finds its bug first, it can be shut down so it doesn’t do any (more) damage.

The other type of function is one that is only active in case of emergency. This allows you to specify a small set of simple actions that can be taken to deal with the danger. In many cases, this should be your upgrade path: shut everything down until the bug is fixed. If you don’t have an upgrade path, then it can instead be a function that allows you to gracefully wind the contract down, in this case by letting everyone withdraw their money. (Note that if the withdraw function is buggy, this won’t help much! But for bugs in the rest of the code, this will allow you to prevent further damage.)
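The two kinds of function can be sketched as a pair of guards. This is a Python model under assumptions, not the Solidity from the repo: the decorator names `stop_in_emergency` and `only_in_emergency`, the `curator` role, and the simplified (non-buggy) `deposit` are all hypothetical.

```python
import functools


class EmergencyStateError(Exception):
    """Raised when a function is called in the wrong emergency state."""


def stop_in_emergency(fn):
    """Guard: the function is disabled once the stop has been pulled."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if self.stopped:
            raise EmergencyStateError(f"{fn.__name__} is disabled during an emergency")
        return fn(self, *args, **kwargs)
    return wrapper


def only_in_emergency(fn):
    """Guard: the function is available only after the stop has been pulled."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        if not self.stopped:
            raise EmergencyStateError(f"{fn.__name__} is only available in an emergency")
        return fn(self, *args, **kwargs)
    return wrapper


class EmergencyStopToken:
    def __init__(self, curator):
        self.curator = curator
        self.stopped = False
        self.ether = 0
        self.balance_of = {}

    def emergency_stop(self, sender):
        if sender != self.curator:
            raise PermissionError("only the curator can pull the stop")
        self.stopped = True

    @stop_in_emergency
    def deposit(self, sender, value_sent):
        self.ether += value_sent
        self.balance_of[sender] = self.balance_of.get(sender, 0) + value_sent

    @only_in_emergency
    def withdraw_all(self, sender):
        amount = self.balance_of.get(sender, 0)
        self.balance_of[sender] = 0   # update state before paying out
        self.ether -= amount
        return amount
```

Keeping the emergency-only path this small is the point: the less code that stays live after the stop is pulled, the fewer places a second bug can hide.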

Mitigating the damage

Now we have a couple of strategies for reacting to bugs, but reacting is no good if the attacker has already stolen everything. This section therefore offers some strategies for mitigating or even preventing the damage, even if the attackers get there first.

The general strategy here is what Vitalik calls “defense in depth”. We want to have multiple layers of security working together, so that even if one part fails, the others will kick in to limit the damage.

If we’re dealing with outflows of money, an obvious strategy is to try to limit the amount that can be lost at any one time. The following CircuitBreaker contract allows money to flow through it freely in most cases, but if it notices too much money moving at once, it will pause the flow to allow it to be looked at. The curator (who may be a person, a group of people, or even an entire DAO) can cancel large transfers if they are fraudulent.

To use this contract, configure it with the parameters that are right for your system, and pass any outgoing flows of money through it. You can imagine how much smaller the losses could have been if this were attached to the DAO! (In the case of the DAO, reasonable parameters might have been something like 100,000 ETH every 24 hours. This would allow the vast majority of splits and a good chunk of proposals to go through normally, while large splits and proposals would be unlikely to be severely inconvenienced by waiting a day.)
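As a rough illustration of the idea, here is a Python model of a rate-limited outflow. It is a sketch under assumptions rather than the contract from the repo: the parameter names (`limit`, `period`, `delay`), the explicit `now` argument standing in for the block timestamp, and the tuple return values are all invented for this example.

```python
class CircuitBreaker:
    """Toy model: free flow up to `limit` per `period`, larger flows are held."""

    def __init__(self, curator, limit, period, delay):
        self.curator = curator
        self.limit = limit            # max amount released per window
        self.period = period          # window length, in arbitrary time units
        self.delay = delay            # waiting time for held transfers
        self.window_start = 0
        self.spent_in_window = 0
        self.pending = {}             # transfer id -> (recipient, amount, release_time)
        self.next_id = 0

    def transfer(self, recipient, amount, now):
        # Start a fresh fixed window once the current one has elapsed.
        if now - self.window_start >= self.period:
            self.window_start = now
            self.spent_in_window = 0
        if self.spent_in_window + amount <= self.limit:
            self.spent_in_window += amount
            return ("sent", recipient, amount)
        # Too much money moving at once: hold the transfer for review.
        transfer_id = self.next_id
        self.next_id += 1
        self.pending[transfer_id] = (recipient, amount, now + self.delay)
        return ("held", transfer_id)

    def cancel(self, sender, transfer_id):
        if sender != self.curator:
            raise PermissionError("only the curator can cancel a held transfer")
        return self.pending.pop(transfer_id)

    def release(self, transfer_id, now):
        recipient, amount, release_time = self.pending[transfer_id]
        if now < release_time:
            raise ValueError("waiting period has not elapsed")
        del self.pending[transfer_id]
        return ("sent", recipient, amount)
```

With `limit=100` and `period=24`, a transfer of 60 goes straight through, a second transfer of 60 in the same window is held, and the curator then has `delay` time units to cancel it before anyone can call `release`.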

Of course, there are many variations on this general concept. Instead of requiring a waiting period, it could require the active approval of the curator. It could look at a percentage of the total capital rather than a fixed amount. It could calculate a moving average instead of a fixed window of time. Regardless, it is valuable to have something watching for unusual activity and reacting to it.

A related strategy is to try to notice unusual behavior by checking properties that are expected to be invariant. Of all the strategies in this post, this may be my favorite, as it has the potential to not only mitigate, but entirely stop large classes of bugs.

The following is a version of our BrokenToken, which keeps an eye on a single invariant: the ratio between its Ether balance and the total number of tokens it has issued. If that ratio gets out of whack, it knows something has gone wrong, and reverts the transaction.

Amazingly, line 7 is sufficient to fix both of our bugs at once. If someone attempts to claim more tokens from deposit than they deserve, the totalSupply will be raised above the Ether balance, and the transaction will be reverted. I leave it as an exercise to the reader to figure out why the same invariant also fixes the race-to-empty bug.
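The invariant check can be sketched as a wrapper that snapshots state, runs the function, and rolls back if the Ether balance has fallen below the token supply. This is a Python model with hypothetical names (`checks_invariant`, `GuardedToken`), not the original contract, and it deliberately exercises only the deposit case: as the edit note below records, a variant of race-to-empty can still get past this kind of check.

```python
import functools


class InvariantViolation(Exception):
    """Raised when a call leaves the token under-collateralized."""


def checks_invariant(fn):
    """Roll back and raise if the call leaves ether < total_supply."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        snapshot = (self.ether, self.total_supply, dict(self.balance_of))
        result = fn(self, *args, **kwargs)
        if self.ether < self.total_supply:
            # Model of a reverted transaction: restore the earlier state.
            self.ether, self.total_supply, self.balance_of = snapshot
            raise InvariantViolation("ether balance fell below token supply")
        return result
    return wrapper


class GuardedToken:
    def __init__(self):
        self.ether = 0
        self.total_supply = 0
        self.balance_of = {}

    @checks_invariant
    def deposit(self, sender, value_sent, amount):
        # Still buggy: `amount` is never compared to value_sent.
        self.ether += value_sent
        self.balance_of[sender] = self.balance_of.get(sender, 0) + amount
        self.total_supply += amount
```

An honest `deposit("alice", 5, 5)` passes, while `deposit("mallory", 0, 100)` raises `InvariantViolation` and leaves the balances exactly as they were.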

~~~

EDIT 7/12/2016: On GitHub, Veox pointed out that there’s a version of race-to-empty that still works, given this implementation. Link to discussion.

~~~

A word of caution, though: If something causes an invariant to get permanently broken, then your contract will get locked down even for legitimate transactions. Make sure you pair this with an effective upgrade path or graceful failure mode, which is not blocked if the invariants fail.

There are of course many more possibilities for building defense in depth. An extreme example would be the strategy NASA is purported to sometimes use, where the program is written separately by multiple teams who use different languages and programming paradigms, and do not look at each other’s code. By comparing the outputs of the implementations, a result corrupted by a bug that is present in only one of them can be detected and safely discarded. This strategy is currently hard to implement on Ethereum as most smart contracts are written in Solidity or the very similar Serpent, but could become quite effective in the future.

Finding bugs

By using defense in depth, you make it substantially harder for an attacker to find an effective attack, because they need to find a confluence of several bugs that work together to pierce every layer of defense. However, in order to prevent them from simply stockpiling the bugs they find, it would be ideal if the good guys were finding and fixing bugs as fast as the bad guys are.

To that end, I’d like to present the last idea in this post, which is to provide automated bug bounties for your contract. Much like an ordinary bug bounty, these offer a reward to white-hat hackers who find bugs in your code. Unlike an ordinary bug bounty, they can happen automatically and anonymously, and trigger other events on the blockchain (such as your emergency stop!).

Anyone who wishes can ask this contract for their own private target version of BrokenToken. If they can then demonstrate a path by which to do something they shouldn’t be able to do, they can claim the bounty and automatically receive the award.

In addition to incentivizing bug reports, this allows us to act immediately when the bug is discovered. The bounty could easily be given the power to pull an emergency stop, “let the developers back in” for upgrades, or otherwise take some defensive action in reaction to the bug disclosure. This provides a decentralized way for responsible disclosure of bugs to happen, and makes bug-hunting substantially more rewarding for white-hats.
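Here is one way the mechanics could look, again as a Python model rather than real contract code. Everything named here is an assumption for illustration: the `Bounty` and `Target` classes, the choice of "Ether balance at least equal to token supply" as the property a hunter must break, and the `on_claim` hook standing in for an emergency stop.

```python
class Target:
    """A private copy of the (buggy) token handed to one bug hunter."""

    def __init__(self):
        self.ether = 0
        self.total_supply = 0
        self.balance_of = {}

    def deposit(self, sender, value_sent, amount):
        # The same deposit bug as before: `amount` is never checked.
        self.ether += value_sent
        self.balance_of[sender] = self.balance_of.get(sender, 0) + amount
        self.total_supply += amount


class Bounty:
    def __init__(self, reward, on_claim=None):
        self.reward = reward
        self.on_claim = on_claim   # hypothetical hook, e.g. pull an emergency stop
        self.targets = {}          # hunter -> their private Target
        self.claimed = False

    def create_target(self, hunter):
        target = Target()
        self.targets[hunter] = target
        return target

    def claim(self, hunter):
        target = self.targets.get(hunter)
        if self.claimed or target is None:
            return 0
        if target.ether >= target.total_supply:
            return 0               # invariant still holds: nothing was demonstrated
        self.claimed = True
        if self.on_claim is not None:
            self.on_claim()        # react as soon as the bug is proven
        return self.reward
```

A hunter calls `create_target`, breaks the invariant on their own copy (for instance with an over-claiming deposit), and then calls `claim`. Because the proof happens on a private target, the live contract's funds are never touched, and the hook fires at the moment of disclosure.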

Building a full system

Each of the strategies I’ve shown has its own strengths and weaknesses. A full security plan will require a suite of strategies working together to provide protection against bugs. In many cases, these strategies are much stronger in combination than apart: an emergency stop is much more useful when paired with a circuit breaker that slows things down enough for the stop to be pulled, and a strict set of invariants is much less brittle if paired with a solid upgrade path. When building smart contracts, consider not just the individual parts, but the strength of the entire system. Aim for resiliency, redundancy, and as much depth to your security as you can manage.

Conclusion

As we’ve learned very clearly over the past few weeks, bugs are both a fact of life for programmers, and a serious danger to any smart contract. I hope I’ve shown that these two facts can be reconciled, by building fault-tolerant systems which can survive the inevitable discovery of bugs. If the Ethereum ecosystem is going to be a part of the world economy, let’s make it as safe and reliable as we possibly can.

I originally collected these strategies while working on securing the Ownage.io platform.