How Coinbase Builds Secure Infrastructure To Store Bitcoin In The Cloud

Or how to build paranoid infrastructure on top of AWS

Three and a half years ago, Coinbase launched using a simple hosting platform: Heroku.

It was the right solution at the time. With just two technical founders building the product (neither with any serious dev-ops experience) we knew that Heroku would be more battle tested than any solution we could hack together on our own.

But we also knew this wouldn’t work forever. Early in our company’s history, we started to contemplate the next version of our infrastructure, which would run inside AWS. It had to be built from the ground up with security in mind (the most common way that bitcoin companies die is through theft and hacking), but we didn’t want to compromise on engineering happiness and productivity.

After about a year, we finally completed the transition, and we’ve been running inside AWS for quite some time now. This post outlines some of what we learned during the transition. It can be used as a starting point for building paranoid and productive infrastructure in the cloud.

Today, Coinbase securely stores about 10% of all bitcoin in circulation.

Disclaimer: Though we discuss some of our security measures below, our security measures are continually evolving. These are just a few measures that have existed at one point in our growth. For more on our approach to security, see this YouTube talk.

Layered Security & No Single Point Of Failure

Two of the most important principles we followed when designing our infrastructure are layered security and eliminating single points of failure. Both of these concepts encourage you to not put all your eggs in one basket. Instead, strive for redundancy and consensus amongst multiple parties. These concepts are used heavily in bank security, nuclear launches, certificate authorities, corporate governance, and even human resources.

Separate passwords and second factor tokens amongst different people.

A simple example of this in practice is securing your administrator account on AWS with a two factor token that is controlled by a second person. If one person controls the password to the account, give the second factor token to another party. Store the second factor in a vault or safe deposit box off site to add physical security on top of the cryptographic security.

This can prevent a single person from maliciously (or accidentally) ending the company.

Lock Down Production Access

The developers on your team should not have (or need) production SSH access to do their regular work (deploying code, spinning up new services, debugging, etc).

However, it is difficult to entirely eliminate the need for SSH access. Some people in the company will always need a way to debug obscure problems. When people do need SSH access, here is how you can lock it down:

Add two factor to your SSH

Every SSH session should require a second factor. You can use Duo two factor authentication for SSH, which pushes an approval request to your phone, or a FIDO U2F key, which is like a small hardware security module on a USB stick. As mentioned above, if you don’t want anyone to be able to unilaterally SSH into production, you can even separate these keys and require all SSH sessions to be “pair programmed”.

Use special laptops for SSH access

You may want to avoid being able to SSH into production using your regular laptop. Getting malware on a laptop due to spear phishing has been responsible for many (if not most) of the high profile breaches we’ve seen in recent years. People often assume that hacks are caused by 0-day vulnerabilities or other sophisticated techniques. But in reality, simple spear phishing (clicking on spoofed links in emails) is far more likely to get you. We’ve seen attackers dedicate 6 months or more to establishing relationships, all with the intent of spear phishing.

Set aside some special machines in the office that are in a locked room, that you only use for SSH access. Throw a Dropcam in the room to record who enters and leaves. Don’t use the machines in the room to open email, or browse the internet (you can use your regular laptop for this). You should probably wipe these special SSH machines on a regular basis as well.

Heavily Audit SSH Access

Set up bastion hosts that all SSH requests must run through. Restrict who has access to these least-privilege hosts and let the team know when they’re accessed (via Slack notifications). You can also wake people up (PagerDuty) when certain commands are issued. To prevent untraceable actions once someone is inside, durably log every action and keystroke that goes through the bastion host. Coinbase wrote a custom piece of software to handle this portion, and it may be something we open source in the future. The storage of the SSH logs is just as important, since they often contain sensitive info. We run a separate disaster recovery environment that guarantees storage of every action in our environment for at least 10 years. Immutable logging is important because, if a breach ever happens, it gives you an audit trail for finding the root cause.

Limit SSH access to people who are less likely to steal

Set up special rules around who gets production access. For each employee who gets access, run a background check on them to check for criminal records, make copies of their driver’s license and passport, and you may even want to collect a copy of their fingerprints. Make sure you have everything you need to issue an arrest warrant if something ever goes wrong. This one can be controversial, but you may want to only grant production access to people who are citizens of the country where you operate, especially if they have family ties there. Most people would be unwilling to steal $1M if it meant never being able to see their friends and family ever again. You want to create a culture where production access is taken very seriously. It should come with a great deal of responsibility and oversight.

Cold storage

If you have any particularly sensitive keys (in our case, bitcoin private keys) try storing them entirely offline (air-gapped). Coinbase early on made a decision to store the vast majority (98%+) of customer bitcoin entirely offline, in safe deposit boxes.

Air-gapped security.

Version one looked like this. Just some USB drives (and paper backups) stored in a safe deposit box at a local bank.

We’re now on version three of our cold storage and it has come a long way. Keys are generated offline in a secure environment, and split using Shamir’s secret sharing. Each private key is divided into parts, and some subset of the pieces is required to restore the secret. This way some pieces can be lost and the secret is still recoverable (redundancy). It also requires a quorum of key-holders to come together to restore a key (consensus).

Key holders are geographically distributed and follow a protocol during key signing ceremonies to verify their identity and assure the integrity of the ceremony.

Here is a simple example of generating a 2 of 3 key (where at least 2 of the 3 pieces are required to recombine the secret) using Hashicorp’s open source Vault Project. You can require 5 of 10 pieces or any threshold you’re comfortable with.
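To illustrate the idea behind a 2 of 3 split independently of Vault, here is a minimal Python sketch of Shamir's secret sharing over a prime field. It is not Vault's code and not our cold storage implementation. The names `split_secret`, `recover_secret`, and `PRIME` are made up for this example, and the secret is assumed to be an integer smaller than the prime.

```python
import secrets

# Assumption for this sketch: a Mersenne prime larger than any 256-bit secret.
PRIME = 2**521 - 1


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("threshold must be between 1 and the number of shares")

    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        # Horner's rule, reducing modulo PRIME at every step.
        result = 0
        for coeff in reversed(coeffs):
            result = (result * x + coeff) % PRIME
        return result

    # Each share is a point (x, f(x)) with a distinct non-zero x.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Recover the constant term f(0) by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Calling `split_secret(key, threshold=2, shares=3)` yields three points, and passing any two of them to `recover_secret` returns `key`. A single share on its own reveals nothing about the secret, which is what provides the consensus property described above.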

Log All The Things

In addition to logging production access (as mentioned before), you should log everything happening across all containers in your infrastructure.

It is critical to have a good audit trail if there ever is an incident. The only thing worse than being hacked is being hacked without knowing how it happened. Your only option then is to hope you’ve patched the right thing, and relaunch with fingers crossed (guess and check).

Great logging also creates a deterrent against theft. People are less likely to steal if they feel there is a chance they will get caught.
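One common way to make an audit trail tamper-evident is to chain entries together by hash, so that editing or deleting an old entry breaks every hash after it. The sketch below shows that general technique in Python. It is an illustration only, not a description of our logging software, and the names `append_entry`, `verify_chain`, and `GENESIS` are hypothetical.

```python
import hashlib
import json

# Assumption for this sketch: the first entry chains to a fixed all-zero hash.
GENESIS = "0" * 64


def _digest(actor: str, action: str, prev: str) -> str:
    """Hash the entry body together with the previous entry's hash."""
    body = {"actor": actor, "action": action, "prev": prev}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log: list[dict], actor: str, action: str) -> dict:
    """Append an entry whose hash covers its content and its predecessor."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {
        "actor": actor,
        "action": action,
        "prev": prev,
        "hash": _digest(actor, action, prev),
    }
    log.append(entry)
    return entry


def verify_chain(log: list[dict]) -> bool:
    """Return True only if no entry was modified, removed, or reordered."""
    prev = GENESIS
    for entry in log:
        expected = _digest(entry["actor"], entry["action"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

A chain like this only detects tampering. It does not prevent it, so in practice the latest hash would also need to be stored somewhere the person being audited cannot write to.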

Designing an environment focused on low-latency and high variety logs required a new design for our new infrastructure. To reduce the complexity of logging, we wanted to push all of our logs through one place that could be consumed in many ways. Running bitcoin nodes around the world required our logging endpoints to be accessible across many networks. To minimize the complexity of adding log producers and consumers, we now pipe every event across Coinbase through a streaming, distributed log (Kinesis) that provides flexible at-least-once guaranteed processing and a multi-day buffer of data that can be replayed as needed.

We run a fleet of Docker containers that process the entirety of this pipe to perform a variety of transformations and evaluations, and to transfer data to more permanent homes for archival, search, and more.
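At-least-once delivery means a consumer can see the same record more than once, for example after a retry or a replay of the buffer, so consumers generally need to be idempotent. Below is a minimal in-memory Python sketch of that pattern. It does not call the Kinesis API, and `Record`, `IdempotentConsumer`, and the in-memory `seen` set are assumptions made for illustration (a real consumer would track processed records durably).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Record:
    """A log event with a unique identifier assigned by the stream."""
    sequence_number: str
    payload: str


@dataclass
class IdempotentConsumer:
    """Processes each record once, even if the stream delivers it repeatedly."""
    seen: set = field(default_factory=set)
    archived: list = field(default_factory=list)

    def handle(self, record: Record) -> bool:
        """Return True if the record was processed, False if it was a duplicate."""
        if record.sequence_number in self.seen:
            return False
        # Do the work first, then mark the record as done. If the process
        # crashed in between, the record would be redelivered and retried.
        self.archived.append(record.payload)
        self.seen.add(record.sequence_number)
        return True

    def handle_batch(self, records: list) -> int:
        """Process a batch and return how many records were new."""
        return sum(self.handle(record) for record in records)
```

Replaying a batch that overlaps with one already processed leaves `archived` unchanged for the overlapping records, which is what makes a multi-day replay buffer safe to use.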

Anomaly Detection

Another piece of software that we built looks for irregularities in the logs flowing through Kinesis. It has three levels of alerts when it detects something:

Warnings

Warnings appear in our infrastructure Slack channel and can be passively observed by the team to get context. An example would be someone attempting to brute force passwords (and running into our rate limiting).

Errors

Errors trigger PagerDuty so that someone gets woken up even if it is the middle of the night. These represent more serious issues that require immediate attention. An example would be an unusual movement of funds.

Critical issues

These can trigger a kill switch we have set up that gracefully shuts down critical services (including outgoing payment processing). Kill switches require their own key signing ceremonies to re-enable them. An example of a critical issue would be unauthorized access to certain machines or services.
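The three alert levels above amount to a routing table from severity to action. Here is a small Python sketch of that shape. It is not our anomaly detection software: `Severity`, `AlertRouter`, and the three callbacks are hypothetical stand-ins for a Slack post, a PagerDuty page, and a kill switch.

```python
from enum import Enum
from typing import Callable


class Severity(Enum):
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


class AlertRouter:
    """Sends each alert to the action that matches its severity."""

    def __init__(
        self,
        post_to_chat: Callable[[str], None],
        page_on_call: Callable[[str], None],
        trip_kill_switch: Callable[[str], None],
    ) -> None:
        # Assumption for this sketch: exactly one action per level.
        self._handlers = {
            Severity.WARNING: post_to_chat,       # passively observed by the team
            Severity.ERROR: page_on_call,         # wakes someone up
            Severity.CRITICAL: trip_kill_switch,  # shuts down critical services
        }

    def dispatch(self, severity: Severity, message: str) -> None:
        """Invoke the handler registered for this severity."""
        self._handlers[severity](message)
```

Passing the actions in as callbacks keeps the routing logic testable without touching any real notification service.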

Consensus Based Deploys

Earlier I mentioned that developers shouldn’t need production access to do their regular work. One of the most common tasks in development is deploying new code.

To solve this, we’ve developed a number of tools in our infrastructure around the idea of consensus. We believe in a 3 phase process where anyone should be able to propose any change, but consensus should be achieved before that proposal can be applied. One of our tools enabling this is called Sauron, which comments on every pull request. It requires approvals on pull requests (+1’s from other developers) before code can be deployed into production.

This particular branch requires a +1 from two developers other than the author. But more sensitive services in our SOA can require even more approvers of special types. During periods of increased risk, such as if you suspect someone’s laptop has been compromised, you can dial the number of +1’s up as high as is needed system wide (without blocking all deploys). This protects against the case where one or more developers get malware on their laptop.
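The approval rule itself is simple to state in code. The following Python sketch checks whether a change has enough +1's from distinct developers other than its author. It is an illustration of the rule as described, not Sauron's actual logic, and `can_deploy` and its parameters are hypothetical names.

```python
def can_deploy(author: str, approvers: list[str], required: int) -> bool:
    """Return True if enough distinct non-author developers approved.

    `required` is the configurable threshold, which could be raised
    system wide during periods of increased risk.
    """
    # Self-approvals and repeated +1's from the same person do not count.
    distinct_approvers = {name for name in approvers if name != author}
    return len(distinct_approvers) >= required
```

For example, `can_deploy("alice", ["bob", "carol"], required=2)` passes, while a change approved only by its own author and one other developer does not.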

We also use this idea of consensus based changes when updating our environment (such as the docker-compose files we use to launch our services). Anyone can propose a change, but no single person can push a change to production.

We’ve been running entirely on Docker in production for over a year now. Before the influx of new deployment tools arrived, and looking to embrace a consensus approach to deployment, we started building our own tool, CodeFlow. CodeFlow gives each developer the power to deploy their code by combining a Dockerfile, a Docker Compose file, and environment variables to deploy 12-factor applications. We could probably write a whole blog post on this single tool, but its goal is to combine consensus based deployment with the developer productivity and happiness of Heroku.

Conclusion

This post just scratches the surface of what it takes to build secure/paranoid infrastructure in the cloud (as you can see there are a lot of moving parts!). And there are plenty of topics we didn’t have time to cover here, including:

red team drills

bug bounty programs

pen tests with outside firms

working with vendors who store PII

incident response

educating new developers who join the team

Interested In Learning More?

Although we’ve come a long way, we still have much more to do as the most popular place to store digital currency.

If you’re interested in working with the awesome team behind some of these tools, we’d love to speak with you. We’re hiring remote and in house dev-ops and generalist engineers (you can work at our office in San Francisco, or remotely).

If you are a devops engineer or generalist engineer, we’ve set up a coding challenge that you can take in about 45 minutes if you’d like to show us your skills.

You can also apply through our careers site, or send any questions you might have to talent at coinbase dot com. We have flexible work hours and vacations, very competitive compensation (both in cash and equity), and fly everyone out to HQ in San Francisco once a quarter.