Presentation from AWS re:Invent 2018

The Infrastructure Team at Coinbase has the goal of enabling any engineer in the company to quickly and securely access and deploy complex infrastructure. This effort started with our secure deployment pipeline, Codeflow, was extended by our codification tooling, GeoEngineer, and is now used by our blockchain infrastructure project, Snapchain.

Recently at AWS re:Invent 2018 we spoke about how we build our Blockchain Infrastructure with Snapchain (video above). This post talks about why and how we built Snapchain.

Coinbase has some unique security and infrastructure requirements. One of these requirements is that every server in our infrastructure is both ephemeral (< 30 days) and immutable. The deploy process for most applications is fairly straightforward: 12-factor apps are blue/green deployed behind a load balancer. This process becomes a lot more challenging when considering blockchain nodes.

Blockchain nodes detect, validate and relay state updates across the network — they are our eyes & ears into the various cryptocurrency networks we support. When someone sends funds into Coinbase we detect that transaction by listening to a node, and when someone sends funds out of Coinbase we broadcast that transaction through one of our nodes. As such, being able to effectively manage blockchain nodes is critical to our core business operations.

At a high level, the anatomy of a blockchain node deployment is as follows.

We start with a single EC2 Instance. Once the instance is up and ready, we can start the node binary.

Empty (new) node spin-up on EC2.

Once the node is up, it reaches out to other nodes on the network.

The new node reaches out to the network and finds other nodes to peer with.

The other nodes on the network have fully synced copies of the chain and will start sending blocks down the wire.

These nodes hold copies of the particular chain and, assuming they are fully synced, will start sending blocks down the wire.

The first block is transmitted and the new node validates it.

The new node gets one block and validates it.

Further blocks are sent to the new node.

Then the node gets another block and so forth.

Eventually all of the blocks have been transmitted and validated, and the new node is fully in sync.

This continues until all of the blocks have been transmitted and validated and the node is fully in sync.
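The sync flow above can be sketched as a small loop. This is a toy model rather than any real node's protocol: the `Block` type, the SHA-256 linking scheme, and the function names are assumptions made for illustration, and real chains validate far more than the previous-block link.

```python
import hashlib
from dataclasses import dataclass

GENESIS_PREV = "0" * 64  # placeholder "previous hash" for the first block


@dataclass(frozen=True)
class Block:
    """Hypothetical, heavily simplified block."""
    height: int
    prev_hash: str
    payload: str

    @property
    def block_hash(self) -> str:
        data = f"{self.height}:{self.prev_hash}:{self.payload}".encode()
        return hashlib.sha256(data).hexdigest()


def build_chain(payloads):
    """Build a toy chain, standing in for what a fully synced peer holds."""
    chain, prev = [], GENESIS_PREV
    for height, payload in enumerate(payloads):
        block = Block(height, prev, payload)
        chain.append(block)
        prev = block.block_hash
    return chain


def sync_from_peer(peer_chain, local_chain=None):
    """Receive blocks one at a time, validating each before storing it."""
    local = list(local_chain or [])
    for block in peer_chain[len(local):]:
        expected_prev = local[-1].block_hash if local else GENESIS_PREV
        if block.height != len(local) or block.prev_hash != expected_prev:
            raise ValueError(f"invalid block at height {block.height}")
        local.append(block)
    return local
```

Calling `sync_from_peer(peer)` with no local chain models an empty node syncing from scratch, while passing a non-empty `local_chain` models a node that starts from an existing copy and only has to fetch the remaining blocks.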

Full nodes typically maintain a full copy of the blockchain on disk. Often, this amounts to hundreds of gigabytes that need to be synced over the network. Depending on the particular chain and implementation, a full sync can take weeks! Given the pace of development in the cryptocurrency ecosystem, the safety and reliability constraints of our infrastructure, and the occasional urgency of node upgrades, a full sync from the network on every deploy does not let us move both safely and quickly. As a result we designed a new blockchain node backup and deployment system called Snapchain.

Disk usage on a geth node on EC2 showing 952GB of usage.

Snapchain launches two types of blockchain nodes — snapshot nodes fully sync the chain and produce copies in the form of EBS volumes, and long-lived nodes use these EBS volumes to finish deploying in minutes instead of days. Snapchain gives us the flexibility to quickly deploy blockchain nodes as frequently as we like in response to version upgrades, events in the network, or to develop against a new type of configuration.
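As a rough illustration of the long-lived node side, the sketch below picks the newest completed EBS snapshot for a chain and creates a volume from it. It is not Snapchain's actual code: the post only says that snapshot nodes produce copies as EBS volumes, so the use of EBS snapshots, the `Chain` tag, the `gp3` volume type, and the function names are all assumptions. The `ec2` argument is expected to behave like a boto3 EC2 client (`describe_snapshots`, `create_volume`); a small in-memory stub is included so the example runs without AWS access.

```python
from datetime import datetime, timezone


def latest_completed_snapshot(ec2, chain):
    """Return the newest completed snapshot tagged for `chain`, or None."""
    # Note: a real account may need pagination; this reads a single page.
    response = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "tag:Chain", "Values": [chain]},  # hypothetical tag key
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = response["Snapshots"]
    return max(snapshots, key=lambda s: s["StartTime"]) if snapshots else None


def volume_from_latest_snapshot(ec2, chain, availability_zone):
    """Create a data volume for a new node from the newest snapshot."""
    snapshot = latest_completed_snapshot(ec2, chain)
    if snapshot is None:
        raise LookupError(f"no completed snapshot for chain {chain!r}")
    volume = ec2.create_volume(
        SnapshotId=snapshot["SnapshotId"],
        AvailabilityZone=availability_zone,
        VolumeType="gp3",  # assumed volume type
    )
    return volume["VolumeId"]


class StubEC2:
    """In-memory stand-in for an EC2 client; it ignores the filters."""

    def __init__(self, snapshots):
        self.snapshots = snapshots
        self.created = []

    def describe_snapshots(self, OwnerIds, Filters):
        return {"Snapshots": list(self.snapshots)}

    def create_volume(self, SnapshotId, AvailabilityZone, VolumeType):
        self.created.append((SnapshotId, AvailabilityZone))
        return {"VolumeId": f"vol-from-{SnapshotId}"}


stub = StubEC2([
    {"SnapshotId": "snap-old", "StartTime": datetime(2018, 11, 1, tzinfo=timezone.utc)},
    {"SnapshotId": "snap-new", "StartTime": datetime(2018, 11, 20, tzinfo=timezone.utc)},
])
volume_id = volume_from_latest_snapshot(stub, "bitcoin", "us-east-1a")
```

Against real infrastructure, the stub would be replaced by `boto3.client("ec2")`. The node then only needs to sync the blocks produced since the snapshot was taken.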

If you are interested in the details of how these two types of configurations work, you can watch the video (starts at 22:08).

The Snapchain deployment on AWS uses Network Load Balancers (NLBs) to scale to meet the needs of our engineers. These NLBs give us static IPs, let us scale to meet demand, and allow us to perform blue/green deploys.

A typical deployment from one version to another (example shown: 1.0.0 to 1.1.0) works as follows.

A node sits behind the NLB with the chain’s binary, our API to interface with the node, and a data volume with the synced chain.

Current active deployment with a single node behind an NLB.

In reality there are likely multiple nodes, placed in multiple Availability Zones (AZs) for redundancy.

Current active deployment with multiple nodes behind an NLB.

New instances are launched and follow the flow described previously under the anatomy of a blockchain deployment.

New instance launched and syncing.

Once the new instance is ready and healthy it is added to the NLB.

The new instance is now behind the NLB.

Once a set of checks passes, the new version is ready to be cut over, and using an NLB makes this easy.

Checks passed and the NLB is cut over to the new version.

In the final state of the deployment, the new instance sits behind the NLB, just as the old one did when we started.

Final state of deployment showing the new version.
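The deploy walkthrough above can be modeled in a few lines. This is a toy in-memory model, not the AWS API or Snapchain's implementation: `Node`, `LoadBalancer`, `blue_green_deploy`, and the shape of the checks are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    instance_id: str
    version: str
    synced: bool = False


@dataclass
class LoadBalancer:
    """Toy stand-in for an NLB: registered nodes plus the version taking traffic."""
    active_version: str
    nodes: list = field(default_factory=list)

    def serving(self):
        return [n.instance_id for n in self.nodes if n.version == self.active_version]


def blue_green_deploy(lb, new_nodes, checks):
    """Register new nodes, run checks, then cut over or roll back."""
    if not new_nodes:
        raise ValueError("no new nodes to deploy")
    if not all(node.synced for node in new_nodes):
        raise RuntimeError("new nodes must be synced and healthy before registration")

    # 1. Add the new instances behind the load balancer; old nodes keep serving.
    lb.nodes.extend(new_nodes)

    # 2. Run every check against every new node.
    if not all(check(node) for node in new_nodes for check in checks):
        # Roll back: deregister the new nodes and leave the old version active.
        lb.nodes = [n for n in lb.nodes if n not in new_nodes]
        return False

    # 3. Cut traffic over to the new version, then retire the old nodes.
    new_version = new_nodes[0].version
    lb.active_version = new_version
    lb.nodes = [n for n in lb.nodes if n.version == new_version]
    return True
```

For example, deploying a single synced `1.1.0` node over a `1.0.0` node with passing checks leaves only the new node behind the load balancer, while a failing check leaves the old version serving untouched.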

In addition, Snapchain is blockchain-agnostic infrastructure. One of the primary design goals was to minimize the amount of blockchain-specific code and infrastructure. Doing this allows us to add new blockchain nodes much more quickly.

Showing bitcoind being deployed in 13 minutes.
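One way to keep blockchain-specific code small is to reduce each chain to a short configuration entry that generic tooling consumes. The sketch below is an assumption about what such a structure could look like, not Snapchain's actual configuration format; `ChainConfig`, `CHAINS`, and `node_command` are hypothetical names. The only chain-specific details are the binary name and how it is told where its data directory lives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainConfig:
    """Everything the generic tooling needs to know about one chain."""
    binary: str
    datadir_args: tuple  # argument template; "{path}" is filled in at launch
    extra_args: tuple = ()


CHAINS = {
    "bitcoin": ChainConfig(binary="bitcoind", datadir_args=("-datadir={path}",)),
    "ethereum": ChainConfig(binary="geth", datadir_args=("--datadir", "{path}")),
}


def node_command(chain, data_path):
    """Build the launch command for a chain's node from its config entry."""
    cfg = CHAINS[chain]
    datadir = [arg.format(path=data_path) for arg in cfg.datadir_args]
    return [cfg.binary, *datadir, *cfg.extra_args]
```

With this shape, supporting another chain means adding one entry to `CHAINS` rather than writing new deployment logic.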

Are you interested in working on Snapchain and products like it? We’re hiring like crazy! Whether you’re just getting interested in cryptocurrency or are a seasoned blockchain developer, check out coinbase.com/careers to see if something sparks your interest. We have several openings, including a position for a Senior Infrastructure Engineer. We’ve only scratched the surface of what digital currencies can do. Come help us build an open financial system.

Unless otherwise noted, all images provided herein are by Coinbase.