Keystone Metrics in DevOps: The 30 Day Project @ Coinbase

Alcoa Keystone

In his book The Power of Habit, Charles Duhigg describes how Paul O’Neill, the CEO of Alcoa (Aluminum Company of America), was able to increase his company’s value by $27 billion by focusing on a goal unrelated to the main company objectives: no workplace injuries. This is a keystone metric (habit): a broad goal for the entire company that is not directly related to the company’s main goal but is impactful and actionable. O’Neill chose safety as the single most important aspect because:

Most organizations say “our human beings are the most important asset”, but in most places there is no proof that it is really true. It is just something you say […] It’s Alcoa’s objective that people who work for Alcoa will never be hurt at work […] Safety is not a priority, it is a precondition! [cite]

“We can’t afford it” and “it is too difficult, it will stop people from being able to do their work” were the reasons people gave for why Alcoa could never reach this goal. O’Neill persisted. He began to hold executive meetings on every workplace death of an employee, because he saw it as a personal responsibility to identify and fix anything that could lead to an employee being hurt. He also told his executive and management staff that it was their responsibility if an injury occurred under their watch. This led everyone in Alcoa to take the goal seriously, which in turn opened up communication channels for any employee to quickly escalate safety issues to higher management.

Trying to reach this keystone metric resulted in many benefits for the company: the injury rate dropped from 1.86 to 0.125 per year, the new communication channels were used to increase productivity and visibility from the ground up to the executives, and bad managers were quickly identified and removed if they could not follow the safety procedures and guidelines. Everyone at the company was working towards the same goal, and succeeding.

DevOps Keystone

One of the reasons I joined the DevOps team at Coinbase was the “30 day project”, a keystone metric to never have a server older than 30 days. I thought this was an ambitious (maybe impossible) goal, but I wanted to be part of a team with that kind of vision.

The age of the servers Coinbase runs on does not directly impact the performance of the company or team, and other teams might say “we can’t afford it” or “it is too difficult, and we would end up not doing more important work”. However, we have seen many benefits from trying to reach the 30 day goal:

Everything is Redeployable: we have a process for redeploying and/or upgrading every server we have running.

No Shadow Infrastructure: we are always looking for unknown servers, finding out what their purpose is and whether they are necessary, and ensuring we understand how to manage and upgrade them.

Reduce Code Rot: by rebuilding and redeploying services we ensure that all dependencies still resolve and are up to date.

Revisiting Decisions: past decisions about architecture, storage, and deployments are reevaluated to see if we can redo them with better outcomes.

Reduce Security Response Time: if a security vulnerability like OpenSSL’s Heartbleed is disclosed, we know with certainty that we can quickly upgrade and redeploy our entire infrastructure.

Sharing Knowledge: the high frequency of deploys means that multiple people need to learn how to redeploy a service. As a result, the process becomes more repeatable, less painful, safer, and better documented with every iteration.
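To make the metric concrete, here is a minimal sketch of how a "no server older than 30 days" check could be expressed. It is an illustration only, not the tooling we actually use: the Server record, the stale_servers function, and the sample fleet are hypothetical names invented for this example, and in practice the launch times would come from whatever inventory or cloud API you rely on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The keystone metric: no server should be older than this.
MAX_AGE = timedelta(days=30)


@dataclass(frozen=True)
class Server:
    """Hypothetical inventory record; real data would come from your own tooling."""
    name: str
    launched_at: datetime  # timezone-aware, UTC


def stale_servers(servers, now=None, max_age=MAX_AGE):
    """Return servers whose age exceeds max_age, oldest first."""
    if now is None:
        now = datetime.now(timezone.utc)
    stale = [s for s in servers if now - s.launched_at > max_age]
    return sorted(stale, key=lambda s: s.launched_at)


if __name__ == "__main__":
    # Made-up fleet with a fixed "now" so the example is reproducible.
    now = datetime(2020, 1, 31, tzinfo=timezone.utc)
    fleet = [
        Server("api-1", now - timedelta(days=45)),
        Server("worker-1", now - timedelta(days=3)),
        Server("db-proxy-1", now - timedelta(days=31)),
    ]
    for server in stale_servers(fleet, now=now):
        age_days = (now - server.launched_at).days
        print(f"{server.name} is {age_days} days old and needs a redeploy")
```

Running a check like this on a schedule turns the goal into something visible: any server it reports is either a redeploy waiting to happen or a piece of shadow infrastructure nobody knew about.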

As a result of the hard work we put into the 30 day project, we were able to organize an event called “Scorched Earth”, in which we set out to rebuild the base operating system image (AMI) and every Docker container, then redeploy every server in our infrastructure, all in under 24 hours. We succeeded with 30 minutes to spare, meaning that all of Coinbase was running on servers less than a day old! We did this with zero downtime and zero errors, while the rest of the organization worked without disruption.

In the future we will continue to maintain our 30 day keystone metric not because it is a priority, but because it is a precondition of our team.