Building and deploying web applications is more complex than it has ever been. We have transpilation, automated tests, continuous integration, continuous deployment, automated database migrations, and more. Unfortunately, the tools we use often don’t fit together nicely. To bridge the gaps, we write scripts and develop internal tools to simplify and automate manual steps.

The interfaces for our tools can be brittle and unforgiving, but we justify this with naïve assumptions. We presume a tool will only ever be used in our development environment, that it will only be used once, or that it will be refactored before it makes it into production.

When things go wrong

You don’t have to look far to find examples of inadequate internal tools betraying their creators. Here are a few case studies and the steps that were implemented to prevent them from failing in the future.

Amazon S3

From 9:37 AM PST February 28, 2017, until 1:18 PM the same day, Amazon S3 experienced a major service disruption. You probably remember it because Slack was down and your clients were calling about their broken websites.

The outage was caused by a team member mistyping a commonly run command to shut down several servers. The command shut down more servers than intended, at a faster rate than Amazon’s failsafes could handle. Amazon is prepared for situations like this, but not at this magnitude. The limited remaining capacity of S3 was unable to service incoming requests and needed to be restarted, which led to an extended outage.

The command that caused the outage didn’t have safeguards on the maximum number of servers that could be shut down, nor on the maximum rate they could be removed. The solution was clear: improve the user experience of their internal tool to be more forgiving to mistyped commands.

“We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level.”

Amazon modified the tool to validate its input before acting on it, much like user input is sanitized before it is added to a SQL statement. You would never unconditionally trust user input in a web application, and you should never unconditionally trust input to your internal tools either.

Source: Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region
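The kind of safeguard Amazon describes can be sketched in a few lines. This is only an illustration, not Amazon's tool: the `Subsystem` type, the `check_removal` function, and the `max_per_run` limit are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


class UnsafeRemovalError(ValueError):
    """Raised when a removal request would violate a safety limit."""


@dataclass(frozen=True)
class Subsystem:
    name: str
    active_servers: int
    min_required: int  # hypothetical minimum capacity for this subsystem


def check_removal(subsystem: Subsystem, requested: int, max_per_run: int = 5) -> int:
    """Return the requested count if it is safe to remove, otherwise raise."""
    if requested <= 0:
        raise UnsafeRemovalError("requested must be a positive integer")
    # Cap how much capacity a single invocation may remove.
    if requested > max_per_run:
        raise UnsafeRemovalError(
            f"refusing to remove {requested} servers at once (limit {max_per_run})"
        )
    # Never let the subsystem fall below its minimum required capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise UnsafeRemovalError(
            f"{subsystem.name} would drop to {remaining} servers, "
            f"below its minimum of {subsystem.min_required}"
        )
    return requested
```

With a check like this in front of the destructive step, a mistyped `50` instead of `5` produces an error message rather than an outage.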

GitLab.com

On January 31st, 2017, an authorized GitLab employee was trying to solve a performance issue on their production PostgreSQL database. Shortly after executing a command, this individual realized that they had inadvertently deleted a large amount of data from the primary production database.

The result was an outage that started at 6 PM UTC and lasted for 24 hours. The permanent damage included six hours of data loss, which affected more than 5000 hosted projects on the site.

“Trying to restore the replication process, an engineer proceeds to wipe the PostgreSQL database directory, errantly thinking they were doing so on the secondary. Unfortunately this process was executed on the primary instead. The engineer terminated the process a second or two after noticing their mistake, but at this point around 300 GB of data had already been removed.”

Outages are often a combination of compounded issues and this case is no different. Inadequate backups and insufficient documentation contributed to this outage, but it was provoked by an employee running the right command on the wrong database server.

GitLab took many steps to prevent this sort of issue from happening again. One of them was to change the prompt on each server so that it is clear which host and environment an engineer is connected to. Keeping your environments and hosts distinct is essential to preventing issues such as this one. Colour coding, different background colours, and unambiguous naming can all help to keep them separate.

Source: Postmortem of database outage of January 31
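As a rough sketch of the colour-coding idea, the snippet below builds a prompt string with a loud, colour-coded environment label. It is not GitLab's configuration: the `ENV_COLOURS` mapping, the hostname scheme (`db2.staging.example.com`), and both function names are assumptions made for illustration.

```python
# Hypothetical mapping from environment name to an ANSI background colour.
ENV_COLOURS = {
    "production": "\033[41m",   # red background
    "staging": "\033[43m",      # yellow background
    "development": "\033[42m",  # green background
}
RESET = "\033[0m"


def environment_of(hostname: str) -> str:
    """Guess the environment from an assumed naming scheme such as
    'db2.staging.example.com'."""
    for env in ENV_COLOURS:
        if f".{env}." in hostname:
            return env
    # Treat anything unrecognised as production so the loudest colour is the default.
    return "production"


def build_prompt(user: str, hostname: str) -> str:
    """Return a prompt that starts with a colour-coded environment label."""
    env = environment_of(hostname)
    label = f" {env.upper()} "
    return f"{ENV_COLOURS[env]}{label}{RESET} {user}@{hostname}$ "


if __name__ == "__main__":
    print(build_prompt("deploy", "db2.staging.example.com"))
```

Defaulting unknown hosts to the production colour is a deliberate choice in this sketch: a false warning costs a moment of caution, while a missing warning can cost a database.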

Hawaii false missile alert

At 8:07 AM HST on January 13, 2018, an emergency alert was broadcast over television, radio, and cell phones.

BALLISTIC MISSILE THREAT INBOUND TO HAWAII. SEEK IMMEDIATE SHELTER. THIS IS NOT A DRILL.

Thirty-eight minutes later, a second alert was sent indicating that the initial alert had been a false alarm and that there was no threat. The first alert was sent by mistake: a Hawaii Emergency Management Agency employee was attempting to send a test version of the same message. For reference, the interface that the employee was interacting with can be seen below.

The user interface for triggering emergency alerts. Source: CivilBeat on Twitter

The employee clicked “PACOM (CDW) - STATE ONLY” when they intended to click “DRILL - PACOM (CDW) - STATE ONLY”. With the two options named so similarly and listed together, you can appreciate how someone could send the wrong message.

The Hawaii Emergency Management Agency hasn’t made a statement indicating which user interface changes have been made to prevent a similar incident from occurring in the future. There are some obvious improvements that could be made, such as separating test and real alerts into distinct colour-coded sections and requiring a confirmation before an alert is sent, to help prevent clicks on the wrong item.

Source: Wikipedia article for 2018 Hawaii false missile alert
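The confirmation step suggested above could look something like the following sketch. It does not describe the agency's actual system; the `confirm_send` function, the confirmation phrase, and the injectable `ask` parameter (used here so the prompt can be tested) are all hypothetical.

```python
from typing import Callable


def confirm_send(template_name: str, is_drill: bool,
                 ask: Callable[[str], str] = input) -> bool:
    """Return True only if the operator has explicitly confirmed the alert."""
    if is_drill:
        # Drills go out after a simple yes/no confirmation.
        answer = ask(f"Send DRILL '{template_name}'? [y/N] ")
        return answer.strip().lower() == "y"

    # Live alerts require typing a distinct phrase in full, so a habitual
    # "y" or a stray click cannot send one.
    phrase = "SEND LIVE ALERT"
    answer = ask(
        f"'{template_name}' is a LIVE alert, not a drill.\n"
        f"Type {phrase!r} to send it: "
    )
    return answer.strip() == phrase
```

Making the live path deliberately harder than the drill path means the riskier action can never be completed with the same muscle memory as the routine one.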

Internal tool UX guidelines

Each of these case studies presents a unique combination of problems and there is no silver bullet for fixing them. One issue they have in common is poor user interfaces for their internal tools.

Good UX is important, even for things like scripts. Unfortunately a lot of tech people take pride in working with hard-to-use and error-prone tools. — Kevan, Hacker News

When developing these tools, it’s often possible to predict which risks exist. It’s important to highlight them for the user, even if you expect the only user to be you. Below are a few guidelines to help you avoid similar mistakes.