Production Clojure Checklist

This is a blog about the development of Yeller, the Exception Tracker with Answers. Read more about Yeller here

Running an application in production is challenging. There are so many things to take care of, and it’s easy to miss a little thing that will come back to bite you later. Checklists are an obvious and easy fix for this, and they’re used well by Heroku, GitHub, and many other well-known companies.

This is just a starting point - you should adapt it for your own services.

First, a brief overview:

Use bcrypt for password storage

Servers are in UTC

HTTPS only

Application errors are sent to an exception tracker

Alerts a human if it’s down

Web serving is redundant

Database Migrations are Automated

Configuration is sensible

Deploys are automated via a well documented script

Deploys just copy an uberjar to the servers and restart them

The database is backed up regularly

Database backups are test-restored regularly

High traffic pages are behind a CDN

Code is visible on GitHub

Credentials are available

Application has a health check

There’s an Ops Playbook

High Fidelity Staging Environment

The rest of the team knows the service exists

Transactional email handled via an external service

API requests via a separate domain

This seems like a long list. It is! Running apps in production is difficult.

Here are some more details on each entry:

The Basics

All of these are non-negotiable if you’re running a production service whose code impacts humans:

use bcrypt if you store passwords (read more)

servers should be in UTC (read more)

HTTPS only (HTTP requests just redirect to HTTPS)

Application errors are tracked using an exception tracker (read more)

JavaScript errors are tracked using an exception tracker (read more)

It’s 2015.
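For the bcrypt item, buddy-hashers is one common Clojure option - a sketch, and the library choice here is mine, not necessarily what any particular app uses:

```clojure
;; Sketch using buddy-hashers (add [buddy/buddy-hashers "..."] to your
;; dependencies). derive salts and hashes the password; check verifies
;; an attempt against the stored hash - you never store the plaintext.
(require '[buddy.hashers :as hashers])

(def stored (hashers/derive "s3cret"))

(hashers/check "s3cret" stored)   ; => true
(hashers/check "wrong!" stored)   ; => false
```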

Alerts a human if it’s down (I like Pingdom and PagerDuty for this)

You’d be surprised how many applications are deployed in production that don’t alert humans when they’re down. This one is specifically about not making your customers mad - I for one would much rather be woken up at 3am with a page than wake up at 10am to thousands of angry customer emails.

Web serving is redundant (at least two processes, so you can deploy without downtime)

Again, super obvious - deploys that cause downtime aren’t good enough at all.

Database Migrations are Automated

I like conformity for Datomic, but similar things exist for SQL databases. Use them. Migrations should also be kept in the same repo as the app.
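For Datomic, a conformity setup might look like the sketch below. The migration keyword and the attribute are made up for illustration; conformity records which norms have already run, so each migration transacts exactly once.

```clojure
;; Sketch of a conformity norms map. The migration name and attribute
;; are hypothetical; adapt for your own schema.
(def norms
  {:myapp/add-user-email
   {:txes [[{:db/ident       :user/email
             :db/valueType   :db.type/string
             :db/cardinality :db.cardinality/one
             :db/unique      :db.unique/identity}]]}})

;; At application startup, with a real Datomic connection `conn`:
;;   (require '[io.rkn.conformity :as conformity])
;;   (conformity/ensure-conforms conn norms)
```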

Sensible Configuration

I like both the 12-factor approach and Dropwizard’s single static file approach
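A minimal 12-factor-style sketch in Clojure - read everything from the environment, with explicit local-development defaults (the variable names here are illustrative):

```clojure
;; Sketch: configuration from environment variables with defaults.
(defn getenv
  "Look up an environment variable, falling back to a default."
  [k default]
  (or (System/getenv k) default))

(def config
  {:port     (Integer/parseInt (getenv "PORT" "3000"))
   :db-uri   (getenv "DATABASE_URL" "datomic:free://localhost:4334/dev")
   :base-url (getenv "BASE_URL" "http://localhost:3000")})
```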

Deploys happen via a well documented script and are automated

I like Fabric

Deploys just copy an uberjar to the servers and restart processes

Read Phil Hagelberg on this. This is very standard these days.

The database is backed up regularly

Regular backups turn a catastrophic event - “oh shit, we lost the DC” - from a company-destroying event into a “we were down for a bit” event.

(I like tarsnap)
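A sketch of what that might look like, assuming a Postgres database and tarsnap’s tar-style archive command (the schedule, paths, and database name are all placeholders):

```
# crontab fragment (sketch): nightly dump, then a dated tarsnap archive.
# (% must be backslash-escaped inside crontab entries)
0 3 * * * pg_dump yourdb > /var/backups/yourdb.sql && tarsnap -c -f "yourdb-$(date +\%Y\%m\%d)" /var/backups/yourdb.sql
```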

Restores of the production backups happen regularly

If your database isn’t being restored regularly, then you have no idea if the backups are working correctly. You don’t need to be able to make a backup in the case of disaster. You need to be able to make a restore in the case of disaster.

Read More

High Traffic static pages are behind a CDN

Lots of traffic can quite happily break many a Clojure app (especially if it’s sudden and/or unexpected). Putting static pages like your blog and homepage behind a CDN protects you from this in the future, and takes 5 minutes to do. Plus it means those pages will load super fast.

I like Fastly

Code is visible on GitHub

This sounds super silly - who would ever deploy an application or service without making the code available to other team members?

You’d be surprised.

Any credentials are readily available to all team members

SSH access (if applicable), third-party service logins, internal admin accounts, and Heroku access should all be available to any developer on the team. This one’s obvious, but often not well followed: quite a few times I’ve seen team members unable to fix a broken app because they didn’t have access to what they needed, and the CTO was on vacation and unreachable.

Some common credentials you might have in play:

SSH access

third party service logins

internal admin accounts

Heroku access

The application has a health check

A health check is (typically) an HTTP route, hit by the load balancer and/or third-party monitoring services, that returns 200 OK if the service and all of its dependencies (database connections, required third-party services, etc.) are working, and 500 - alerting somebody - if the service or any of its dependencies aren’t.
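Sketched as a Ring-style handler - `check-db!` and `check-mailer!` here are hypothetical placeholders for cheap probes of your real dependencies (e.g. a `SELECT 1` against the database):

```clojure
;; Hypothetical dependency probes: replace with real, cheap checks.
(defn check-db! [] true)
(defn check-mailer! [] true)

(defn healthcheck
  "Ring handler: 200 when every dependency probe succeeds, 500 otherwise.
  The 500 is what the load balancer and monitoring alert on."
  [_request]
  (let [results {:db     (try (check-db!)     (catch Exception _ false))
                 :mailer (try (check-mailer!) (catch Exception _ false))}]
    (if (every? true? (vals results))
      {:status 200 :body "OK"}
      {:status 500 :body (pr-str results)})))
```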

There’s an ops playbook

Ops playbooks detail what to do in the case of alerts. A super simple playbook for a basic Clojure web app that talks to a database might look something like this:

If the site is down, here are some places to start looking:

Was there a recent deploy? If so, consider rolling it back.

What kind of errors is the site spitting out? 500 errors? Check the error tracker. 504 errors? Check the web server logs at /var/log/nginx/error.log.

Is the database overloaded? Log into DATABASE SERVER and check the CPU usage and iowait times.

Don’t be shy about restarting services if they’ve fallen over. To restart the webserver: sudo /etc/init.d/yourapp restart

If you can’t diagnose quickly, ask for help: call CTO on XXX or OPS PERSON on XXX.

If it appears to be a networking issue (i.e. you can’t even ping the servers), look at our hosting provider’s status page: http://HOSTINGSERVICESTATUS.com

The application has a high fidelity staging environment, and larger changes are always tested on staging first

If you’re deploying a thing in production, you need a way to test significant ops changes without taking production down, losing data, and so on. Staging is the best way to do that.

The rest of the team knows the application exists (send an email)

You’d be surprised how many times I’ve heard of services going into production that nobody knew existed - until they broke, and folks had to learn of their existence in a hurry.

Uses an external service for transactional email

Sending email is hard. Handling spam reports, delivering properly with retries, and so on is really painful. Use an external service like Mailgun, Mandrill, or SendGrid.

API requests go via a separate domain (typically api.YOURDOMAIN.com)

This might sound silly, but you really want this from the start: it means you can break the API out into a separate service at your leisure, without breaking backwards compatibility.
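A minimal nginx sketch of the idea (hostnames and ports are placeholders): today api.yourdomain.com proxies to the same app; later the upstream can point at a separate service without clients noticing.

```
server {
    listen 443 ssl;
    server_name api.yourdomain.com;

    location / {
        # same clojure app for now; swap this upstream later to
        # split the API into its own service
        proxy_pass http://127.0.0.1:3000;
    }
}
```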

Next Steps

What things is your app missing from this list? What things do you think are important that are missing from it? Hit me up on twitter and let me know: twitter.com/t_crayford

Further Reading

Thoughtbot’s Playbook has a checklist section that covers many of these points for Rails apps

Noah Zoschke gave a fantastic talk about Heroku Operations which briefly touched on production checklists. It was the original inspiration for this post.
