I broke Discourse yesterday.

Not in the “it’s just a little bit broken” sense. Instead, in the “absolutely everything is broken and every page returns a 500 error code” sense.

We already do so much to ensure Discourse always works, in the true Eric Ries lean startup, Rails best practice way.

We have 1800 specs that exercise a huge portion of the Ruby code.

We have 80 or so tests that test some of the JavaScript (we need more, lots more).

We constantly deploy to our test servers as soon as a commit happens. Only after this succeeds will we allow deploys to production.

Once a deploy to test happens we get a group chat message informing us it succeeded with a clear link to click (that takes us to our staging environment).

Nonetheless, somehow I managed to mess it all up and deploy a junk build.

### What happened?

The Rails asset pipeline is a bit of a Rubik’s Cube. For example, if in your master layout you have `<%= asset_path "my_asset" %>` it may work fine in your dev and test environments. However, if you forgot to set a magic switch to pre-compile the asset, well… everything breaks in production. In my case, I had it all mostly wired up and working in dev, but a missing `.js` extension meant that I just was not close enough.

I clicked build, everything passed, and I forgot to check the test site.

This is my fault, it’s 100% my fault.

Often when we hit these kinds of issues, as developers, we love assigning blame. Blame points my way … don’t do it again … move along, nothing more to see.

That is not good enough.

What is good enough?

I would like to follow a simple practice at Discourse. If you break production for any reason, we should make sure an automated system catches that kind of break next time around.

If you break production, the only thing you are allowed to work on should be the system that stops that kind of break in future.

What kind of system can avoid a totally broken build from getting out there?

The trivial thing to do is simply make an HTTP request to staging (our non-customer-facing production clone) and ensure it comes back with a 200 code. Trivial to add.
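As a rough illustration of that trivial check, here is a minimal sketch written for Node.js rather than PhantomJS. The helper names `isHealthy` and `checkStatus` are made up for this example, and the URL is whatever you pass on the command line; none of this is the code we actually run.

```javascript
var http = require('http');
var https = require('https');

// Only a 200 counts as healthy; the broken build returned a 500 on every page.
function isHealthy(statusCode) {
  return statusCode === 200;
}

// Hypothetical helper: GET the URL and hand the status code to the callback.
function checkStatus(url, callback) {
  var client = url.indexOf('https:') === 0 ? https : http;
  client.get(url, function(res) {
    res.resume(); // discard the body, only the status matters here
    callback(null, res.statusCode);
  }).on('error', function(err) {
    callback(err);
  });
}

// Only runs when a URL is given on the command line.
var stagingUrl = process.argv[2];
if (stagingUrl) {
  checkStatus(stagingUrl, function(err, statusCode) {
    if (err) {
      console.log('FAILED: ' + err.message);
      process.exitCode = 1;
      return;
    }
    console.log('GET ' + stagingUrl + ' returned ' + statusCode);
    process.exitCode = isHealthy(statusCode) ? 0 : 1;
  });
}
```

Saved under a name of your choosing, say `check_status.js`, it would be run as `node check_status.js http://staging.server`. Note that redirects are not followed, so a 301 or 302 counts as a failure in this sketch.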

However, that is not really good enough.

I would like to know that the pages are all rendered properly. At least three key pages to start off with: the home page, a topic page and a user page.

What makes Discourse particularly tricky is that it is an Ember.js app. You only get to see the “real” page after a pile of JavaScript work happens. Simply downloading the content and testing it is not going to cut it.

Back in the old days we would use Selenium for these kinds of tests. Trouble is, it’s not really easy to automate and involves a fairly complex setup.

These days people mostly use PhantomJS, a headless WebKit browser.

Now, if you are planning on using PhantomJS I would strongly recommend using a framework like CasperJS to lean on. It does a lot of the messy work for you. For my initial humble test I decided to write it all by hand. There were quite a few reasons.

I wanted to know how the underlying APIs work. I needed a bunch of special hacks to get it to test in a particular way, with special magic delays. I did not want to bring another complex install process into the open source project.

I ended up with this test:

```javascript
/*global phantom:true */

console.log('Starting Smoke Test');
var system = require('system');

if(system.args.length !== 2) {
  console.log("expecting phantomjs {smoke_test.js} {base_url}");
  phantom.exit(1);
}

var page = require('webpage').create();

page.waitFor = function(desc, fn, timeout, after) {
  var check,start;

  start = +new Date();
  check = function() {
    var r;

    try {
      r = page.evaluate(fn);
    } catch(err) {
      // next time
    }

    var diff = (+new Date()) - start;

    if(r) {
      console.log("PASSED: " + desc + " " + diff + "ms" );
      after(true);
    } else {
      if(diff > timeout) {
        console.log("FAILED: " + desc + " " + diff + "ms");
        after(false);
      } else {
        setTimeout(check, 50);
      }
    }
  };

  check();
};

var actions = [];

var test = function(desc, fn) {
  actions.push({test: fn, desc: desc});
};

var navigate = function(desc, fn) {
  actions.push({navigate: fn, desc: desc});
};

var run = function(){
  var allPassed = true;

  var done = function() {
    if(allPassed) {
      console.log("ALL PASSED");
    } else {
      console.log("SMOKE TEST FAILED");
    }
    phantom.exit();
  };

  var performNextAction = function(){
    if(actions.length === 0) {
      done();
    }
    else{
      var action = actions[0];
      actions = actions.splice(1);
      if(action.test) {
        page.waitFor(action.desc, action.test, 10000, function(success){
          allPassed = allPassed && success;
          performNextAction();
        });
      }
      else if(action.navigate) {
        console.log("NAVIGATE: " + action.desc);
        page.evaluate(action.navigate);
        performNextAction();
      }
    }
  };

  performNextAction();
};

page.runTests = function(){

  test("more than one topic shows up", function() {
    return jQuery('#topic-list tbody tr').length > 0;
  });

  test("expect a log in button", function(){
    return jQuery('.current-username .btn').text() === 'Log In';
  });

  navigate("navigate to first topic", function(){
    Em.run.later(function(){
      jQuery('.main-link a:first').click();
    }, 500);
  });

  test("at least one post body", function(){
    return jQuery('.topic-post').length > 0;
  });

  navigate("navigate to first user", function(){
    // for whatever reason the clicks do not respond at the beginning
    Em.run.later(function(){
      jQuery('.topic-meta-data a:first').focus().click();
    },500);
  });

  test("has about me section",function(){
    return jQuery('.about-me').length === 1;
  });

  run();
};

page.open(system.args[1], function (status) {
  console.log("Opened " + system.args[1]);
  page.runTests();
});
```

Now… after we deploy to staging we run `rake smoke:test URL=http://staging.server` and get a result.
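The rake task itself is not shown here. As a rough sketch of what any wrapper around the script has to do, the hypothetical Node.js runner below launches PhantomJS, echoes its output, and treats the run as a pass only if the output contains the `ALL PASSED` line the script prints. That matters because the script calls `phantom.exit()` with no failure code once the tests have run, so a failed run cannot be told apart by exit status alone. The file name `smoke_test.js` comes from the script's usage message; `smokeTestPassed` and `runSmokeTest` are names invented for this example, and it assumes `phantomjs` is on the PATH.

```javascript
var spawnSync = require('child_process').spawnSync;

// Hypothetical helper: a run counts as a pass only if the script
// printed its final "ALL PASSED" line.
function smokeTestPassed(output) {
  return /^ALL PASSED\r?$/m.test(output);
}

// Hypothetical runner: returns 0 on success and 1 on any failure.
function runSmokeTest(baseUrl) {
  var result = spawnSync('phantomjs', ['smoke_test.js', baseUrl], { encoding: 'utf8' });
  if (result.error) {
    console.log('Could not launch phantomjs: ' + result.error.message);
    return 1;
  }
  console.log(result.stdout);
  return smokeTestPassed(result.stdout) ? 0 : 1;
}

// Only runs when a base URL is given on the command line.
var baseUrl = process.argv[2];
if (baseUrl) {
  process.exitCode = runSmokeTest(baseUrl);
}
```

A deploy script can then refuse to promote the build whenever this runner exits with a non-zero code.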

Amazingly, less than a day after I wrote it, it already caught another junk build.

This is a start. I imagine that in a few months we will have a much more extensive smoke test process.

That said, if you do not have any kind of smoke test process I would strongly recommend exploring PhantomJS. Getting something basic up is a matter of hours.