When your systems administrator asks if you’re sitting down, you know it’s not going to be good. Mine went on to tell me that a background process I had recently written had gone a little crazy. How crazy, you ask? $13,000 crazy.

It’s not rare for me to make mistakes. Things happen, our amazing students respond with bug reports, and we fix them. This mistake didn’t come to us through those traditional channels, but it could just as easily have been caught by listening a little more closely and having a greater awareness of our systems.

To understand what went wrong, you’ll need a little background. Since Code School started, we’ve hosted videos on Viddler, an amazing video encoding and hosting service that has served us extremely well and continues to do so. In years of working with them, there have only been a handful of times when Viddler was down long enough for us to be worried. Still, videos are a critical part of our business, so having a single point of failure has always been a major concern. So, in mid-2014, I was working to add a hot-swap backup for Viddler.

Projector

To better organize backups, we moved the video responsibility to a single application (we call it Projector) that allows us to have backups of our videos and a failover in place for the rare times when Viddler is having issues. If you’re watching videos in a course, playing them from our iOS application, or watching videos from Code School, they’re going through Projector.

Rolling this project out also meant writing a script that downloaded every existing video we have from Viddler and uploaded it to an alternate CDN acting as our hot spare. This is the piece of the puzzle that went wrong.

The Culprit

Projector itself is an extremely simple Ruby on Rails application. All it needs to do is redirect valid requests for videos to wherever that video is located (Viddler or a backup source). The first time a video is loaded on Projector, it will add a background job to copy that file over to our CDN.
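
To make that flow concrete, here is a minimal plain-Ruby sketch of the idea. It is not Projector’s actual code: `Video`, `VideoRouter`, `location_for`, and the in-memory `queue` are hypothetical names made up for illustration, and the array simply stands in for the real job queue.

```ruby
# Hypothetical stand-in for a video record; not Projector's real model.
Video = Struct.new(:id, :viddler_url, :backup_url, :backup_enqueued)

# Illustrative router: picks where to send a request and schedules a backup copy.
class VideoRouter
  attr_reader :queue

  def initialize(viddler_up: true)
    @viddler_up = viddler_up
    @queue = [] # stands in for the background job queue (assumption)
  end

  # Returns the URL the request should be redirected to.
  def location_for(video)
    enqueue_backup(video)

    # Prefer Viddler; fall back to the backup copy only when Viddler is
    # down and a backup actually exists.
    if @viddler_up || video.backup_url.nil?
      video.viddler_url
    else
      video.backup_url
    end
  end

  private

  # Schedule the copy once: skip it if a backup exists or a job is already queued.
  def enqueue_backup(video)
    return if video.backup_url || video.backup_enqueued

    video.backup_enqueued = true
    @queue << video.id
  end
end
```

The detail worth noticing is the guard in `enqueue_backup`: in this sketch, a video is queued for copying at most once, no matter how many times it is requested.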

We’re using Delayed Job as our queuing system, which makes creating background jobs as easy as creating a Ruby class and starting a process on the server. If a job raises an exception, Delayed Job retries it later, but there’s a limit on how many times a job will be attempted (25 by default). Here’s what our setup looks like for saving these files to our cache server: