Spark is a collaborative, web-based tool written in React and Node.js, and deployed to Heroku. The editor is built on the open-source ProseMirror library, the content is stored in MongoDB, and collaboration happens over websockets, with a Redis caching layer for the changes coming through.
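The write path of that collaboration flow can be sketched roughly like this. This is a minimal illustration, not Spark's actual code: the `article:` key prefix, the one-hour TTL, and the `cacheChange` helper are all assumptions. The client interface (`get`/`set` with an `EX` option) matches node-redis v4, but any compatible client would do.

```javascript
// Sketch: each editing change arriving over a websocket is appended to a
// per-article entry in Redis with a short time-to-live, ahead of being
// persisted to MongoDB.
const TTL_SECONDS = 60 * 60; // assumption: cache entries live for an hour

async function cacheChange(redis, articleId, change) {
  const key = `article:${articleId}`;
  const existing = await redis.get(key);
  const changes = existing ? JSON.parse(existing) : [];
  changes.push(change);
  // Store the updated change list and refresh the TTL on every write.
  await redis.set(key, JSON.stringify(changes), { EX: TTL_SECONDS });
  return changes.length;
}
```

The short TTL keeps the cache from growing unbounded, which is exactly the property that becomes a race against the clock later in this story.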

As the team grew and the tool became more fully fledged, we became aware of knowledge silos starting to form. Inspired by our colleagues in Customer Products, we decided to host our very own Documentation Day. This was a fun (there were drinks and snacks) and productive way to both spread knowledge and decrease our operational risk.

Spark team intensely writing documentation

One of the things we learned from this exercise was that nobody in the team knew how to back up and restore the database. So the following day, two of us decided to go through the exercise in our staging environment and document the process.

This mini-chaos test taught us two valuable things that we wouldn’t have known otherwise:

First, the database was set to back up every 24 hours. This wasn't really an acceptable timeframe for us or for editorial. Second, the process of restoring a back-up caused a few minutes of downtime (understandable), and even after it finished our apps needed restarting (this was less obvious and would have bitten us in an out-of-hours scenario).

So all in all, a raging success. We increased our production back-up frequency to every two hours, and then as the day came to a close we quickly ran through the process again in staging so we could write it up.

In staging.

Staging…

It wasn’t staging.

It was production.

Holy moly, Production?!! But restoring a back-up causes downtime?! And the last back-up was at 5am?!

Yuup 😱🤦

Ok, this was bad. Spark was down and we had lost all the journalists' changes from that day. This included important articles that were due to be published in the next couple of hours. They were, understandably, going to be very annoyed.

We also faced losing all of our users’ trust and confidence that we had been working so hard to gain over the last few months. We would be back to square one if we didn’t get this fixed fast.

So, what did we do?

Instinctively we wanted to try to stop the restore. Our staging test had shown that the process took around 15 minutes, but we couldn't find anything in the UI. We fired off an email to our MongoDB provider, but it was too late to stop it. Things were looking grim: the sinking feeling sank deeper, and any glimmer of hope of recovering the content was fading fast.

“Can we take a back-up of Redis?”, someone shouted. Despite the high stress and adrenaline levels, the team somehow managed to recall that we keep a temporary cache of our articles in Redis, to speed up our collaborative editing feature. This cache has a short time-to-live, so it was imperative that we pulled this down as soon as possible!
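Pulling the cache down before the TTL expired can be as simple as scanning for the article keys and copying each value out. A rough sketch, assuming node-redis-style `keys`/`get` calls and an `article:*` key pattern (both assumptions, not Spark's real naming):

```javascript
// Sketch: snapshot every cached article out of Redis into a plain object,
// which can then be written to a local JSON file, before the TTL expires.
async function dumpArticleCache(redis, pattern = 'article:*') {
  const dump = {};
  // KEYS is acceptable here: this is a one-off rescue on a small keyspace,
  // not something to run routinely against a busy production instance.
  for (const key of await redis.keys(pattern)) {
    const value = await redis.get(key);
    if (value !== null) dump[key] = value; // a key may expire mid-scan
  }
  return dump;
}
```

On a large or busy instance, SCAN would be the politer choice than KEYS, since KEYS blocks the server while it walks the whole keyspace.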

The restore had completed, and as we learnt to do earlier in the day, we restarted the app. Spark was now showing articles from 5am that morning. We had already told our stakeholders what was happening, and to their credit, they were doing a fantastic job of protecting us from the inevitable questions trickling in from users. They managed to compile a list of the missing articles from FT.com, and prioritised them based on urgency.

With a local Redis in place, we confirmed that all 37 missing articles were there! The format of the data stored is slightly different between Redis and Mongo, but we had enough to be able to manually recreate all the content. It was a slow and finicky process, but together we managed to bring everything back exactly the way it was, and most of our users were none the wiser! 😅
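We can't show Spark's actual schemas, but the reshaping step looks something like this. Every field name and the key format here are hypothetical, purely to illustrate turning a cached Redis entry back into a MongoDB-style document:

```javascript
// Hypothetical reshaping of a cached Redis entry into a Mongo-style document.
// The cached shape and the stored shape are illustrative assumptions.
function redisEntryToMongoDoc(key, rawValue) {
  const cached = JSON.parse(rawValue);
  return {
    _id: key.replace(/^article:/, ''), // recover the article id from the key
    headline: cached.headline,
    body: cached.body,
    updatedAt: new Date(cached.updatedAt), // Redis stores strings, Mongo dates
  };
}
```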

It was late. We were mentally and emotionally drained. It was time for the Market Porter.