In my last post I described how we use Redis to manage a global lock that allows us to automatically fail over to a backup process if there is a problem in the primary process. The method described allegedly allowed for any number of backup processes to work in conjunction to pick up on primary failures and take over processing.





An astute reader pointed out that the code in that post wouldn’t actually work as advertised:

@heyjoshua I might be missing something but the code looks like it'll keep trying to acquire the lock, which it'll can't, due to the SETNX. — Nolan Caudill (@nolancaudill) December 15, 2012

The Problem

Nolan correctly noticed that when a backup process attempts to acquire the lock via SETNX, the lock key will already exist from when it was acquired by the primary. Every subsequent attempt will therefore fail, and the backups will keep trying to acquire a lock that can never be acquired. As a reminder, here’s what we do when we check back on the status of a lock:

    function checkLock(payload, lockIdentifier) {
        client.get(lockIdentifier, function(error, data) {
            // Error handling elided for brevity
            if (data !== DONE_VALUE) {
                acquireLock(payload, data + 1, lockCallback);
            } else {
                client.del(lockIdentifier);
            }
        });
    }

And here’s the relevant bit from acquireLock that calls SETNX:

    client.setnx(lockIdentifier, attempt, function(error, data) {
        if (error) {
            logger.error("Error trying to acquire redis lock for: %s", lockIdentifier);
            return callback(error, dataForCallback(false));
        }
        return callback(null, dataForCallback(data === 1));
    });
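To see why the retry can never succeed, here is a minimal sketch that replays the sequence against an in-memory stand-in. The `setnx` function and the key name below are hypothetical; they only mimic the return value of the Redis SETNX command and are not the code we run.

```javascript
// Hypothetical in-memory stand-in that mimics only the return value of SETNX.
const store = new Map();

function setnx(key, value) {
  if (store.has(key)) {
    return 0; // key already exists, nothing is written
  }
  store.set(key, String(value));
  return 1;
}

const LOCK_KEY = "lock:job-42"; // illustrative key name

// The primary wins the lock and then fails without cleaning up.
const primaryResult = setnx(LOCK_KEY, 1);

// The backup retries with an incremented attempt count each time,
// but the key written by the primary never went away.
const backupResults = [2, 3, 4].map((attempt) => setnx(LOCK_KEY, attempt));

console.log(primaryResult, backupResults, store.get(LOCK_KEY));
```

Every backup attempt comes back as 0, and the stored value is still the one the primary wrote.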

So, you’re thinking, how could this vaunted failover process ever actually work? The answer is simple: the code from that post isn’t what we actually run. The actual production code has a single backup process, so it doesn’t try to re-acquire the lock in the event of failure; it just skips right to trying to send the message itself. In the previous post, I described a more general solution that would work for any number of backup processes, but I missed this one important detail.

That being said, with some relatively minor changes, it’s absolutely possible to support an arbitrary number of backup processes and still maintain the use of the global lock. The trivial solution is to simply have the backup process delete the key before trying to re-acquire the lock (or, technically, acquire it anew). However, the problem with that becomes apparent pretty quickly. If there are multiple backup processes all deleting the lock and trying to SETNX a new lock again, there’s a good chance that a race condition could arise wherein one of the backups deletes a lock that was acquired by another backup process, rather than the failed lock from the primary.
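To make that race concrete, here is a minimal sketch that replays one unlucky interleaving of two backups. `FakeRedis` is a hypothetical in-memory stand-in that only mimics the return values of DEL and SETNX, and the key name is made up for illustration.

```javascript
// Hypothetical in-memory stand-in for the two Redis commands involved.
class FakeRedis {
  constructor() {
    this.store = new Map();
  }

  // Mimics SETNX: returns 1 if the key was set, 0 if it already existed.
  setnx(key, value) {
    if (this.store.has(key)) {
      return 0;
    }
    this.store.set(key, String(value));
    return 1;
  }

  // Mimics DEL: returns the number of keys removed.
  del(key) {
    return this.store.delete(key) ? 1 : 0;
  }
}

const redis = new FakeRedis();
const LOCK_KEY = "lock:job-42"; // illustrative key name

// The primary takes the lock and then dies without finishing.
redis.setnx(LOCK_KEY, 1);

// Backups A and B both notice the stale lock and decide to delete it.
// One possible interleaving of their commands:
redis.del(LOCK_KEY);                        // A deletes the primary's lock
const aAcquired = redis.setnx(LOCK_KEY, 2); // A acquires a fresh lock
const bDeleted = redis.del(LOCK_KEY);       // B deletes A's lock by mistake
const bAcquired = redis.setnx(LOCK_KEY, 2); // B also "acquires" the lock

console.log({ aAcquired, bDeleted, bAcquired });
```

In this ordering both SETNX calls return 1, so A and B each believe they hold the lock and would both go on to do the work.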

The Solution

Thankfully, Redis has a solution to help us out here: transactions. By using a combination of WATCH, MULTI, and EXEC, we can perform actions on the lock key and be confident that no one has modified it before our actions can complete. The process to acquire a lock remains the same: many processes will issue a SETNX and only one will win. The changes come into play when the processes that didn’t acquire the lock check back on its status. Whereas before we simply checked the current value of the lock key, now we must go through the Redis transaction process described above. First we watch the key, then we do what amounts to a check and set (albeit with a few different actions to perform based on the outcome of the check):

    function checkLock(payload, lockIdentifier, lastCount) {
        client.watch(lockIdentifier);
        client.multi()
            .get(lockIdentifier)
            .exec(function(error, replies) {
                if (!replies) {
                    // Lock value changed while we were checking it, someone else got the lock
                    client.get(lockIdentifier, function(error, newCount) {
                        setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, newCount);
                    });
                    return;
                }

                var currentCount = replies[0];
                if (currentCount === null) {
                    // No lock means someone else completed the work while we were
                    // checking on its status and the key has already been deleted
                    return;
                } else if (currentCount === DONE_VALUE) {
                    // Another process completed the work, let's delete the lock key
                    client.del(lockIdentifier);
                } else if (currentCount == lastCount) {
                    // Key still exists, and no one has incremented the lock count,
                    // let's try to reacquire the lock
                    reacquireLock(payload, lockIdentifier, currentCount, doWork);
                } else {
                    // Key still exists, but the value does not match what we expected,
                    // someone else has reacquired the lock, check back later to see
                    // how they fared
                    setTimeout(checkLock, LOCK_EXPIRY, payload, lockIdentifier, currentCount);
                }
            });
    }

As you can see, there are five basic cases we need to deal with after we get the value of the lock key:

If we get a null reply back from Redis, that means that something else changed the value of our key and our EXEC was aborted; i.e. someone else got the lock and changed its value before we could do anything. We just treat it as a failure to acquire the lock and check back again later.

If we get back a reply from Redis, but the value for the key is null, that means that the work was actually completed and the key was deleted before we could do anything. In this case there’s nothing for us to do at all, so we can stop right away.

If we get back a value for the lock key that is equal to our sentinel value, then someone else completed the work, but it’s up to us to clean up the lock key, so we issue a Redis DEL and call our job done.

Here’s where things get interesting: if the key still exists, and its value (the number of attempts that have been made) is equal to our last attempt count, then we should try to reacquire the lock.

The last scenario is where the key exists but its value (again, the number of attempts that have been made) does not equal our last attempt count. In this case, someone else has already reacquired the lock. We treat this as a failure to acquire the lock and schedule a timeout to check back later to see how whoever did acquire the lock got on. The appropriate action here is debatable. Depending on how long your underlying work takes, it may be better to actually try to reacquire the lock here as well, since whoever acquired the lock may have already failed. This can, however, lead to premature exhaustion of your attempt allotment, so to be safe, we just wait.
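The five cases above can be condensed into a small pure function, which is handy for unit testing the decision logic without a Redis server. This is only a sketch: `nextAction`, the action names, and the `"DONE"` sentinel are hypothetical and chosen for illustration.

```javascript
// Hypothetical sentinel; the real value is whatever the worker writes on completion.
const DONE_VALUE = "DONE";

// Maps the result of the WATCH / MULTI / GET / EXEC round trip to an action.
// `replies` is the EXEC result (null if the transaction was aborted) and
// `lastCount` is the attempt count this process saw on its previous check.
function nextAction(replies, lastCount) {
  if (!replies) {
    return "check-back-later"; // case 1: EXEC aborted, the key changed under us
  }
  const currentCount = replies[0];
  if (currentCount === null) {
    return "stop"; // case 2: work finished and the key is already deleted
  }
  if (currentCount === DONE_VALUE) {
    return "delete-lock"; // case 3: work finished, clean up the key
  }
  if (String(currentCount) === String(lastCount)) {
    return "reacquire"; // case 4: nobody has picked the work up since we last looked
  }
  return "check-back-later"; // case 5: someone else took the lock, wait and see
}

console.log(nextAction(null, "1"));
console.log(nextAction([null], "1"));
console.log(nextAction(["DONE"], "1"));
console.log(nextAction(["1"], 1));
console.log(nextAction(["2"], "1"));
```

Cases one and five deliberately map to the same action, which mirrors the "to be safe, we just wait" choice described above.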

So, we’ve checked on our lock, and, since the previous process with the lock failed to complete its work, it’s time to actually try to reacquire the lock. The process in this case is similar to the above inasmuch as we must use Redis transactions to manage the reacquisition process. Thankfully, however, the steps are (somewhat) simpler:

    function reacquireLock(payload, lockIdentifier, attemptCount, callback) {
        client.watch(lockIdentifier);
        client.get(lockIdentifier, function(error, data) {
            if (!data) {
                // Lock is gone, someone else completed the work and deleted the lock,
                // nothing to do here, stop watching and carry on
                client.unwatch();
                return;
            }

            var attempts = parseInt(data, 10) + 1;
            if (attempts > MAX_ATTEMPTS) {
                // Our allotment has been exceeded by another process, unwatch and expire the key
                client.unwatch();
                client.expire(lockIdentifier, ((LOCK_EXPIRY / 1000) * 2));
                return;
            }

            client.multi()
                .set(lockIdentifier, attempts)
                .exec(function(error, replies) {
                    if (!replies) {
                        // The value changed out from under us, we didn't get the lock!
                        client.get(lockIdentifier, function(error, currentAttemptCount) {
                            setTimeout(checkLock, LOCK_TIMEOUT, payload, lockIdentifier, currentAttemptCount);
                        });
                    } else {
                        // Hooray, we acquired the lock!
                        callback(null, {
                            "acquired" : true,
                            "lockIdentifier" : lockIdentifier,
                            "payload" : payload
                        });
                    }
                });
        });
    }

As with checkLock, we start out by watching the lock key, and proceed to do a (comparatively) simplified check and set. In this case, we’ve “only” got three scenarios to deal with:

If we’ve already exceeded our allotment of attempts, it’s time to give up. In this case, the allotment was actually exceeded in another worker, so we can just stop right away. We make sure to unwatch the key, and set it to expire at some point far enough in the future that any remaining processes attempting to acquire locks will also see that it’s time to give up.

Assuming we’re still good to keep working, we try to update the lock key within a MULTI/EXEC block, where we have our remaining two scenarios:

If we get no replies back, that again means that something changed the value of the lock key during our transaction and the EXEC was aborted. Since we failed to acquire the lock we just check back later to see what happened to whoever did acquire the lock.

The last scenario is the one in which we managed to acquire the lock. In this case we just go ahead and do our work and hopefully complete it!
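The last two scenarios can be replayed without a Redis server using a small model of optimistic locking. This is a sketch under stated assumptions: `FakeRedis`, `execSet`, and the key name are hypothetical, and the model only approximates WATCH and EXEC by tracking a per-key version that is bumped on every write.

```javascript
// Hypothetical in-memory model of Redis optimistic locking (WATCH / MULTI / EXEC).
class FakeRedis {
  constructor() {
    this.values = new Map();
    this.versions = new Map(); // bumped on every write to a key
  }

  get(key) {
    return this.values.has(key) ? this.values.get(key) : null;
  }

  set(key, value) {
    this.values.set(key, String(value));
    this.versions.set(key, (this.versions.get(key) || 0) + 1);
  }

  // Models WATCH: remember the key's version as seen by this caller.
  watch(key) {
    return { key, version: this.versions.get(key) || 0 };
  }

  // Models EXEC of a single queued SET: returns null (aborted) if the
  // watched key was written since WATCH, otherwise applies the SET.
  execSet(watched, value) {
    if ((this.versions.get(watched.key) || 0) !== watched.version) {
      return null;
    }
    this.set(watched.key, value);
    return ["OK"];
  }
}

const redis = new FakeRedis();
const LOCK_KEY = "lock:job-42"; // illustrative key name
redis.set(LOCK_KEY, 1); // the failed primary left an attempt count of 1

// Backups A and B both watch the key, read 1, and compute 2.
const watchA = redis.watch(LOCK_KEY);
const watchB = redis.watch(LOCK_KEY);
const nextA = parseInt(redis.get(LOCK_KEY), 10) + 1;
const nextB = parseInt(redis.get(LOCK_KEY), 10) + 1;

const resultA = redis.execSet(watchA, nextA); // A's transaction goes through
const resultB = redis.execSet(watchB, nextB); // B's transaction is aborted

console.log({ resultA, resultB, lockValue: redis.get(LOCK_KEY) });
```

Only A's write lands, so the attempt count is incremented exactly once and B falls back to checking on the lock later.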

Bonus!

To make managing global locks even easier, I’ve gone ahead and generalized all the code mentioned in both this and the previous post on the subject into a tidy little event-based npm package: https://github.com/yahoo/redis-locking-worker. Here’s a quick snippet of how to implement global locks using this new package:

    var RedisLockingWorker = require("redis-locking-worker");

    var SUCCESS_CHANCE = 0.15;

    var lock = new RedisLockingWorker({
        "lockKey" : "mylock",
        "statusLevel" : RedisLockingWorker.StatusLevels.Verbose,
        "lockTimeout" : 5000,
        "maxAttempts" : 5
    });

    lock.on("acquired", function(lastAttempt) {
        if (Math.random() <= SUCCESS_CHANCE) {
            console.log("Completed work successfully!", lastAttempt);
            lock.done(lastAttempt);
        } else {
            // oh no, we failed to do work!
            console.log("Failed to do work");
        }
    });

    lock.acquire();

There are also a few other events you can use to track the lock status:

    lock.on("locked", function() {
        console.log("Did not acquire lock, someone beat us to it");
    });

    lock.on("error", function(error) {
        console.error("Error from lock: %j", error);
    });

    lock.on("status", function(message) {
        console.log("Status message from lock: %s", message);
    });

More Bonus!

If you don’t need the added complexity of multiple backup processes, I also want to give credit to npm user pokehanai, who took the methodology described in the original post and created a generalized version of the two-worker solution: https://npmjs.org/package/redis-paired-worker.

Wrapping Up

So there you have it! Coordinating work on any number of processes across any number of hosts couldn’t be easier! If you have any questions or comments on this, please feel free to follow up on Twitter.
