The switch to AWS—June 2012

In June 2012, with 100,000 users and some level of product-market fit in hand, Joel and Tom decided it was time to migrate infrastructure so that scaling would become, well, a lot more scalable. They made the switch to AWS using Elastic Beanstalk, SQS, and S3.

With this change, here’s how post scheduling looked.

A cron job runs every minute on a single server and grabs updates that are due. For each due update, the cron job creates an SQS (Amazon Simple Queue Service) message. A worker running in a cluster of utility servers picks off a message and processes it: it posts the status update to Twitter, Facebook, etc. and marks it as sent.
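The flow above can be sketched in a few lines of Python. This is a simplified simulation, not Buffer's actual code: an in-process `queue.Queue` stands in for SQS, and all names (`cron_tick`, `worker`, the update fields) are illustrative.

```python
import queue

# Stand-in for SQS; in production this would be a real SQS queue.
sqs = queue.Queue()

def cron_tick(updates, now):
    """Scheduler side: find due updates and enqueue a message for each."""
    for u in updates:
        if u['status'] == 'buffer' and u['due_at'] <= now:
            u['status'] = 'pending'
            sqs.put(u['_id'])

def worker(updates_by_id):
    """Worker side: drain the queue and 'post' each update."""
    while not sqs.empty():
        update_id = sqs.get()
        u = updates_by_id[update_id]
        u['status'] = 'processing'
        # ... call the Twitter/Facebook API here ...
        u['status'] = 'sent'

updates = [
    {'_id': 1, 'due_at': 800, 'status': 'buffer'},
    {'_id': 2, 'due_at': 900, 'status': 'buffer'},
]
cron_tick(updates, now=800)
worker({u['_id']: u for u in updates})
# Update 1 was due and is now 'sent'; update 2 is still buffered.
```

The key property is that the scheduler and the workers only communicate through the queue, so neither side needs to know how many of the other exist.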

This new architecture formed the foundation for how Buffer works today. We’ve had an amazing experience using SQS: in almost two years of using it, we haven’t had any unexpected message delay or downtime (knock on wood). It’s incredibly well architected and handles everything we throw at it.

The separation of scheduling and processing was fundamental to tackling the challenges faced in scaling. Now it’s as simple as adding more workers to process the queue instead of upgrading to a server with beefier specs.

Concurrency

One of the early problems we encountered when switching to this queue/worker setup was random, unexpected behavior. Duplicate posting was probably the biggest and most noticeable issue. Concurrency is often hard to debug, so a lot of thought went into understanding the flow and why issues like duplicate posting would occur.

To solve this, we designed a life cycle state machine for an update. These are the states an update moves through over the course of being scheduled:

buffer — currently in the buffer

pending — picked off by the cron job and added to SQS to be scheduled

processing — picked off by a worker and currently being processed

analytics/sent — finished sending and viewable in analytics (analytics checking has its own state machine)
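The life cycle above can be written down as an explicit transition table. Here is a minimal sketch of that idea; the state names come from the list above, but the table and the `advance` helper are our own illustration, not Buffer's code:

```python
# Allowed transitions in the update life cycle.
TRANSITIONS = {
    'buffer': {'pending'},        # cron job picks it up and enqueues to SQS
    'pending': {'processing'},    # a worker claims the message
    'processing': {'sent'},       # posted to the network, visible in analytics
    'sent': set(),                # terminal here (analytics has its own machine)
}

def advance(status, new_status):
    """Move an update to new_status, refusing illegal jumps."""
    if new_status not in TRANSITIONS[status]:
        raise ValueError(f"illegal transition {status} -> {new_status}")
    return new_status

s = advance('buffer', 'pending')
s = advance(s, 'processing')
s = advance(s, 'sent')
```

Making the legal transitions explicit means a skipped or repeated step (say, two workers both trying to move the same update to `processing`) fails loudly instead of silently double-posting.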

Even with this paradigm in place, we still noticed race conditions occurring, leading to duplicate posts. This was a major issue for us and we worked hard to get it under control. To eliminate duplicate posting, we had to absolutely ensure atomicity when changing states. This is where MongoDB’s findAndModify comes in handy.

db.updates.findAndModify({
    query:  { '_id': ObjectId(), 'status': 'pending' },
    update: { $set: { 'status': 'processing' } }
});

findAndModify allows one db connection to query for an update with a ‘pending’ status and atomically change its state, so that another connection querying for the same update will see that it’s already being processed. This general rule of making state changes through findAndModify has helped resolve most of our concurrency issues.
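We can't run MongoDB in a blog snippet, but the property findAndModify gives us is a check-and-set that happens as one indivisible step. Here is a toy Python simulation of that idea under an assumed in-memory "store"; the class and method names are ours, and the lock plays the role of MongoDB's document-level atomicity:

```python
import threading

class UpdateStore:
    """Toy stand-in for the updates collection; find_and_modify mimics
    the atomic check-and-set that MongoDB's findAndModify provides."""
    def __init__(self):
        self._lock = threading.Lock()
        self.status = 'pending'

    def find_and_modify(self, expect, new):
        # Check and set under one lock, so no two callers can both succeed.
        with self._lock:
            if self.status == expect:
                self.status = new
                return True   # this caller claimed the update
            return False      # someone else got there first

store = UpdateStore()
claims = []

def worker():
    if store.find_and_modify('pending', 'processing'):
        claims.append(threading.current_thread().name)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# However many workers race for the update, exactly one wins the claim.
```

Without the atomic step, two workers could both read `pending`, both set `processing`, and both post the update, which is exactly the duplicate-posting bug described above.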

Duplicate posts make us cringe

Spiky Load

One of the really interesting challenges we face at Buffer is that the load from scheduling is incredibly spiky. Many Buffer users have set up their schedules to post hourly on the dot. This means that on a Thursday morning at 7:59am PST there are 255 updates scheduled to be posted. As soon as 8:00 hits, we schedule close to 5,000 updates to be posted. Then at 8:01 it’s back down to ~400 posts.

Here’s a visualization of when updates are due on Thursday (Feb 27) morning.

Yup, it’s spiky load

Posting Delay

As these spikes started to grow above 3k per minute, we started experiencing larger posting delays. When a customer scheduled a post exactly on the hour, it would appear on the social network a few minutes later than expected. The worst case we saw was an 8:00am post going out at 8:04. This was not acceptable.

After noticing these delays were a trend, we quickly realized that our poor cron job couldn’t handle grabbing due updates and quickly adding them to our SQS queue during these spikes. Since the cron runs once a minute and could only process a few thousand updates in that window, the job would time out. And since the cron job sat in the real-time scheduling path, it was responsible for introducing the delay, especially for the last thousand or so posts scheduled on the hour.
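Rough arithmetic shows why the delay lands in the observed range. The post only says the cron could handle "a few thousand a minute", so the exact throughput below is an assumption for illustration:

```python
# Illustrative only: the enqueue rate is an assumed figure, not a
# measured one; the post says "a few thousand a minute".
spike = 5000          # updates due at 8:00 sharp
rate_per_min = 1500   # assumed enqueue throughput of the single cron job

# Minutes until the last update of the spike is even handed to SQS
# (ceiling division, since a partial minute still costs a full tick).
minutes_to_drain = -(-spike // rate_per_min)
```

Under that assumption the tail of the spike waits about four minutes just to be enqueued, which lines up with the worst-case 8:04 posting we saw.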