License to Queue

A few months ago I joined an exciting project with fellow Shiner Graham Polley, who you might know from such hits as Put on your streaming shoes. This is a follow-up article, discussing the elegant way in which we solved a hideous asynchronous limitation in PHP.

My role on the project was DevOps-based, and I was there to build some infrastructure using Amazon Web Services. As the cool kids would put it, I was there to put our client’s enterprise application INTO the cloud, or, more succinctly, to build a solution coupling services from two rival cloud service providers and provide a new league of scalability and flexibility.

The solution was pretty simple, but, like any simple solution, the little complexities come out along the way, and when you’re least expecting them.

At the time, I had already read Graham’s blog post about streaming data into Google’s BigQuery, and so somewhere in the deep, dark catacombs of my mind, I knew that the PHP async issue was lurking, waiting for an opportune moment to rear its ugly head.

Put simply, if you need to perform an HTTP request using PHP, which is exactly what the BigQuery client library for PHP does for every database insert, the client browser is held up until the TCP session has completed. PHP offers no graceful method for performing an asynchronous, fire-and-forget request.

It wasn’t until some point in our preliminary load testing of the Apache web servers that I stumbled upon the notorious PHP asynchronous impossibility for myself. I had been sifting through a few relatively heavy-hitting PHP scripts that seemed to be causing much more load than they should, and I happened upon a section of code that… *deep breath* opens a socket back to the local Apache HTTP server to perform an HTTP request of another PHP file sitting in the same directory as itself.

Fortunately, despite all of the commotion and shock, I managed to capture a selfie, which I’ve included here to illustrate to you the raw emotion I felt at the time:

The shock to my system caused me to lose all memory of Graham’s earlier blog, and he had to explain it to me all over again. Why had such a hack been implemented in a production environment?

In short, PHP.

Hint: Now is a good time to skim through Graham’s post again if you need to freshen your memory.

A little help from Q

Over the next day or two, this issue kept bugging me. I felt dirty knowing that this hack was present in the shiny new environment that I’d built. It’s the kind of stink that, no matter how long you sit in it, your nose still doesn’t adjust.

We needed a method of pumping those requests somewhere into a queue, to be processed independently of the client-side browsing experience. If it took several seconds for a BigQuery insert process to return, that should not be reflected in the front-end user experience.

Fortunately, a solution presented itself quite quickly: Redis. We were already using it on other parts of the project and, while you might know it best as an in-memory key-value store, it can also be used as a queuing mechanism. Being in-memory was OK with us as we didn’t really need data persistence – it’s intended to be entirely transient. Consequently, setting up our queue was super-simple.

Redis has a really neat pub/sub feature, where you can simply subscribe to a named ‘channel’ in Redis, and listen. A client will ‘publish’ to the channel, and your listener will immediately be sent the data. Awesome, but there is a catch. Messages published to a channel while a listener is not listening are lost forever. This system is intended to handle events in real time, so pub/sub was given serious consideration, but ultimately decided against for this reason.

Instead, we chose Redis’ push/pop mechanism, which is an actual queue.

Push and Pop

Push and pop are pretty standard methods for managing a FIFO queue, which Redis implements as operations on a list. Push (rpush) will add a message to the right side of the list, and pop (lpop) will return the left-most value and remove it from the list. It’s a queue.

With this method, you need to loop and poll each queue yourself, but it does afford a pretty important feature: It will maintain a backlog of messages if they are not processed in time. This allows us to sustain spikes in load, or if the queue processor script is down temporarily it can catch up.
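As a sketch of that polling loop (names here are illustrative, not the production code – with redis-py, `pop` would wrap something like `lambda: r.lpop("table")`):

```python
def poll_queue(pop, handle):
    """Drain one queue: pop messages until it is empty.

    `pop` is any callable returning the next message, or None when the
    queue is empty. `handle` processes a single message. Returns the
    number of messages handled, so the caller can decide to sleep when
    a pass comes back empty.
    """
    handled = 0
    while True:
        message = pop()
        if message is None:
            break
        handle(message)
        handled += 1
    return handled
```

Because the backlog lives in the Redis list rather than in the listener, a slow or restarted processor simply picks up where the list left off.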

It seemed like a pretty natural logical separation to go with one queue per BigQuery table, and it makes for good-looking code on the client side (using php-redis):

$redis->rpush("table", "data");

Simple!

My weapon of choice

The next decision was another easy one. What will I use to write the queue processor?

Google offers a pretty mature client library for both Java and Python, which means really simple authentication and inserting into BigQuery using either language.

Python was my weapon of choice to write the queue processor. It offers a pretty decent client library for each component in my solution, and it was really quick to get a working prototype.

Note: There are BigQuery client libraries for a few other languages, all of which were either in alpha or beta, or weren’t really applicable to my situation.

Performance

This little engine started off pretty simple, but it has a lot of responsibility and some pretty hefty performance requirements, so as it was developed and tested, things got a bit more complex. Currently we are handling roughly 1000 inserts per minute, and we are barely out of proof-of-concept stage. This will be scaled up a lot more, and therefore needs to be able to handle itself – without solving a code problem using infrastructure, that is.

Parallelism

First of all, multithreading is an absolute must. It needs to handle many simultaneous HTTP sessions, and without this, an enormous backlog would be created within seconds.

For example, if a single transaction with BigQuery took 1500ms (1.5 seconds), with a single threaded queue processor we would reach a maximum of 40 requests per minute. Not even close.

The threading Python library makes writing a multithreaded application a simple exercise.
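A rough sketch of the fan-out, assuming the real BigQuery call is wrapped in an `insert` callable (the names and the thread cap are mine, not the production code) – a semaphore bounds how many 1.5-second round-trips can be in flight at once:

```python
import threading

def dispatch(batches, insert, max_threads=40):
    """Run insert(batch) for each batch, up to max_threads at a time.

    Slow HTTP round-trips to BigQuery overlap instead of queueing
    behind each other, while the semaphore stops us from spawning an
    unbounded number of threads under load.
    """
    gate = threading.Semaphore(max_threads)

    def worker(batch):
        try:
            insert(batch)
        finally:
            gate.release()      # free a slot whether or not insert failed

    threads = []
    for batch in batches:
        gate.acquire()          # block here if max_threads are in flight
        t = threading.Thread(target=worker, args=(batch,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
```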

Coalesced inserts

Even with multithreading, and a large number of concurrent threads working away, performance was just not there. The queue processor and the system looked like they were working reasonably hard, but they were only just keeping up with the load.

One database row insert per thread is pretty wasteful, though. The majority of the turnaround time for each insert is protocol overhead.

The data format for BQ inserts is simple JSON, and a quick glance at an example makes it obvious that multiple rows are not only allowed, but encouraged:

{ "rows": [ { "json": { ... } }, { "json": { ... } }, { "json": { ... } } ] }
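Folding several queued messages into that format is trivial. A sketch, assuming each Redis message is a JSON-encoded row (the function name is mine):

```python
import json

def coalesce(messages):
    """Combine several queued row inserts into one insert body.

    Each message is expected to be a JSON-encoded row object; the
    result matches BigQuery's multi-row insert format shown above, so
    one HTTP request can carry the whole batch.
    """
    return {"rows": [{"json": json.loads(m)} for m in messages]}
```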

By coalescing multiple row inserts – multiple Redis queue messages – into one HTTP request, throughput skyrocketed and the number of concurrent threads dropped. The Python code became a bit more complex to implement this, but it was definitely worth it.

The queue processor will bank up messages from each Redis queue until a threshold has been reached, and then they will be coalesced and dispatched to BigQuery in one go. In my implementation, there are actually three triggers that will cause these banks to be emptied, and messages dispatched to BigQuery:

1. The bank reaches a maximum size (currently 24 messages)
2. The number of active threads drops below a lower threshold (currently 6 threads)
3. A cleaner thread runs an unconditional dispatch every 1 second, which simply ensures nothing gets left in the queue in an unexpected situation
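The banking logic under those three triggers might be sketched like this (class and method names are mine, and `flush` stands in for the coalesce-and-dispatch step; the real processor differs in detail):

```python
import threading
import time

class Bank:
    """Per-table message bank, emptied by three triggers."""

    def __init__(self, flush, max_messages=24):
        self.flush = flush              # sends a batch off to BigQuery
        self.max_messages = max_messages
        self.messages = []
        self.lock = threading.Lock()

    def add(self, message):
        with self.lock:
            self.messages.append(message)
            if len(self.messages) >= self.max_messages:   # trigger 1: bank full
                self._dispatch()

    def drain_if_idle(self, active_threads, min_threads=6):
        with self.lock:
            if active_threads < min_threads and self.messages:  # trigger 2: spare capacity
                self._dispatch()

    def cleaner(self, interval=1.0):
        # Trigger 3: unconditional dispatch every second, run in its own
        # thread, so nothing is ever stranded in the bank.
        while True:
            time.sleep(interval)
            with self.lock:
                if self.messages:
                    self._dispatch()

    def _dispatch(self):
        batch, self.messages = self.messages, []
        self.flush(batch)
```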

So far we’ve been able to process a backlog of 8192 inserts in around 1min 30sec, or ~5460 inserts per minute, using the smallest Amazon EC2 instance available (t2.micro).

This coalesce method does mean that a message, after being picked up from the queue, sits in memory a fraction longer, but it is really only a matter of milliseconds in this case – a delay we deemed acceptable.

Going sideways

Now that we have the queue processor running reasonably fast, it’s time to think a little more about scalability and resiliency. Adding more or faster CPUs will help to a point, but the returns diminish, and it’s not practical anyway: changing the instance type in AWS requires a shutdown of the host.

In addition to scalability, I need some sort of resiliency. If the server falls in a heap, any new row inserts will be simply lost, along with any backlog that has accumulated in Redis. That’s the inherent nature of this asynchronous, fire-and-forget solution that we’ve been searching for, though. So then, what’s the plan?

Fortunately, this problem is pretty easily solved once again. Scaling this solution horizontally makes complete sense, and requires no alteration to the existing setup. Just a few more EC2 instances running an identical configuration, with an Elastic Load Balancer sitting in front. Done.

This proved to be a really solid way of increasing throughput, and provides some safety for our data, too.

I’ll get you next time, Bond

What’s next? Well, a nice way to avoid losing in-flight messages if a system failure occurred would be good. At the moment we just avoid a backlog like my dog avoids bath time. If you’ve got ideas, I’m all ears!

This was a fun problem to solve, and I’ve learnt a good lesson in the process: Taking a step back from a problem that you’re trying to solve is always advantageous. It is very easy to become blinkered in your approach, and to try to solve a problem using whatever tools are immediately in front of you, even if it’s not the right tool for the job.

Moving to AWS also opened some doors for us and allowed us to be creative with the approach to this issue. Before the migration, our toolset was very limited.

I’m super keen to hear from anyone else out there tackling a similar issue, and the measures you’ve taken to solve the problem or to work around it. If you’ve worked on something like this, drop me a note to say hello!