Over the past few months, we’ve built two high-performing recommendation services that handle tens of thousands of queries per second and generate tens of millions of connections per day. In this blog post, we want to share our experience scaling these two services using Futures and, most importantly, how we fine-tuned the details.



The first recommendation service is “Suggested Users.” The SU service fetches candidate accounts from various sources, such as your friends, accounts you may be interested in, and popular accounts in your area. A machine learning model then blends them to produce a list of personalized account suggestions. It powers the People tab on Explore, as well as one of the entry points after new users sign up. It is an important means of people discovery on Instagram and generates tens of millions of follows per day.

The second service is “Chaining.” Chaining generates a list of accounts that a viewer may be interested in on every profile load. The performance of this service is critical: it is hit on every profile visit, which translates to over 30,000 queries per second.

These two services share similar infrastructure: they make outbound network calls to retrieve suggestions from various sources, then load features and rank the candidates before returning them to our Django backend for instrumentation and filtering.



The Thrift threading model

While most of our backend logic lives in Django, we write the services that generate and rank suggestions in C++ using fbthrift. To understand the evolution of our services’ threading model, we need to understand the life cycle of a thrift request. An fbthrift server has three kinds of threads: acceptor threads, I/O threads and worker threads.

When a request comes in:

1. An acceptor thread accepts the client connection and assigns it to an I/O thread.
2. The I/O thread reads the input data sent by the client and passes it to a worker thread; the same I/O thread will later be responsible for sending outbound requests.
3. The worker thread deserializes the input data into parameters, calls the request handler of the service in its context, and spawns additional threads for outbound calls or computation.

The important part is that the thrift request handler runs in a worker thread and not in an I/O thread. This allows the server to be responsive to clients - even if all the worker threads are busy, the server will still have free I/O threads to send an overloaded response to clients and close sockets.

Synchronous I/O: The initial version

The initial version of the service loaded candidates and features synchronously. To reduce latency, all the I/O calls were issued in parallel in separate threads. At the end of the handler was a join() primitive which blocked until all the threads were done. What this essentially means is that one worker thread could only service one client request at a time, and one single request would block as many threads as the number of outbound calls.

This has several disadvantages:

It leads to a large memory footprint - each thread by default has a stack size of several MBs. We need a separate worker thread to service each client request, plus more threads created in the handler to make the I/O calls in parallel. If each of N concurrent requests makes M outbound calls, we will have O(M * N) threads waiting for responses. Thread scheduling also becomes a bottleneck in the kernel at around 400 threads. With this model, we had to run several hundred server instances across many machines to support our QPS, because we were not utilizing CPU or memory efficiently.

Clearly, there was room for improvement.

Using non-blocking I/O

fbthrift offers three ways to handle requests: synchronous, asynchronous and future-based. The latter two offer non-blocking I/O, which works as follows: every I/O thread has a list of file descriptors whose status changes it waits on in an event loop (it detects these status changes through the select()/poll()/epoll() system calls). When the status of a file descriptor changes to “completed,” the I/O thread calls the associated callback. In order to do non-blocking I/O under this mechanism, two things need to be specified:

1. A callback which will be called when the I/O is complete.
2. The I/O thread whose list should hold the file descriptor corresponding to your I/O operation (this is done by specifying an event base).

This is a typical event-driven programming system. It gives us many nice things:

1. Waiting on select()/poll()/epoll() puts a thread to sleep, which means it does not busy-wait. Thus, it is efficient. (To be clear, synchronous I/O does not necessarily busy-wait either, but it requires allocating one thread per I/O call.)
2. One I/O thread can take care of the I/O of multiple outbound requests. This reduces the memory footprint and synchronization costs associated with a large number of threads, and leads to a more scalable system.
3. One worker thread does not need to wait for all the I/O associated with a single client request to complete before moving on to the next client request. Thus, one worker thread can perform computation for multiple concurrent client requests. Once again, this gives us the benefits mentioned in 2.

Futures: A better asynchronous programming paradigm

At this point, we were in pretty good shape in terms of scalability. However, the callback-based programming style has many deficiencies. For one, it leads to code growing sideways when callbacks are nested, a problem known as the “callback pyramid”:

doIO1([..](Data io1Result) {
    doIO2([..](Data io2Result) {
        doIO3([..](Data io3Result) {
            ....
        }, io2Result);
    }, io1Result);
}, io1Input);





This has an impact on code readability and maintainability, and we needed a different async programming paradigm. Two other paradigms are very popular at Facebook - the async/await paradigm used in Hack, which is similar to generators, and the Futures paradigm (through the folly::Futures open-source framework). Futures improve upon the callback-based paradigm with their ability to be composed and chained together. For example, the above code can be written as follows in this paradigm:

doIO1(io1Input)
    .then([..](Data io1Result) {
        return doIO2(io1Result);
    })
    .then([..](Data io2Result) {
        return doIO3(io2Result);
    })
    .then([..](Data io3Result) {
        ...
    })
    .onError([](const std::exception& e) {
        // handle errors in io1, io2 or io3
    });



This is an example of futures chaining. It solves the “callback pyramid” problem and makes the code much more readable. The API also provides the ability to combine Futures together, along with very nice error-handling mechanisms. (Check out the GitHub repo for more examples and features of this API.)

Offloading handler execution from I/O threads

After moving to Futures, we had a performant system with clear, readable code. At this point, we did some profiling to find opportunities for fine-tuning and improvement. One curious thing we noticed was a significant delay between the completion of the I/O calls and the start of execution of the “then” handlers. The callbacks in the Future chain above are executed in I/O thread contexts, and some of the work they do is non-trivial. This was the source of the bottleneck: I/O threads are limited in number and are meant to service I/O status changes. Executing handlers in their context meant that they could not respond to I/O status changes fast enough, which caused the delay in handler execution. Meanwhile, the worker threads were sitting idle, leading to low CPU utilization. The solution was simple: execute handlers in the context of worker threads. Fortunately, the Futures API provides a very nice interface to achieve this:

doIO1(io1Input)
    .then(getExecutor(), [..](Data io1Result) {
        // Do work
    });



This relieved the I/O threads of actual computation such as ranking and reduced their busy time by 50%. It also helps prevent cascading failures in which the I/O threads are all busy, none of the callbacks are executed fast enough, and most requests time out.

Conclusion

With folly Futures, our services can fully exploit system resources and are more reliable and efficient than the ones with synchronous I/O.

We were able to increase the peak CPU utilization of the Suggested Users service from 10-15% to 90% per instance, which let us reduce its instance count from 720 to 38.

Chaining achieved a 40 ms average end-to-end latency and an under-200 ms p99 while handling more than 30,000 queries per second. It runs on only 38 instances, with each instance handling around 800 requests per second.

By Zhenghao Qian and Gautam Sewani, Software Engineers on the Instagram Data Team

Thanks to Sourav Chatterji, Thomas Dimson, Michael Gorven and the Facebook Wangle team, who all made great contributions to this effort.