The team at Facebook's Boston office is working on many projects that have a major impact on the company. Spiral is one of them: a system that uses machine learning to self-tune real-time services.

Spiral makes the thousands of systems running Facebook more efficient and ultimately improves Facebook's performance for billions of people.

We caught up with the team at Facebook working on this project to learn more.

What was the original challenge or problem that needed to be solved?

As Facebook grows, we continually face unique scalability challenges. One of these challenges relates to having our engineers manually adjust the code that powers the thousands of services Facebook is built on, with functions ranging from balancing internet traffic to transcoding images to providing reliable storage. Each service is typically maintained in its own way, with approaches that may be difficult to generalize or adapt in the face of fast-paced changes.

Engineers in our Facebook Boston office took on the challenge to find a more efficient way of optimizing these thousands of systems than relying on hand-tuned heuristics — ultimately ensuring engineering teams can scale with the growth of our apps and services.

What is Spiral and how did it address the problem?

Spiral is a self-tuning system that uses real-time machine learning techniques to optimize Facebook backend services. This unique approach gives the system the flexibility to adapt to a constantly changing interconnected web of internal systems, and fundamentally changes the way engineers are building systems.

As a result, Spiral replaces manual maintenance with hands-free automation, cutting development time from weeks to minutes while significantly reducing operational load, especially for configuration and tuning.

When you are developing a solution that needs to work at the scale of Facebook, how do you go about the initial thinking and planning?

The key to doing something at Facebook scale is empathy. Since Spiral is infrastructure related, we started by finding out the biggest pain point for our engineers in delivering the best possible Facebook experience. This gave us a very shiny north star that defined both the broad area of focus and the approach.

I often joke that after two decades in the industry, Facebook is the first job where my father (a doctor in India) doesn't have to ask me, “What exactly do you do for a living?” He uses and loves Facebook. So, at a very high level, the first question to be answered is always: how does the solution improve the Facebook experience for people around the world? By automating the operational workload, we free up engineers to be engineers and work on new Facebook products and experiences.

Secondly, the best way to solve a very big problem is to break it up into much smaller problems that can be dealt with more efficiently. For example, Spiral is part of a bigger vision of completely automating the operational grunt work needed to keep Facebook's infrastructure humming. We started with one of the biggest pain points — automating configuration and tuning for real-time services — which was actually a two-part problem.

Part one was working with the engineers who develop the services that Spiral automates to frame the meta-problem related to manually-tuned heuristics. We uncovered that the problem was centered on the fragility and opacity associated with the previous status quo.

Part two was designing a solution that would not compound the problem. Artificial Intelligence/Machine Learning (AI/ML) is a very powerful hammer (or rather a bagful of many hammers). But, as the saying goes, “If all you have is a hammer, everything looks like a nail.” We needed to carefully frame the larger problem and break it down into sub-problems. From there, we looked at constraints and objectives to plan where, when and how much to apply AI/ML techniques. A lot of the success of conceiving, building and shipping this system came from the continuous and steady involvement of the service owners, the end users of Spiral, which made sure that we had a focused approach to building something that resonated accurately with their pains and needs.

How is Spiral leveraging real-time machine learning?

Spiral uses machine learning to create data-driven and reactive heuristics for resource-constrained, real-time services.

At its core, Spiral is a small, embedded C++ library with very few dependencies. Integration with Spiral consists of adding just two call sites to your code: one for prediction and one for feedback.

The prediction call site is the output of the smart heuristic used to make decisions, such as “Should this item be admitted into the cache?” The prediction call is implemented as a fast, local computation and is meant to be executed on every decision.

The feedback call site is for providing occasional feedback, such as “This item expired from the cache without ever being hit, so we should probably not cache items like this one.”
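The two call sites described above can be sketched roughly as follows. This is a minimal illustration, not Spiral's actual API: the class name `AdmissionModel`, its methods, the two-element feature vector, and the learning rate are all assumptions made for the example, and the tiny online logistic regression simply stands in for whatever model the real library uses.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical names throughout; this is NOT Spiral's actual API.
constexpr std::size_t kNumFeatures = 2;
constexpr double kLearningRate = 0.5;
using Features = std::array<double, kNumFeatures>;

// A tiny online logistic-regression model standing in for the learned heuristic.
class AdmissionModel {
 public:
  // Prediction call site: a fast, local computation run on every decision.
  bool shouldAdmit(const Features& item) const { return score(item) >= 0.5; }

  // Feedback call site: called occasionally, once the outcome is known.
  // wasHit == false means the item expired without ever being hit.
  void feedback(const Features& item, bool wasHit) {
    const double error = (wasHit ? 1.0 : 0.0) - score(item);
    for (std::size_t i = 0; i < kNumFeatures; ++i) {
      weights_[i] += kLearningRate * error * item[i];
    }
    bias_ += kLearningRate * error;
  }

 private:
  // Estimated probability that the item will be hit while it is cached.
  double score(const Features& item) const {
    double z = bias_;
    for (std::size_t i = 0; i < kNumFeatures; ++i) {
      z += weights_[i] * item[i];
    }
    return 1.0 / (1.0 + std::exp(-z));
  }

  Features weights_{};  // value-initialized to zeros
  double bias_ = 0.0;
};
```

In a cache built this way, `shouldAdmit` would sit on the insert path and `feedback` on the eviction or expiry path. With no feedback yet, the sketch admits everything, and it only starts rejecting a kind of item after that kind has repeatedly expired unused.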

Can you share the background and qualifications of some of the engineers who worked on this project?

The Spiral team is a mix of Distributed Systems and AI/ML engineers. Some of us also have additional backgrounds in Data Science and Mathematics. This unique mix of expertise allowed us to balance the nuances of the diverse domains of AI/ML and systems engineering to deliver a usable product in a very short amount of time.

Interestingly, quite a few of the team members are alumni of Massachusetts-based universities, including MIT, Harvard and UMass Amherst.

Facebook's Spiral Team (left to right): Saurav Mohapatra, Lili Hu, Alvin Wen, Jim Cipar, and Vladimir Bychkovsky

It was mentioned that the approach for developing Spiral was similar to declarative programming. Can you share how the two are similar?

Sure. When using Spiral, engineers express what it means for a system to operate correctly and efficiently in code, as opposed to looking at charts and logs produced by the system to verify correct and efficient operation. This concept of encoding the means of providing feedback to a self-tuning system, as opposed to specifying how to compute correct responses or requests, is similar to declarative programming.

In other words, Spiral uses machine learning to flip the problem of server tuning on its head. Instead of fiddling with config (and looking at metrics) to reach an optimal state, we start with a definition of what optimal performance looks like and then, with feedback, steer the system toward that state.
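To make the "declare the target, then steer with feedback" idea concrete, here is a deliberately simple sketch. It uses a plain proportional feedback loop rather than machine learning, so it is not Spiral's method. The names `SelfTuningKnob`, `toyMetric` and `tuneToyService` are hypothetical, and so is the toy service whose metric is exactly twice the knob setting.

```cpp
// Hypothetical sketch: a proportional feedback loop, not Spiral's learning method.
class SelfTuningKnob {
 public:
  SelfTuningKnob(double initial, double target, double gain)
      : value_(initial), target_(target), gain_(gain) {}

  // Current setting the service should use.
  double value() const { return value_; }

  // Feedback: report the measured metric; the knob moves toward the target.
  void observe(double measured) { value_ += gain_ * (target_ - measured); }

 private:
  double value_;
  double target_;  // the declared definition of "optimal"
  double gain_;    // how strongly each observation moves the knob
};

// Toy service (an assumption for illustration): the metric is 2x the knob.
double toyMetric(double knob) { return 2.0 * knob; }

// Runs the feedback loop and returns the final knob setting.
double tuneToyService(int steps) {
  SelfTuningKnob knob(/*initial=*/1.0, /*target=*/10.0, /*gain=*/0.25);
  for (int i = 0; i < steps; ++i) {
    knob.observe(toyMetric(knob.value()));
  }
  return knob.value();
}
```

The engineer only states the target (a metric of 10) and supplies measurements. Nobody picks the knob value by hand, yet in this toy setup it settles near 5, where the metric meets the target.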

What were some of the major obstacles that the team had to overcome while creating Spiral?

When it comes to infrastructure, failure is really not an option. Infrastructure is the bedrock on which everything else runs. If you break it, the ripples run fast and wide. The best infrastructure is like background music in a movie — if it works as intended, you don't notice it and it accentuates your experience. If the music doesn't fit the scene or action, it's extremely noticeable and distracting.

When we decided to create Spiral for infrastructure automation, we used to joke that the challenge before us was a bit like the movie Speed. We have a bus moving at great speed, and we have to achieve our goal without slowing the bus down.

Our prime goal was to be able to deliver the benefits of using AI/ML for auto-tuning the services without causing the slightest disruption to the way these services operate or function for the outside world. Keep in mind: some services get called on every time a person accesses Facebook! Considering Facebook's scale and reach, it was an exciting (and terrifying!) constraint to work with.

Related to your earlier question, the challenge of using declarative programming for systems is making sure objectives are specified correctly and completely. As with the self-tuning image cache policy above, if the feedback for what should and should not be cached is inaccurate or incomplete, the system will quickly learn to make incorrect caching decisions, which will degrade performance.

In our experience, precisely defining the desired outcome for self-tuning is one of the hardest parts of onboarding with Spiral. However, we also found that engineers tended to converge on clear and correct definitions after a few iterations.

The bottom line is that the declarative approach helped us tackle both of these challenges together. By working closely with our users (the service owners) to define what optimal operation meant for them, we were able to drive our own decision making and move constraints off the system.

Can you provide an example of how this technology is being used and its impact?

Our team gave a detailed example in our recent blog post on how Spiral is helping to reduce the load on the web front-end service that computes query results. It's quite technical, but here's an illustration:

There's an enormous volume of database updates, but only a tiny fraction of them affect the output of the query. If a query is interested in “Which of my friends liked this post?” it is unnecessary to get continuous updates on, for example, when the post was most recently viewed.

Spiral is automating the process of determining when a piece of information is irrelevant (doesn't affect a query result), and when a piece of information does affect a query result.
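A rough sketch of that idea, under stated assumptions: the class `RelevanceFilter`, its methods, the per-field counters, and the `kMinSamples` threshold are all hypothetical, and simple counting stands in for whatever model Spiral actually learns. The filter treats an update as relevant until it has seen enough feedback showing that updates to that field never changed the query result.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical sketch; simple counting stands in for a learned model.
constexpr int kMinSamples = 10;  // feedback needed before skipping updates

class RelevanceFilter {
 public:
  // Prediction: could an update to this field change the query result?
  // Stays conservative (returns true) until enough feedback has arrived.
  bool mayAffectResult(const std::string& field) const {
    const auto it = stats_.find(field);
    if (it == stats_.end() || it->second.seen < kMinSamples) {
      return true;
    }
    return it->second.changed > 0;
  }

  // Feedback: after re-running the query, report whether the result changed.
  void feedback(const std::string& field, bool resultChanged) {
    Stats& s = stats_[field];
    ++s.seen;
    if (resultChanged) {
      ++s.changed;
    }
  }

 private:
  struct Stats {
    int seen = 0;     // updates to this field with known outcome
    int changed = 0;  // how many of those changed the query result
  };
  std::unordered_map<std::string, Stats> stats_;
};
```

For the "Which of my friends liked this post?" query, feedback on a last-viewed field would keep reporting no change, so the filter would eventually stop forwarding those updates, while updates to likes would keep passing through. A production system would also need to keep sampling skipped updates so that it can notice if a field becomes relevant later.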

To learn more about Spiral, check out this Facebook Live Q&A with the Facebook Boston engineering team that brought Spiral to life!