Updated March 2019.

Ever wondered which database Facebook (FB) uses to store the profiles of its 2.3B+ users? Is it SQL or NoSQL? How has FB database architecture evolved over the last 15+ years? As an engineer in FB database infrastructure team from 2007 to 2013, I had a front row seat in witnessing this evolution. There are invaluable lessons to be learned by better understanding the database evolution at the world’s largest social network, even though most of us won’t be facing exactly the same challenges in the near future. This is because the fundamental tenets that underpin FB’s internet-scale, globally-distributed architecture today apply to many business-critical enterprise apps such as multi-tenant SaaS, retail product catalog/checkout, travel bookings and gaming leaderboards.

Initial Architecture

As any FB user can easily understand, his/her profile is not simply a list of attributes such as name, email, interests and so on. It is in fact a rich social graph that stores all friend/family relationships, groups, check-ins, likes, shares and more. Given the data modeling flexibility of SQL and the ubiquity of MySQL when FB started, this social graph was initially built as a PHP application powered by MySQL as the persistent database and memcache as a “lookaside” cache.

Facebook’s original database architecture

In the lookaside caching pattern, the application first requests data from the cache instead of the database. If the data is not cached, the application gets the data from the backing database and puts it into the cache for subsequent reads. Note that the PHP application was accessing MySQL and memcache directly without any intermediate data abstraction layer.

Growing Pains

FB’s meteoric success from 2005 onwards put enormous strain on the simplistic database architecture highlighted in the previous section. Following were some of the growing pains FB engineers had to solve in a short period of time.

Loss of Developer Agility

Engineers had to work with two data stores with two very different data models: a large collection of MySQL master-slave pairs for storing data persistently in relational tables, and an equally large collection of memcache servers for storing and serving flat key-value pairs derived (some indirectly) from the results of SQL queries. Working with the database tier now mandated first gaining intricate knowledge of how the two stores worked in conjunction with each other. Net result was loss in developer agility.

Application-Level Database Sharding

Inability of MySQL to scale write requests beyond one node became a killer problem as data volumes grew leaps and bounds. MySQL’s monolithic architecture essentially forced application-level sharding very early on. This meant that the application now tracked which MySQL instance is responsible for storing which user’s profile. The development and operational complexity grows exponentially when the number of such instances grow from 1 to 100s and thereafter explode into 1000s. Note that adherence to such an architecture meant that application no longer uses the database to perform any cross-shard JOINs and transactions, thereby giving up on the full power of SQL (as a flexible query language) in order to scale horizontally.

Multi-Datacenter, Geo-Redundant Replication

Handling datacenter failures also became a critical concern which meant storing MySQL slaves (and corresponding memcache instances) in multiple geo-redundant datacenters. Perfecting and operationalizing failovers was no easy feat in itself but given the master-slave asynchronous replication, recently committed data would still be missing whenever such a failover was undertaken.

Loss of Consistency Between Cache & DB

The memcache in front of a remote-region MySQL slave cannot immediately serve strongly (aka read-after-write) consistent reads because of the asynchronous replication between the master and the slave. And, the resulting stale reads in the remote region can easily lead to confused users. E.g. a friend request can show up as accepted to one friend while showing up as still pending to the other.

Enter TAO, a NoSQL Graph API on Sharded SQL

Early 2009 FB started building TAO, a FB-specific NoSQL graph API built to run on sharded MySQL. The goal was to solve the problems highlighted in the previous section. TAO stands for “The Associations and Objects”. Even though the design for TAO was first published as a paper in 2013, the implementation for TAO was never open sourced given the proprietary nature of the FB social graph.

TAO represented data items as nodes (objects) and relationships between them as edges (associations). The FB application developers loved the API because they could now easily manage database updates and queries necessary for their application logic with no direct knowledge of MySQL or even memcache.

Architecture

As shown in the figure below, TAO essentially converted FB’s existing 1000s of manually-sharded MySQL master-slave pairs to a highly-scalable, auto-sharded, geo-distributed database cluster. All objects and associations in the same shard are stored persistently in the same MySQL instance, and are cached on the same set of servers in each caching cluster. Placement of individual objects and associations can be directed to specific shards at creation time when needed. Controlling the degree of data collocation proved to be an important optimization technique for providing low latency data access.

SQL-based access patterns such as cross-shard ACID transactions and JOINs were disallowed in TAO as a means to preserve such low latency guarantees. However, it supported non-atomic two-shard writes in the context of an association update (whose two objects may be in two different shards). In case of failures after one shard update but before the second shard update, an asynchronous repair job would clean up the “hanging” association at a later time.

Shards can be migrated or cloned to different server in the same cluster to balance the load and to smooth out load spikes. Load spikes were common and happen when a handful of objects or associations become extremely popular as they appear in the News Feeds of tens of millions of users at the same time.

TAO Architecture (Source: FB Engineering Blog)

Is there a General Purpose Enterprise Solution?

FB had no choice but to massively scale the MySQL database layer responsible for its user’s social graph. Neither MySQL nor the other SQL databases available at that time could solve this problem on their own. So, FB used its significant engineering might to essentially create a custom database query layer that abstracted the underlying sharded MySQL databases. In doing so, it forced its developers to completely give up on SQL as a flexible query API and adopt TAO’s custom NoSQL API.

Most of us in the enterprise world do not have Facebook-scale problems but nevertheless want to scale out SQL databases on-demand. We love SQL for its flexibility and ubiquity, which means we want to scale without giving up on SQL. Is there a general purpose solution for enterprises like us? The answer is Yes!

Hello Distributed SQL!

Monolithic SQL databases have been trying for 10+ years to become distributed in order to solve the horizontal scaling problem. As “Rise of Globally Distributed SQL Databases” highlights, the first wave of such databases were called NewSQL and included databases such as Clustrix, NuoDB, Citus and Vitess. These have had limited success in displacing manually sharded SQL databases. The reason is that the new value created is not enough to radically simplify the developer and operations experience. Clustrix and NuoDB mandate specialized, highly reliable, low latency datacenter infrastructure — modern cloud native infrastructure looks exactly the opposite. Citus and Vitess simplify the operations experience to an extent by auto-sharding the database but then handicap the developer by not giving him/her a single logical distributed SQL database.

We are now in the second generation of distributed SQL databases where massive scalability and global data distribution are built into the database layer as opposed to 10 years back when Facebook had to build these features into the application layer.

Inspired by Google Spanner

While FB was building TAO, Google was building Spanner, a completely new globally-consistent database to solve very similar challenges. Spanner’s data model was less of a social graph but more of a traditional, random-access OLTP workload that manages Google’s users, customer organizations, AdWords credits, GMail preferences and more. Spanner was first introduced to the world in the form of a design paper in 2012. It started out in 2007 as a transactional key-value store but then evolved into a SQL database. The shift to SQL as the only client language accelerated as Google engineers realized that SQL has all the right constructs for agile app development especially in the cloud native era where infrastructure is a lot more dynamic and failure-prone than the highly reliable private datacenters of the past. Today multiple modern databases (including YugabyteDB) have brought the Google Spanner design to life completely in open source.

Handling Internet-Scale Data Volume with Ease

Sharding is completely automatic in the Spanner architecture. Additionally, shards become auto balanced across all available nodes as new nodes are added or existing nodes are removed. Microservices needing massive write scalability can now rely on the database directly as opposed to adding new infrastructure layers similar to the ones we saw in the FB architecture. No need for an in-memory cache (that offloads read requests from the database thereby freeing it up for serving write requests) and also no need for a TAO-like application layer that does shard management.

Extreme Resilience Against Failures

A key difference between Spanner and the legacy NewSQL databases we reviewed in the previous section is Spanner’s use of per-shard distributed consensus to ensure each shard (and not simply each instance) remains highly available in the presence of failures. Similar to TAO, infrastructure failures always affect only a subset of data (only those shards whose leaders get partitioned away) and never the entire cluster. And, given the ability of the remaining shard replicas to auto-elect a new leader in seconds, the cluster exhibits self-healing characteristics when subjected to failures. The application remains transparent to these cluster config changes and continues to work normally without outages or slowdowns.

Seamless Replication Across the Globe

The benefit of a globally-consistent database architecture is that microservices needing absolutely correct data in multi-zone and multi-region write scenarios can finally rely on the database directly. Conflicts and data loss observed in typical multi-master deployments of the past do not occur. Features such as table-level and row-level geo-partitioning ensure that data relevant to the local region remains leadered in the same region. This ensures that the strongly consistent read path never incurs cross-region/WAN latency.

Full Power of SQL & Distributed ACID Transactions

Unlike the legacy NewSQL databases, SQL and ACID transactions in their complete form can be supported in the Spanner architecture. Single-key operations are by default strongly consistent and transactional (the technical term is linearizable). Single-shard transactions by definition are leadered at a single shard and hence can be committed without the use of a distributed transaction manager. Multi-shard (aka distributed) ACID transactions involve a 2-Phase Commit using a distributed transaction manager that also tracks clock skews across the nodes. Multi-shard JOINs are similarly handled by querying data across the nodes. The key here is that all data access operations are transparent to the developer who simply uses regular SQL constructs to interact with the database.

Summary

The data infrastructure scaling stories at any of the technology giants including FB and Google makes for great engineering learning. At FB, we took the path of building TAO which allowed us to preserve our existing investment in sharded MySQL. Our application engineers lost the ability to use SQL but gained a bunch of other benefits. Engineers at Google were faced with similar challenges but they chose a different path by creating Spanner, an entirely new SQL database that can horizontally scale, seamlessly geo-replicate and easily tolerate infrastructure failures. FB and Google are both incredible success stories so we cannot say that one path was better than the other. However, when we expand the horizon to general-purpose enterprise architectures, Spanner comes ahead of TAO because of all the reasons highlighted in this post. By building YugabyteDB’s storage layer on the Spanner architecture, we believe we can bring the developer agility of the technology giants to the enterprises of today.

What’s Next?