This write-up takes a close look at the persistence layer of Facebook & is the first of the real-life software architecture series. It answers questions like: what database does Facebook use? Is it a NoSQL or a SQL DB? Is it just one database or several? Or is it a more complex polyglot database architecture? How does Facebook’s backend handle billions of data fetch operations every single day?

So, without further ado.

Let’s get started.






1. What Database Does Facebook Use?

If you need a quick answer here it is:

MySQL is the primary database used by Facebook for storing all the social data. The team started with the InnoDB storage engine & later wrote MyRocks, a MySQL storage engine built on RocksDB, which eventually replaced InnoDB for the user-facing databases.

Memcache sits in front of MySQL as a cache.

To manage big data, Facebook uses Apache Hadoop, HBase, Hive, Apache Thrift & PrestoDB. These are used for data ingestion, warehousing & running analytics, with Thrift being the RPC & serialization framework that lets the services talk to each other.

Apache Cassandra is used for the Inbox search.

Beringei & Gorilla, high-performance time-series storage engines, are used for infrastructure monitoring.

LogDevice, a distributed data store, is used for storing logs.

Now let’s delve into the specifics.





1.1 Polyglot Persistence Architecture

Facebook, today, is not a monolithic architecture. It might have been at one point in time like LinkedIn was but definitely not today.

The social network as a whole consists of several loosely coupled components plugged in together like Lego blocks. For instance, photo sharing, Messenger, the social graph, user posts etc. are all different loosely coupled microservices running in conjunction with each other. And every microservice has a separate persistence layer to keep things easy to manage.



Polyglot persistence architecture has several upsides. Different databases with different data models can be leveraged to implement different use cases. And since every component has its own persistence layer, the system is more highly available & easier to scale.





1.2 What is a Polyglot Persistence System?

Put in simple words, Polyglot Persistence means using different databases having unique features & data models for implementing different use cases/business requirements.

For instance, Cassandra & Memcache serve different requirements than a traditional MySQL database.

If we have ACID requirements, as for a financial transaction, a MySQL database would serve best. When we need to access data fast, we would go for Memcache. And when we are less concerned about strict data consistency & duplicate data but need a fast, highly available, eventually consistent persistence system, a NoSQL solution would fit best.




So, now we kind of have a clue that there is no silver bullet. Well, naturally we aren’t hunting werewolves here. We are up against a much harder task: connecting the world online.

For the same reason, Facebook too uses several different databases to fulfil its different data model & persistence requirements.

1.3 Does Facebook Use A Relational Database System?

MySQL

This is the primary database that Facebook uses with different engines. Different engines? I’ll get to that.

Facebook leverages the social graph to track & manage all the user events on the portal, such as who liked whose post, mutual friends, which of your friends already ate at the restaurant you visited etc. And the social graph is powered by MySQL.
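To make the idea of a graph living in a relational database a bit more concrete, here is a minimal sketch of a mutual-friends lookup. Everything in it is an assumption for illustration: sqlite3 is only a stand-in for MySQL, & the friendship table and the function names are hypothetical, not Facebook's actual schema.

```python
import sqlite3

# sqlite3 is only a stand-in for MySQL here; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE friendship ("
    " user_id TEXT, friend_id TEXT, PRIMARY KEY (user_id, friend_id))"
)


def add_friendship(a, b):
    """Store the edge in both directions so lookups stay simple."""
    conn.executemany(
        "INSERT OR IGNORE INTO friendship VALUES (?, ?)", [(a, b), (b, a)]
    )


def mutual_friends(a, b):
    """Self-join the edge table to find friends that both users share."""
    rows = conn.execute(
        """
        SELECT f1.friend_id
        FROM friendship AS f1
        JOIN friendship AS f2 ON f1.friend_id = f2.friend_id
        WHERE f1.user_id = ? AND f2.user_id = ?
        ORDER BY f1.friend_id
        """,
        (a, b),
    )
    return [row[0] for row in rows]


for pair in [("alice", "carol"), ("alice", "dave"), ("bob", "carol"), ("bob", "erin")]:
    add_friendship(*pair)

print(mutual_friends("alice", "bob"))  # ['carol']
```

The point of the sketch is only that graph questions can be expressed as joins over an edge table; at Facebook's scale there is a lot more machinery around it, caching included.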

Initially, the Facebook engineering team started with the MySQL InnoDB engine to persist social data. Even with compression, the data just took up too much space. More space usage meant more hardware, which naturally spiked the data storage costs.

InnoDB MySQL Storage Engine

InnoDB is the default MySQL storage engine, providing high reliability & performance.

InnoDB Architecture

MyRocks MySQL Storage Engine Written by Facebook

To deal with the space issues, the engineering team at Facebook wrote a new MySQL storage engine, MyRocks, which reduced the space usage by 50% & helped improve the write efficiency.

Over time Facebook migrated its user-facing database from InnoDB to the MyRocks engine. The migration didn’t pose much difficulty since the core tech, MySQL, stayed the same; only the storage engine changed.

After the migration, the engineering team ran data consistency verification checks for a long while to confirm that everything went smoothly.

Several benchmarks were run to evaluate the DB performance. The results showed the MyRocks instance to be 3.5 times smaller than the uncompressed InnoDB instance & 2 times smaller than the compressed InnoDB instance.
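To see how these ratios relate to each other, here is a quick back-of-the-envelope calculation. The 350 GB starting size is a made-up number picked only to keep the arithmetic round; the two ratios are the ones quoted in the benchmark results.

```python
# Hypothetical starting size, picked only to keep the numbers round.
innodb_uncompressed_gb = 350.0

# Ratios quoted in the benchmark results.
myrocks_gb = innodb_uncompressed_gb / 3.5   # 3.5 times smaller than uncompressed InnoDB
innodb_compressed_gb = myrocks_gb * 2       # MyRocks is 2 times smaller than compressed InnoDB

# What the two ratios imply when put together.
saving_vs_compressed = 1 - myrocks_gb / innodb_compressed_gb
implied_innodb_compression = innodb_uncompressed_gb / innodb_compressed_gb

print(f"MyRocks: {myrocks_gb:.0f} GB")
print(f"InnoDB compressed: {innodb_compressed_gb:.0f} GB")
print(f"Saving vs compressed InnoDB: {saving_vs_compressed:.0%}")
print(f"Implied InnoDB compression ratio: {implied_innodb_compression:.2f}x")
```

Being 2 times smaller than compressed InnoDB is the same thing as a 50% space saving, so the two figures line up.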

WebScaleSQL

MySQL is one of the most widely deployed persistence technologies & is naturally run by the big guns.

WebScaleSQL is a collaboration amongst engineers from several companies running MySQL in production at scale, such as Google, Twitter, LinkedIn & Alibaba, to build & enhance the MySQL features required in large-scale production environments.

Facebook has one of the largest MySQL deployments in the world. And it shares a common WebScaleSQL development codebase with the other companies.

The engineering team at Facebook is preparing to move its production-tested versions of table, user & compression statistics into WebScaleSQL.






2. RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

Initially, Facebook wrote RocksDB, an embeddable persistent key-value store for fast storage, which being a key-value store had some advantages over MySQL. RocksDB was inspired by LevelDB, a datastore written by Google.

It sure was fast, but it did not support replication or an SQL layer, & the Facebook engineering team wanted those MySQL features on top of RocksDB. Eventually, they built MyRocks, an open-source project that plugs RocksDB in as a MySQL storage engine. Pretty geeky!!

RocksDB fits best when we need to store multiple terabytes of data in one single database.

Some of the typical use cases for RocksDB:

1. Implementing a message-queue that supports a large number of inserts & deletes.

2. Spam detection where you require fast access to your dataset.

3. A graph search query that needs to scan a data set in real-time.
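Why does a store like this cope well with lots of inserts & deletes? RocksDB is built around a log-structured merge (LSM) tree, where writes land in memory first & are flushed to sorted, immutable files later. Below is a toy sketch of that idea. It is not RocksDB's API; the TinyLSM class, its method names & the tiny memtable limit are all hypothetical & chosen only for illustration.

```python
import bisect

_TOMBSTONE = object()  # marks a deleted key until compaction removes it


class TinyLSM:
    """Toy log-structured merge store: one memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit
        self.memtable = {}  # recent writes, held in memory
        self.runs = []      # flushed runs, oldest first; each a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        # Deletes are writes too: record a tombstone instead of touching old runs.
        self.put(key, _TOMBSTONE)

    def _flush(self):
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self._visible(self.memtable[key])
        for run in reversed(self.runs):  # newest run wins
            # (key,) sorts just before any (key, value) pair, so this finds the key's slot.
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return self._visible(run[i][1])
        return None

    def compact(self):
        merged = {}
        for run in self.runs:  # oldest first, so newer values overwrite older ones
            merged.update(run)
        live = sorted((k, v) for k, v in merged.items() if v is not _TOMBSTONE)
        self.runs = [live]

    @staticmethod
    def _visible(value):
        return None if value is _TOMBSTONE else value


db = TinyLSM(memtable_limit=2)
db.put("user:1", "alice")
db.put("user:2", "bob")      # memtable is full, so it is flushed as a sorted run
db.put("user:1", "alicia")   # a newer value shadows the one in the old run
db.delete("user:2")          # writes a tombstone, then triggers a second flush
db.compact()                 # merges both runs and drops the deleted key
print(db.get("user:1"), db.get("user:2"))
```

Since inserts, updates & deletes are all just appends to the newest structure, writes stay cheap, & the cleanup cost is paid later during compaction.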





3. Memcache – Distributed Memory Caching System

Memcache has been used at Facebook right from the start. It sits in front of the database, acts as a cache & intercepts the data requests intended for the database.

Memcache reduces the request latency to a large extent, which makes for a smooth user experience. It powers the Facebook social graph, which has trillions of objects & connections & grows every moment.

Memcache is a distributed memory caching system, also used by other big guns in the industry; Google Cloud, for example, offers it as a managed service.

Facebook Caching Model

When a user changes the value of an object, the new value is written to the database & the old entry is deleted from Memcache. The next time a user requests that object, the value is fetched from the database & written to Memcache. From then on, every request is served from Memcache until the value is modified again.
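The flow described here can be sketched in a few lines. This is a minimal illustration, not Facebook's implementation: the two dictionaries are stand-ins for MySQL & Memcache, & the function names are hypothetical.

```python
database = {}  # stand-in for MySQL
cache = {}     # stand-in for Memcache


def write(key, value):
    """Write to the database, then delete the cached entry so it cannot go stale."""
    database[key] = value
    cache.pop(key, None)


def read(key):
    """Serve from the cache when possible; otherwise load from the database and fill the cache."""
    if key in cache:
        return cache[key], "cache"
    value = database[key]
    cache[key] = value
    return value, "database"


write("user:42:name", "Ada")
print(read("user:42:name"))   # ('Ada', 'database')  first read fills the cache
print(read("user:42:name"))   # ('Ada', 'cache')
write("user:42:name", "Ada L.")
print(read("user:42:name"))   # ('Ada L.', 'database')  the write invalidated the entry
```

Deleting the entry on a write, rather than updating it in place, keeps the cache simple: the database stays the single source of truth & the cache is only ever filled from it.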

This flow appears pretty solid until the database & the cache are deployed in a distributed environment. Now the concept of eventual consistency comes into effect.

The instances of an app are geographically distributed. When one node of a distributed database is updated, say in Asia, it takes a while for the change to cascade to all the other running instances of the database, until they all hold a uniform, consistent value. This is known as eventual consistency.

Now, right at the point in time when the value is updated in the Asia instance, if a person requests the object from America, they will receive the old value from the cache.

Well, this is typically a tradeoff between high availability & data consistency.
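Here is a small simulation of that stale read. It is only a sketch under assumptions: the Region class, the pending replication queue & the choice to invalidate the remote cache when the change arrives are all hypothetical & are not a description of Facebook's actual replication pipeline.

```python
class Region:
    """One data centre with its own database replica and its own cache."""

    def __init__(self, name):
        self.name = name
        self.database = {}
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.database[key]
        return self.cache[key]


def write(primary, key, value, pending):
    """Apply the write locally and queue it for asynchronous replication."""
    primary.database[key] = value
    primary.cache.pop(key, None)
    pending.append((key, value))


def replicate(replica, pending):
    """Deliver queued changes to a replica and invalidate its cached copies."""
    while pending:
        key, value = pending.pop(0)
        replica.database[key] = value
        replica.cache.pop(key, None)


asia, america = Region("asia"), Region("america")
for region in (asia, america):
    region.database["status"] = "old"

pending = []
print(america.read("status"))   # old (now cached in America)
write(asia, "status", "new", pending)
print(asia.read("status"))      # new
print(america.read("status"))   # old (stale: the change has not arrived yet)
replicate(america, pending)
print(america.read("status"))   # new (the regions have converged)
```

America keeps answering during the replication window, which is the availability side of the tradeoff; the price is that the answer can be briefly out of date.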






4. How Does Facebook Manage Big Data?

4.1 Apache Hadoop

Well, this shouldn’t come as a surprise: Facebook has an insane amount of data that grows every moment, & it has an infrastructure in place to manage such an ocean of data.


Apache Hadoop is a popular open-source framework for managing big data, & Facebook uses it for running analytics, distributed storage & storing MySQL database backups.

Besides Hadoop, there are also other tools like Apache Hive, HBase & Apache Thrift that are used for data processing.

Facebook has open-sourced the exact versions of Hadoop that it runs in production. It has possibly the biggest Hadoop cluster implementation in the world, processing approx. 2 petabytes of data per day in multiple clusters at different data centres.

Facebook messages use a distributed database called Apache HBase to stream data to Hadoop clusters.

Another use case is collecting user activity logs in real-time in Hadoop clusters.





4.2 Apache HBase Deployment at Facebook

HBase is an open-source, distributed database, non-relational in nature, inspired by Google’s BigTable. It is written in Java.

Facebook’s messaging component originally used HBase, running on top of HDFS. The engineering team chose it for the high write throughput & low latency it provided. Being a distributed project, it also offered horizontal scalability, strong consistency & high availability. Now the Messenger service uses RocksDB to store user messages.

HBase is also used in production by other services such as the internal monitoring system, search indexing, streaming data analysis & data scraping.

Migration Of Messenger Storage From HBase To RocksDB

The migration of the Messenger service database from HBase to RocksDB enabled Facebook to serve messages to its users from flash memory as opposed to spinning hard disks. Also, since RocksDB is deployed here as a MySQL storage engine (MyRocks), the service picks up the replication topology of MySQL, which is more compatible with the way Facebook data centers operate in production. This made the service more available & gave it better disaster recovery capabilities.





4.3 Apache Cassandra – A Distributed Wide-Column Store

Apache Cassandra is a distributed wide-column store built in house at Facebook for the Inbox search system. Cassandra was written to manage structured data & scale to a very large size across multiple servers with no single point of failure.

The project runs on top of an infrastructure of hundreds of nodes spread across many data centres. Cassandra is built to maintain a persistent state in case of node failures. Being distributed, it has features like scalability, high performance & high availability built in.
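One way a store like this avoids a single point of failure is by placing every key on several nodes of a hash ring. Cassandra uses consistent hashing for this; the sketch below is a heavily simplified illustration of the idea & not Cassandra's actual partitioner. The Ring class, the use of MD5 for tokens, the node names & the replica count are all assumptions.

```python
import bisect
import hashlib


def token(value):
    """Map a string to a position on the ring (MD5 is used here only for illustration)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring that places each key on several successive nodes."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        self.tokens = sorted((token(node), node) for node in nodes)

    def owners(self, key):
        # Walk clockwise from the key's position and take the next `replicas` nodes.
        start = bisect.bisect_right(self.tokens, (token(key), ""))
        count = min(self.replicas, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
ring = Ring(nodes, replicas=3)
owners = ring.owners("inbox:42")
print("replicas for inbox:42:", owners)

# If the first replica goes down, the key is still held by the other two.
survivors = [node for node in nodes if node != owners[0]]
print("after losing", owners[0], "->", Ring(survivors, replicas=3).owners("inbox:42"))
```

Since every node plays the same role & every key lives on more than one of them, losing a node only moves a slice of the data instead of taking the system down.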





4.4 Apache Hive – Data Warehousing, Query & Analytics

Apache Hive is a data warehousing software project built on top of Hadoop for running data query & analytics.

At Facebook, it is used to run data analytics on petabytes of data. The analysis gives an insight into user behaviour, which helps in writing new components & services & informs the Facebook Ad Network.

Hive inside Facebook is used to convert SQL queries to a sequence of map-reduce jobs that are then executed on Hadoop. It also provides programmable interfaces & implementations of common data formats & types, a store for metadata etc.
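To get a feel for what converting an SQL query to map-reduce means, here is a rough sketch of how a simple GROUP BY count breaks down into a map, a shuffle & a reduce step. This is plain Python running on one machine, not Hive or Hadoop code; the visits rows & the function names are made up for the example.

```python
from collections import defaultdict

# Query being illustrated:
#   SELECT country, COUNT(*) FROM visits GROUP BY country;
visits = [
    {"user": "u1", "country": "IN"},
    {"user": "u2", "country": "US"},
    {"user": "u3", "country": "IN"},
]


def map_phase(rows, column):
    """Emit one (group key, 1) pair per input row."""
    for row in rows:
        yield row[column], 1


def shuffle(pairs):
    """Bring all values for the same key together, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each group's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def group_by_count(rows, column):
    return reduce_phase(shuffle(map_phase(rows, column)))


print(group_by_count(visits, "country"))  # {'IN': 2, 'US': 1}
```

On a real cluster the map & reduce steps run in parallel across many machines, which is what lets the same query shape scale to petabytes.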





5. Presto DB – A High Performing Distributed SQL Query Engine

PrestoDB is an open-source, performant, distributed SQL query engine, primarily written for running analytics queries against massive amounts of data, up to petabytes. It is a query engine rather than a database with its own storage: a single Presto query can combine data from multiple sources, enabling analytics across the organization’s systems.

It has a rich set of capabilities enabling data engineers, scientists & business analysts to process terabytes to petabytes of data.

Facebook uses PrestoDB to process data via a massive batch pipeline workload in their Hive warehouse. It also helps in running custom analytics with low latency & high throughput. The project has also been adopted by other big guns such as Netflix, Walmart, Comcast etc.

The below diagram shows the system architecture of Presto

The client sends an SQL query to the Presto coordinator. The coordinator then parses, analyzes & plans the query execution. The scheduler assigns the work to the nodes located closest to the data & monitors the processes.

The data is then pulled back by the client at the output stage for results. The entire system is written in Java for speed. That also makes it really easy to integrate with the rest of the data infrastructure, as that too is written in Java.

Presto connectors are written to connect it with different data sources.
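The coordinator & worker flow can be sketched roughly as follows. This is a toy model & not Presto's API: the WORKER_DATA splits, the worker names & both functions are hypothetical, & a thread pool stands in for a cluster of worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial

# Query being illustrated:
#   SELECT COUNT(*), SUM(amount) FROM orders WHERE amount >= 10;
# Each worker holds its own split of the (made-up) orders data.
WORKER_DATA = {
    "worker-1": [5, 20, 35],
    "worker-2": [10, 50],
    "worker-3": [7],
}


def worker_scan(min_amount, worker):
    """Runs next to the data: filter the local split and return a partial aggregate."""
    matching = [amount for amount in WORKER_DATA[worker] if amount >= min_amount]
    return len(matching), sum(matching)


def coordinator(min_amount):
    """Plan one task per split, run the tasks in parallel, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial(worker_scan, min_amount), WORKER_DATA))
    count = sum(c for c, _ in partials)
    total = sum(t for _, t in partials)
    return count, total


print(coordinator(10))  # (4, 115)
```

Only small partial aggregates travel back to the coordinator, not the raw rows, which is a big part of why pushing the work to the nodes close to the data pays off.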





6. Beringei: A High-Performance Time Series Storage Engine

Beringei is a time series storage engine & a component of the monitoring infrastructure at Facebook. The monitoring infrastructure helps in detecting the issues & anomalies as they arise in real-time.

Facebook uses the storage engine to store system measurements: product stats, like how many messages are sent per minute; service stats, for instance the rate of queries hitting the cache vs the MySQL database; & system stats, like CPU, memory & network usage.

All the data goes into the time series storage engine & is available on dashboards for further analysis. Across the industry, Grafana is commonly used to create custom dashboards for this kind of analytics.





7. Gorilla: An In-Memory Time Series Database

Gorilla is Facebook’s in-memory time series database, primarily used in the monitoring & analytics infrastructure. It is built to handle failures ranging from a single node to entire regions with little to no operational overhead. The below figure shows how Gorilla fits in the analytics infrastructure.

Since deployment, Gorilla has almost doubled in size twice in an 18-month period without much operational effort, which shows the system is pretty scalable. It acts as a write-through cache for monitoring data gathered across all of Facebook’s systems. Gorilla reduced Facebook’s production query latency by over 70x compared with the previous setup.
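A write-through cache for time series data can be sketched like this. It is a minimal illustration & not Gorilla's design: the WriteThroughSeries class, the 60-second memory window & the list standing in for the long-term store are all assumptions, & the sketch expects points to arrive in timestamp order.

```python
from collections import deque


class WriteThroughSeries:
    """Toy write-through cache: every point goes to the long-term store and to memory."""

    def __init__(self, memory_window_seconds):
        self.window = memory_window_seconds
        self.memory = {}     # series name -> deque of recent (timestamp, value) points
        self.long_term = {}  # stand-in for the durable, slower store

    def write(self, series, timestamp, value):
        # Write through: the durable store always receives the point.
        self.long_term.setdefault(series, []).append((timestamp, value))
        recent = self.memory.setdefault(series, deque())
        recent.append((timestamp, value))
        # Keep only the most recent window in memory.
        while recent and recent[0][0] < timestamp - self.window:
            recent.popleft()

    def read(self, series, start, end):
        recent = self.memory.get(series, deque())
        if recent and start >= recent[0][0]:
            source, points = "memory", recent
        else:
            source, points = "long_term", self.long_term.get(series, [])
        return source, [(t, v) for t, v in points if start <= t <= end]


cpu = WriteThroughSeries(memory_window_seconds=60)
for t, v in [(0, 0.21), (30, 0.35), (90, 0.80), (120, 0.42)]:
    cpu.write("web01.cpu", t, v)

print(cpu.read("web01.cpu", 90, 120))  # recent range, answered from memory
print(cpu.read("web01.cpu", 0, 120))   # older range, falls back to the long-term store
```

Monitoring dashboards mostly ask about the last few minutes or hours, so keeping just the recent window in memory covers most queries while the long-term store still has everything.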





8. LogDevice: A Distributed DataStore For Logs

Logs are the primary way to track the bugs occurring in production; they help in understanding the context & writing a fix. No system can be run without logs. And a system of the size of Facebook, where so many components are plugged in together, generates a crazy amount of logs.

To store & manage all these logs Facebook uses a distributed data store for logs called LogDevice.

It’s a scalable and fault-tolerant distributed log system. In comparison to a file system which stores data as files, LogDevice stores data as logs. The logs are record-oriented, append-only & trimmable. The project has been written from ground up to serve multiple types of logs with high reliability & efficiency at scale.
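Record-oriented, append-only & trimmable can be pictured with a small sketch. This is not LogDevice's API; the TrimmableLog class, its method names & the sequence numbering starting at 1 are hypothetical & only illustrate the three properties.

```python
class TrimmableLog:
    """Toy log: records are only appended at the tail and trimmed from the head."""

    def __init__(self):
        self._records = []
        self._first_seq = 1  # sequence number of self._records[0]

    def append(self, payload):
        """Add one record at the tail and return its sequence number."""
        self._records.append(payload)
        return self._first_seq + len(self._records) - 1

    def read_from(self, seq):
        """Return (sequence number, payload) pairs starting at seq, skipping trimmed records."""
        start = max(seq, self._first_seq) - self._first_seq
        return [(self._first_seq + i, p) for i, p in enumerate(self._records[start:], start)]

    def trim(self, up_to_seq):
        """Drop every record with a sequence number <= up_to_seq."""
        drop = min(max(up_to_seq - self._first_seq + 1, 0), len(self._records))
        del self._records[:drop]
        self._first_seq += drop


log = TrimmableLog()
for event in ["login", "post", "logout"]:
    log.append(event)

log.trim(2)                  # records 1 and 2 are no longer needed
print(log.read_from(1))      # [(3, 'logout')]
print(log.append("login"))   # 4 (numbering continues after a trim)
```

There is no update-in-place anywhere: a reader just remembers the last sequence number it saw & resumes from there, which is what makes this model a good fit for streams of events.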

The kinds of workloads supported by LogDevice are event logging, stream processing, ML training pipelines, transaction logging, replicated state machines etc.





9. Conclusion

Guys, this is pretty much it. I did quite a bit of research on the persistence layer of Facebook & I think I’ve covered all the primary databases deployed at Facebook.

Also, this article will be continually updated as Facebook’s systems evolve. I will also cover the architecture & flows of different components of Facebook such as the messenger, news feed etc. in separate articles.









I am Shivang, here is my LinkedIn profile in case you want to say Hello!




