Google recently published a paper titled "Procella: Unifying serving and analytical data at YouTube". In it they describe a new query processing engine built on top of various primitives proprietary to Google. This paper stood out to me for a number of reasons: this is the SQL query engine that runs YouTube, it has done so for a number of years now, and it contains optimisations that run queries faster than BigQuery, by two orders of magnitude in some cases. Every day this software serves hundreds of billions of queries spanning 10 PB of data.

In this blog post I'll pick apart the paper and compare its descriptions to my understanding of various Hadoop-related technologies.

Please keep in mind this post is speculative and based on academic papers Google has published. I was sub-contracted to build the backend to the Consumer Barometer for Google in 2014 but I didn't have any direct exposure to any of the technologies covered in the Procella paper.

The Query Engine

Procella was built to expose a SQL interface for several non-OLTP workloads in production at YouTube. Google already have Dremel (known commercially as BigQuery), Mesa, Photon, F1, PowerDrill and Spanner, but none of these were a perfect match for YouTube. Moving to a single, comprehensive SQL interface means not needing individual data silos for reporting and dashboards, time-series monitoring with automatic down-sampling, ad-hoc analysis and real-time counters. It also means there is only a single API to learn: Standard SQL. Procella acts much like Presto in that it's a query engine on top of decoupled compute and storage layers. In this case Borg, Google's precursor to Kubernetes, acts as the distributed compute layer, complete with container and job scheduling functionality, while Colossus, a distributed storage solution that shares many parallels with HDFS, acts as the storage layer.

The Storage Layer

No local storage is available to Procella at any point. The only durable storage system made available is Colossus, which operates via network calls rather than via the local Linux file system API. Many operations found in POSIX file systems are unavailable in Colossus. For example, a file can be opened and appended to but its contents cannot be modified afterwards. Also, because at least one expensive network call is required for any read or write operation, file metadata needs to be aggressively cached to avoid performance penalties. These characteristics aren't too dissimilar to HDFS or AWS S3.

Colossus handles encryption, durability, hot-spotting and replication across low-cost, geographically distributed storage. Colossus probably has the same 1.5x storage overhead requirement as HDFS' erasure coding: Google have filed two patents relating to the matter and have published a paper stating Colossus uses erasure coding. An 8-trillion-byte disk drive, once formatted, could offer 7.3 TB of capacity, which would allow for 4.8 TB of data to be stored using a Reed-Solomon-like erasure coding scheme. This is also more trustworthy than simple 3x replication.
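The 1.5x overhead figure can be sanity-checked with a quick calculation. The shard counts below are an illustrative choice of mine (the actual Colossus parameters aren't public); any Reed-Solomon scheme with a 2:1 data-to-parity ratio gives the same overhead.

```python
# Back-of-envelope: usable capacity under a 1.5x erasure-coding overhead
# versus 3x replication. RS(8,4) is a hypothetical parameter choice,
# not one documented for Colossus.
data_shards, parity_shards = 8, 4
overhead = (data_shards + parity_shards) / data_shards  # 1.5x

formatted_tb = 7.3  # formatted capacity of an 8-trillion-byte drive
usable_erasure = formatted_tb / overhead     # ~4.8 TB of user data
usable_replicated = formatted_tb / 3         # ~2.4 TB under 3x replication

print(f"erasure-coded: {usable_erasure:.1f} TB, "
      f"3x-replicated: {usable_replicated:.1f} TB")
```

The same drive holds roughly twice as much user data under erasure coding as under 3x replication, which is why the scheme is so attractive at Google's scale.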

Data Storage Strategies

Tables are used to represent datasets in Procella. Datasets are spread among multiple files stored on Colossus. For each file, zone maps, bitmaps, bloom filters, partition and sort keys, and inverted indices are produced. These statistics are stored in the Metadata Store, which can use either Bigtable or Spanner for its underlying structured storage. This contrasts with Parquet and ORC, which store most of these statistics in the individual files themselves. Given these statistics live outside of the files, there is no need to open each data file individually to see if it should be involved in producing the output of a query. This reduces the networking and I/O overheads compared to how Presto queries data stored in columnar format on HDFS or S3. The paper also mentions these statistics being aggressively cached in LRU-based caches. BTree-style secondary indices are specifically mentioned in the Procella paper as not being supported.

Data is stored by column rather than by row. This, along with the secondary structures above, helps exclude data not relevant to a query's result from being scanned. The only workload in which this won't provide a performance improvement is when the entire row of a record is being examined (such as when being converted to another format or serialised in its entirety for output). Nested and repeating data types are handled by constructing a tree of fields; each field is stored as a separate column on disk.

When data is stored by column, and especially if sorted, the values tend to exhibit more identifiable patterns and allow for better compression ratios. Google are utilising a wide variety of custom encoding schemes rather than relying on more general-purpose schemes like LZW. When data is either ingested or processed it is sampled and its characteristics are compared against a number of encoding schemes.
Predictions are made as to the compression speed and ratio each encoding scheme could achieve, and Procella then decides which to use for each column.

The above columnar storage format is called "Artus" and it's at least the third such columnar format to originate from Google. ColumnIO was first mentioned in the Dremel paper published in 2010 and was the inspiration for ORC and Parquet. ColumnIO was the internal storage format used by BigQuery until it was superseded by Capacitor in 2016.

The columnar format storing the compressed data on disk matches its in-memory representation. This avoids any encoding or decoding work between reading data off disk and using said data in memory. The Data Servers processing query fragments have encoding-aware functions, meaning they can work on compressed data without needing to decompress it first. That being said, for any given dataset more space will be required in memory than on disk. An example dataset given in the Procella paper was 2,406 KB on disk in Capacitor's columnar format and 2,060 KB on disk in Artus, yet required 2,754 KB of RAM when loaded into memory by Procella.
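The benefit of encoding-aware functions can be illustrated with run-length encoding, one of the simpler schemes a columnar format might use. This sketch is mine, not code from the paper: an aggregate is computed directly on the (value, run length) pairs without ever materialising the decoded column.

```python
# Sketch: aggregating a run-length-encoded column without decoding it.
# The encoding and operator here are illustrative; Artus' actual
# encodings are proprietary.
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive duplicates into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def rle_sum(runs):
    """Sum the column straight from its runs: O(#runs), not O(#rows)."""
    return sum(value * length for value, length in runs)

column = [3, 3, 3, 7, 7, 1, 1, 1, 1]
runs = rle_encode(column)            # [(3, 3), (7, 2), (1, 4)]
assert rle_sum(runs) == sum(column)  # same answer, no decompression
```

On sorted, repetitive data the run count can be orders of magnitude smaller than the row count, which is where operating on the encoded form pays off.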

Importing Data into Procella

Tables are created and modified in Procella via standard SQL CREATE and ALTER statements. Batch data is often generated outside of Procella using MapReduce and then fed into it at hourly or daily intervals. If a dataset holds real-time data there are additional options for how to age-out, down-sample and/or compact it. When data is initially loaded from batch, the system does its best to avoid scanning all of it in order to keep ingestion throughput high. Secondary indices can be produced by three different services, so it's not mandatory to create them at ingestion. This is very much the ELT / Data Lake model and not the traditional ETL model.

When real-time data is fed in, it's replicated to multiple servers to avoid any potential data loss due to outages. Queries are exposed to all copies of the data and are smart enough to use the most complete copy of the dataset. Real-time data will move through a lifecycle of memory buffers, checkpoints, write-ahead logs, compaction into larger columnar bundles and, finally, durable storage on Colossus. The query engine can combine data held at the various stages before landing in durable storage with any existing data that is already there.

There is also a "stats serving" mode designed for counting video views on YouTube and returning aggregate values within "milliseconds". In this mode data is loaded into memory immediately by the Data Servers rather than waiting for a request for a particular view count before loading it off disk and into memory. Google do admit this is RAM-intensive, but the overall dataset is relatively small.
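As a rough illustration of what combining in-flight real-time data with durable data looks like at query time, here is a minimal sketch. The names and structure are mine, not Procella's:

```python
# Sketch: serving a count that combines durable data (already compacted
# onto "Colossus") with real-time rows still sitting in a memory buffer.
# Illustrative only; Procella's actual implementation is not public.

durable_counts = {"video_a": 1_000_000, "video_b": 250_000}   # compacted data
realtime_buffer = [("video_a", 1), ("video_a", 1), ("video_c", 1)]  # not yet flushed

def view_count(video_id):
    """Merge the durable total with any not-yet-durable increments."""
    buffered = sum(n for vid, n in realtime_buffer if vid == video_id)
    return durable_counts.get(video_id, 0) + buffered

print(view_count("video_a"))  # 1000002
print(view_count("video_c"))  # 1
```

The key property is that a query never has to wait for compaction: freshly ingested rows are visible immediately and simply merged with whatever has already reached durable storage.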

Query Execution

SQL commands are issued by users against a Root Server. The SQL parser was developed by the Google SQL team and later donated to the Procella project. The Metadata Server provides statistics to the query planner before the query is sent out as fragments to various Data Servers. It is also possible for one Data Server to push a query fragment to another Data Server if need be. Procella tries its best to push any compute operation as close to the data as possible. The RDMA shuffle library used in BigQuery was imported into Procella's execution system and is used for JOINs.

Affinity scheduling is used so that operations on the same data will often go to the same server, giving the limited cache space the greatest likelihood of being hit. The caching schemes used by Procella for YouTube have been successful enough that, even though only 2% of their datasets could fit into RAM, they've achieved a 99%+ file handle cache hit rate and a 90% data cache hit rate.

Procella has a query evaluation engine called "Superluminal". It uses C++ template metaprogramming instead of LLVM when compiling an execution plan. When benchmarked against Google's Supersonic query engine, Google reported a 5x improvement in performance. This result was only about 50% slower than what could be achieved with hand-crafted and RAM-intensive raw C++ arrays.

In order to get the most utilisation out of any one machine in the fleet, the Procella jobs running on Borg are broken up into many small tasks rather than a few large ones. Any fleet running Borg will have a variety of hardware configurations in use across a heterogeneous cluster, so no task should expect consistent performance characteristics given the diverse hardware it could potentially run on. That being said, I suspect there must be a minimum micro-architecture specified for a machine to enter their fleet, and this minimum might be the Haswell micro-architecture.
I suspect this because Procella is designed to take advantage of L1 caches and would need to assume a minimum capacity is available. Superluminal goes out of its way to process data in block sizes estimated to fit into the L1 cache of any given CPU core. In 2018 a screenshot of a Borglet status page was posted on Hacker News (Borg being the computing platform for Procella as well as many other workloads at Google). The page describes the CPU as using Intel's Haswell micro-architecture and running at 2.30 GHz. This architecture generally has 64 KB of L1 cache per core and allows 64 bytes of data per clock cycle to be fed into the L1 cache of any CPU core when using 256-bit AVX instructions. Assuming Google are using a Xeon chip like the E7-8880 v3 with 18 cores, the combined L1 caches could support roughly 2.65 TB/s of throughput per machine before any sort of CPU clock boost is taken into account. Though the capacity is very limited, this is about the fastest throughput of anything you'll find on an Intel-based system.
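That aggregate figure is a back-of-envelope estimate; the assumptions (64 bytes per core per cycle, the 2.30 GHz base clock, and an 18-core part like the E7-8880 v3, which is my guess rather than anything confirmed) are spelled out below:

```python
# Back-of-envelope aggregate L1 fill bandwidth. All inputs are
# assumptions: 64 bytes per core per cycle via 256-bit AVX loads,
# a 2.30 GHz base clock and a hypothetical 18-core Haswell Xeon.
bytes_per_cycle = 64
base_clock_hz = 2.30e9
cores = 18

per_core_bw = bytes_per_cycle * base_clock_hz  # ~147 GB/s per core
aggregate = per_core_bw * cores                # ~2.65 TB/s per machine
print(f"{aggregate / 1e12:.2f} TB/s")          # prints "2.65 TB/s"
```

Turbo clocks would push the number higher still; the point is simply that aggregate L1 bandwidth dwarfs anything else in the memory hierarchy.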

Google's RAM vs Core Counts

In the above screenshot the system's memory capacity is described as 256 GB. The E7-8880 v3 can support up to 1.5 TB. 1.5 TB of any sort of RAM will be expensive, but the E7-8880 v3 had a suggested retail price of $5,900 so I doubt RAM prices were much of a consideration. The RAM throughput on that chip is stated to be 85 GB/s. RAM has a high cost per GB (especially the ECC RAM Google are using) but that throughput won't be found at that capacity in any other type of storage hardware. The above system has 14.2 GB of RAM per CPU core.

Towards the end of the Procella paper they describe five installations of Procella used to power YouTube Analytics, a service that allows creators on YouTube to see metrics relating to their audience and their behaviours. Each of the five instances has 20 TB of RAM and 6,000 CPU cores, giving each CPU core 3.4 GB of RAM. That is much less RAM per core than I'd ever run Presto on; I'd typically aim for at least 8 GB per core. I find it interesting that Google would rather keep the total RAM on any one machine down and instead spread the total RAM capacity across a larger number of CPUs. Ultimately, their balance of CPU, RAM, disk and networking usage patterns in Procella's workloads pointed them away from high-RAM-capacity machines. Google were running version 4.3 of the Linux Kernel in the above screenshot and appear to develop a lot of internal tooling in C++, so I'm assuming they've gone out of their way to make the most of every byte of RAM available to them.
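The two RAM-per-core figures above work out as follows. Treating the 20 TB figure as binary units (GiB) is my reading, not something the paper states, but it is what reproduces the 3.4 figure:

```python
# RAM-per-core for the two configurations discussed above.
ram_gib_machine, cores_machine = 256, 18          # single Haswell box
ram_gib_cluster, cores_cluster = 20 * 1024, 6000  # one 20 TB Procella instance

gib_per_core_machine = ram_gib_machine / cores_machine  # ~14.2 GiB/core
gib_per_core_cluster = ram_gib_cluster / cores_cluster  # ~3.4 GiB/core

print(f"{gib_per_core_machine:.1f} GiB/core vs "
      f"{gib_per_core_cluster:.1f} GiB/core")
```

So a Procella instance runs on roughly a quarter of the RAM per core that the single-machine configuration offers, which underlines how much the engine leans on caching and encoding-aware execution rather than raw memory capacity.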