Protocol Buffers, Avro, Thrift & MessagePack

Perhaps one of the first inescapable observations that a new Google developer (Noogler) makes once they dive into the code is that Protocol Buffers (PB) is the “language of data” at Google. Put simply, Protocol Buffers are used for serialization, RPC, and about everything in between.

Initially developed in early 2000’s as an optimized server request/response protocol (hence the name), they have become the de-facto data persistence format and RPC protocol. Later, following a major (v2) rewrite in 2008, Protocol Buffers was open sourced by Google and now, through a number of third party extensions, can be used across dozens of languages - including Ruby, of course.

But, Protocol Buffers for everything? Well, it appears to work for Google, but more importantly I think this is a great example of where understanding the historical context in which each was developed is just as instrumental as comparing features and benchmarking speed.

Protocol Buffers vs. Thrift

Let’s take a step back and compare Protocol Buffers to the “competitors”, of which there are plenty. Between PB, Thrift, Avro and MessagePack, which is the best? Truth of the matter is, they are all very good and each has its own strong points. Hence, the answer is as much of a personal choice, as well as understanding of the historical context for each, and correctly identifying your own, individual requirements.

When Protocol Buffers was first being developed (early 2000’s), the preferred language at Google was C++ (nowadays, Java is on par). Hence it should not be surprising that PB is strongly typed, has a separate schema file, and also requires a compilation step to output the language-specific boilerplate to read and serialize messages. To achieve this, Google defined their own language (IDL) for specifying the proto files, and limited PB’s design scope to efficient serialization of common types and attributes found in Java, C++ and Python. Hence, PB was designed to be layered over an (existing) RPC mechanism.

By comparison, Thrift which was open sourced by Facebook in late 2007, looks and feels very similar to Protocol Buffers - in all likelihood, there was some design influence from PB there. However, unlike PB, Thrift makes RPC a first class citizen: Thrift compiler provides a variety of transport options (network, file, memory), and also tries to target many more languages.

Which is the “better” of the two? Both have been production tested at scale, so it really depends on your own situation. If you are primarily interested in the binary serialization, or if you already have an RPC mechanism then Protocol Buffers is a great place to start. Conversely, if you don’t yet have an RPC mechanism and are looking for one, then Thrift may be a good choice. (Word of warning: historically, Thrift has not been consistent in their feature support and performance across all the languages, so do some research).

Protocol Buffers vs. Avro, MessagePack

While Thrift and PB differ primarily in their scope, Avro and MessagePack should really be compared in light of the more recent trends: rising popularity of dynamic languages, and JSON over XML. As most every web developers knows, JSON is now ubiquitous, and easy to parse, generate, and read, which explains its popularity. JSON also requires no schema, provides no type checking, and it is a UTF-8 based protocol - in other words, easy to work with, but not very efficient when put on the wire.

MessagePack is effectively JSON, but with efficient binary encoding. Like JSON, there is no type checking or schemas, which depending on your application can be either be a pro or a con. But, if you are already streaming JSON via an API or using it for storage, then MessagePack can be a drop-in replacement.

Avro, on the other hand, is somewhat of a hybrid. In its scope and functionality it is close to PB and Thrift, but it was designed with dynamic languages in mind. Unlike PB and Thrift, the Avro schema is embedded directly in the header of the messages, which eliminates the need for the extra compile stage. Additionally, the schema itself is just a JSON blob - no custom parser required! By enforcing a schema Avro allows us to do data projections (read individual fields out of each record), perform type checking, and enforce the overall message structure.

“The Best” Serialization Format

Reflecting on the use of Protocol Buffers at Google and all of the above competitors it is clear that there is no one definitive, “best” option. Rather, each solution makes perfect sense in the context it was developed and hence the same logic should be applied to your own situation.

If you are looking for a battle-tested, strongly typed serialization format, then Protocol Buffers is a great choice. If you also need a variety of built-in RPC mechanisms, then Thrift is worth investigating. If you are already exchanging or working with JSON, then MessagePack is almost a drop-in optimization. And finally, if you like the strongly typed aspects, but want the flexibility of easy interoperability with dynamic languages, then Avro may be your best bet at this point in time.