TL;DR: If you are considering using an alternative binary format in order to reduce the size of your persisted JSON, consider this: the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method. In our testing, Brotli proved to be very effective for long-term persistence.

Motivation

There are a lot of data serialization formats out there. Perhaps none are more pervasive than JSON: the de facto serialization method for web applications. And, while it certainly isn’t perfect, its convenience and simplicity have made it our format of choice at Lucid. However, we recently undertook a project that made us question whether or not we should be using JSON at our persistence layer.

In order to improve the performance and fidelity of our revision history feature, we decided to start persisting ‘key-frames’ (snapshots) of our document-state data, rather than just the deltas. Our initial plan was simply to gzip the document-state JSON when persisting the snapshots. However, as we started sampling data and crunching the numbers, we realized that within a year or two we would have hundreds of terabytes of data, costing thousands of dollars per month in infrastructure costs. Even if we could reduce the size of our persisted data by only a few percentage points, it would translate to real-world savings. So we decided to investigate alternative serialization and compression methods to find the pair that would minimize the cost of persisting the new data.

Serialization alternatives

JSON is human-readable, relatively concise, simple to understand, and universally supported. But its simplicity and human readability mean it isn’t the most space-efficient format out there.

For example, representing the number 1234.567890123457 takes 17 bytes in UTF-8 stringified JSON, while a binary format could represent the same number as an 8-byte floating-point double. Similarly, false is 5 bytes in JSON but a single byte (or conceivably less) in a binary format. Because our document state includes plenty of booleans and numbers, it seemed like a no-brainer that a binary serialization technique would beat out JSON.
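To make the size gap concrete, here is a quick sketch (Python standard library only) comparing the text and binary encodings of the values above:

```python
import json
import struct

value = 1234.567890123457

# JSON stores numbers as text: one UTF-8 byte per character.
as_json = json.dumps(value).encode("utf-8")
print(len(as_json))  # 17 bytes

# A binary format can store the same value as one IEEE-754 double.
as_double = struct.pack(">d", value)
print(len(as_double))  # 8 bytes

# Booleans show the same gap: "false" is 5 bytes of text,
# while a binary format needs at most a single tag byte.
print(len(json.dumps(False).encode("utf-8")))  # 5 bytes
```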

We decided to test out the following serialization methods [1]:

CBOR

Smile

BSON

MessagePack

Ion (Both Textual and Binary formats)

Compression alternatives

Historically, we have used gzip to compress our document state because it is fast, gets effective results, and works natively in the JVM. However, a couple of years ago we started using Brotli to compress our static front-end JavaScript assets and saw very good results, so we thought it might be a good fit for our document-state JSON as well. We also decided to try XZ, Zstandard, and bzip2.
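Three of these codecs ship in the Python standard library (gzip, bzip2, and XZ via lzma), so you can get a rough feel for them on any JSON payload; Brotli and Zstandard require the third-party brotli and zstandard packages and are omitted from this sketch. The sample document below is invented for illustration:

```python
import bz2
import gzip
import json
import lzma

# A made-up, repetitive document-state payload (real documents are far larger).
doc = json.dumps(
    {"shapes": [{"id": i, "x": i * 1.5, "locked": False} for i in range(1000)]}
).encode("utf-8")

print("raw:  ", len(doc))
print("gzip: ", len(gzip.compress(doc, compresslevel=9)))
print("bzip2:", len(bz2.compress(doc, compresslevel=9)))
print("xz:   ", len(lzma.compress(doc, preset=6)))  # LZMA is the algorithm behind XZ
```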

Methodology

As a test bed, we used our system templates (the Lucidchart and Lucidpress ‘blueprint’ documents that we provide to our users) as sample data. We have about 1,500 templates totaling 133.8 MB of document-state JSON.

For this set of documents, we tried every combination of serialization format and compression algorithm (at their various compression levels [2]). From the tests, we wanted to record three primary metrics:

The total CPU time to convert from JSON, serialize, and compress

The final compressed size

The total CPU time to decompress, deserialize, and then convert back to JSON
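A harness for collecting those three metrics can be as simple as timing each stage with the process clock. The sketch below is our reconstruction of the general shape (the actual test code isn’t shown in the post), using JSON + gzip as the stand-in pair:

```python
import gzip
import json
import time

def measure(doc_json: str):
    """Record the three metrics for one (serialization, compression) pair.

    JSON + gzip stands in here; other pairs slot into the same shape.
    """
    start = time.process_time()
    compressed = gzip.compress(doc_json.encode("utf-8"), compresslevel=9)
    write_secs = time.process_time() - start   # JSON -> compressed CPU time

    start = time.process_time()
    restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
    read_secs = time.process_time() - start    # compressed -> JSON CPU time

    return len(compressed), write_secs, read_secs, restored

size, write_secs, read_secs, restored = measure(json.dumps({"a": [1, 2, 3]}))
print(size, restored)
```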

Results

After running the tests and measuring the data, we ended up with something like this:

| Serialization Format | Compression | Compressed Size (bytes) | JSON → Compressed Time (msec) | Compressed → JSON Time (msec) |
| --- | --- | --- | --- | --- |
| BSON | bzip2 (9) | 17,344,419 | 3,979 | 3,651 |
| CBOR | Uncompressed | 101,739,795 | 492 | 2,310 |
| Smile | Zstandard (0) | 16,476,903 | 501 | 2,312 |
| JSON | XZ (6) | 12,908,440 | 5,850 | 2,195 |
| CBOR | Zstandard (3) | 15,923,174 | 541 | 3,138 |
| CBOR | Zstandard (22) | 13,999,497 | 12,826 | 2,698 |
| Smile | Brotli (9) | 14,704,655 | 3,729 | 2,675 |
| CBOR | Zstandard (9) | 14,625,854 | 704 | 2,877 |
| Textual Ion | gzip | 16,049,903 | 912 | 3,254 |
| MessagePack | Zstandard (-5) | 19,750,507 | 914 | 2,225 |

… 263 more rows …

It was relatively easy to draw a couple of simple conclusions from these results. For example, looking at the uncompressed sizes, Binary Ion was by far the most compact for our datasets. And looking at the compressed sizes, Textual Ion with the highest level of Brotli compression was the smallest.

Most compact binary formats

| Serialization Format | Uncompressed Size (bytes) |
| --- | --- |
| Binary Ion | 63,672,734 |
| Smile | 72,283,777 |
| MessagePack | 96,113,007 |
| CBOR | 101,739,795 |
| Textual Ion | 117,878,664 |
| BSON | 129,823,535 |
| JSON | 133,284,487 |

10 most compact compressed formats

| Serialization Format | Compression | Compressed Size (bytes) |
| --- | --- | --- |
| Textual Ion | Brotli (11) | 11,903,951 |
| JSON | Brotli (11) | 11,999,727 |
| MessagePack | Brotli (11) | 12,194,016 |
| Textual Ion | Brotli (10) | 12,277,394 |
| JSON | Brotli (10) | 12,358,964 |
| MessagePack | Brotli (10) | 12,556,622 |
| MessagePack | XZ (6) | 12,734,272 |
| Textual Ion | XZ (6) | 12,840,804 |
| MessagePack | XZ (5) | 12,843,748 |
| CBOR | Brotli (11) | 12,861,137 |

However, our goal in comparing serialization formats and compression methods was not simply to find the smallest format; it was to minimize our infrastructure costs. So how do we use all of the measured data to find the optimal solution? Are the space savings from Brotli level 11 worth the extra CPU time it takes to compress?

Analysis

While it is impossible to predict perfectly, we can use the measured data to provide a pretty good estimate of how many dollars each method would actually cost Lucid. Our services are deployed in AWS, where we can pay a fixed cost per CPU second by using Lambda, and S3 costs for storing the data are relatively straightforward. So, to calculate an expected cost, we simply need estimates and assumptions for the following:

The cost per GB, per month to store the data in S3

The average cost per CPU second

The expected lifespan of the data

The expected number of times the persisted data will be used (decompressed and deserialized)

With these estimates and measured results, we can now assign an expected cost for every combination of serialization and compression technique and simply choose the one with the lowest expected cost!

Here are the assumptions that we ended up using:

AWS Pricing:

| Item | Price |
| --- | --- |
| Lambda Cost per Second (for a full vCPU) | $0.00002917 |
| S3 Standard Cost (per GB/month) | $0.022 |
| S3 Infrequent Access Cost (per GB/month) | $0.0125 |

Assumptions about our data

| Assumption | Value |
| --- | --- |
| Percent of our snapshots that will end up in S3 Infrequent Access | 95% |
| Expected lifespan of the data | 90 months |
| Number of times (on average) we would need to read (decompress and convert back to JSON) a given snapshot | 2 |
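Putting the pricing and the assumptions together, the expected cost per snapshot reduces to storage plus CPU. The arithmetic below is our simplified reconstruction; it lands in the same ballpark as the published figures but not exactly on them, so Lucid’s actual model likely includes terms not shown here (S3 request pricing, for instance):

```python
CPU_COST_PER_SEC = 0.00002917       # Lambda, full vCPU
S3_STANDARD_PER_GB_MONTH = 0.022
S3_IA_PER_GB_MONTH = 0.0125
IA_FRACTION = 0.95                  # share of snapshots ending up in Infrequent Access
LIFESPAN_MONTHS = 90
EXPECTED_READS = 2

def expected_cost(compressed_bytes: int, write_secs: float, read_secs: float) -> float:
    """Expected lifetime cost in dollars for one measured (format, codec) pair."""
    gb = compressed_bytes / 1e9
    # Blend the two storage tiers by the fraction of data expected in each.
    storage_rate = (IA_FRACTION * S3_IA_PER_GB_MONTH
                    + (1 - IA_FRACTION) * S3_STANDARD_PER_GB_MONTH)
    storage = gb * storage_rate * LIFESPAN_MONTHS
    # One compression pass at write time, plus the expected number of reads.
    cpu = (write_secs + read_secs * EXPECTED_READS) * CPU_COST_PER_SEC
    return storage + cpu

# e.g. the JSON + XZ (6) row from the results table (times in seconds):
print(round(expected_cost(12_908_440, 5.850, 2.195), 5))
```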

And when you run those numbers, you get the following results:

Final expected pricing results

| Serialization | Compression | Expected Cost |
| --- | --- | --- |
| JSON | Brotli (10) | $0.01572 |
| MessagePack | XZ (2) | $0.01579 |
| JSON | XZ (6) | $0.01583 |
| MessagePack | Brotli (6) | $0.01584 |
| MessagePack | Zstandard (15) | $0.01587 |
| MessagePack | XZ (6) | $0.01588 |
| MessagePack | XZ (1) | $0.01588 |
| MessagePack | Brotli (5) | $0.01589 |
| … | … | … |
| JSON | gzip | $0.01870 |
| … | … | … |
| JSON | Uncompressed | $0.14551 |

This shows that, given our measured results and estimated costs, our best bet is JSON serialization with level 10 Brotli compression (its second-highest setting). This represents an expected 16% cost savings over our baseline of JSON and gzip!

Conclusions

We actually tried a variety of assumptions to see if these results held true in different circumstances. Interestingly, JSON was always at the top of the list—and if it wasn’t the absolute best, it was still very competitive (within a couple percentage points of expected cost).

Our analysis reveals a few interesting conclusions that are almost certainly applicable to many others’ circumstances as well.

Binary formats do result in smaller uncompressed file sizes

Compressing the serialized data seems to level the playing field and negate any wins from using a binary format.

Thus, the final compressed size of the data has very little to do with the serialization method, and almost everything to do with the compression method.

Choosing the best compression algorithm is a balancing game between the cost to store the data and the cost to compress the data, but you can choose the right balance according to your expected lifecycle and read patterns.

Recommendations

While they won’t apply to every scenario, the general takeaways and recommendations from the observed data are these:

If the data originated as JSON, and needs to be converted back to JSON in order to be used, then JSON is probably going to be the most cost-effective persistence format as well, so long as you choose the appropriate compression algorithm.

Brotli works really, really well with JSON data. At its higher levels (10 and 11) it can be CPU-expensive, but that cost is worthwhile when the data has a long lifespan. Brotli also has the advantage that it can be served directly to any major modern browser.

For data with shorter lifespans, Zstandard (around level 9) offers much better compression than gzip but at roughly the same CPU cost.

We’ve published the full set of data and assumptions here. You’re welcome to take our measured data set and plug in your own assumptions and costs to see what formats might meet your needs. Of course, our document data won’t necessarily be representative of your data, but this might be a helpful starting point in comparing your options.

Footnotes