Cap’n Proto

Cap’n Proto is another binary serialization format, developed by the author of protobuf v2. The library’s documentation claims it was designed using years of experience with protobuf and feedback from protobuf users.

The cap’n proto documentation also makes a notable claim that the library works “infinity times” faster than protobuf. To me, this claim is total nonsense: protobuf takes a finite time to execute, hence the execution time of cap’n proto would have to tend to zero. The official documentation clearly shows that this is exactly what was meant. But is it possible for some useful work to be completed instantly? The law of energy conservation, together with common sense, says it is not. So what is the point of such a claim? I guess it’s just a marketing move; however, I can hardly imagine the target audience that would believe it.

From a functional point of view, one may think of cap’n proto as an attempt to create an improved protobuf. It provides support for primitive scalar types, enums, lists, groups and so on, much like protobuf does. The library uses schemas whose syntax is very similar to protobuf’s. Other similarities, along with the differences, are described in the schema language documentation.
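For illustration, here is a minimal sketch of a schema; the file ID, struct and field names are hypothetical, chosen only to show the protobuf-like syntax (this schema is reused in the sketches below):

# record.capnp
@0xbf5147cbbecf40c1;  # every .capnp file starts with a unique 64-bit ID

struct Record {
  actorLogin @0 :Text;  # field numbers are explicit, like protobuf tags
  repoName   @1 :Text;
  starCount  @2 :UInt32;
}

struct RecordList {
  records @0 :List(Record);  # wrapper used below to store a batch of records
}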

By now it’s clear that cap’n proto is very similar to protobuf. At this point we could move on to measurements, but there are a couple of things which are definitely worth mentioning: the implementation and the Python support.

To start using cap’n proto with Python, the compiled library itself and the pycapnp Python bindings must be installed. They can be installed separately, or cap’n proto can be installed automatically during the installation of the Python bindings. Even at the installation step it’s possible to get lost, for two reasons:

1. It’s possible to install cap’n proto via the system’s package manager (e.g., on Ubuntu, Ubuntu-based distros and on macOS), but the version of the library will be outdated. And if you install the latest version of the library using the system’s package manager and the latest version of pycapnp via PyPI (with the --force-system-libcapnp flag passed, which is undocumented), you will end up with incompatible libraries.

2. The documentation for pycapnp is very outdated. At the moment it covers version 0.5.4, while the current version of cap’n proto is 0.6.1, and the installation steps it contains are for version 0.5.0. It also includes links to the author’s fork of the official repository; those links redirect to the official repository, but by following the documentation you can still clone the outdated fork instead of the origin. Moreover, the official repository points back to that documentation. Finally, the same documentation is available at two different addresses: capnproto.github.io and jparyani.github.io.

So, if you want to install libcapnp separately from pycapnp, you will most likely try to build both from sources, but you will not be able to compile or use pycapnp because its version will not be compatible with the version of libcapnp. You may figure out that you have cloned the wrong repo only after hours of struggling to compile it and to find the right combination of libraries. That can be horrible.

So, if you decide to use pycapnp, it’s better to just install the package from PyPI, which will automatically download and compile a compatible version of libcapnp.
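With that decision made, the whole installation boils down to a single command:

pip install pycapnp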

The next thing you may want to do is to define and compile a data schema. While doing this, you will find out that:

- Field names must be in camel case and must start with a lower-case letter.
- Unlike protobuf, cap’n proto has no support for the optional and required directives.
- The schema generator simply does not work. It tries to import its own schema definition, which is present in the package but is not compiled. This makes the generator raise an error, preventing you from using it. As a workaround, the schema can be loaded at runtime without being compiled, as shown in the sketch below.
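A minimal sketch of that workaround, assuming the record.capnp schema above is saved next to the script:

import capnp

# Compile and load the schema at runtime, bypassing the broken generator.
schema = capnp.load('record.capnp')

# The loaded module exposes the structs defined in the schema.
message = schema.Record.new_message()
message.actorLogin = 'octocat'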

Obviously, the restriction on the format of field names means that extra field-name conversion work must be done before serialization and after deserialization. Such work can hardly be automated in full, and a custom converter of field names must be applied to each field of each record.
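A sketch of such a converter (a common snake_case/camelCase recipe, not something pycapnp provides):

import re

def to_camel(name):
    # 'actor_login' -> 'actorLogin'
    head, *tail = name.split('_')
    return head + ''.join(part.capitalize() for part in tail)

def to_snake(name):
    # 'actorLogin' -> 'actor_login'
    return re.sub(r'([A-Z])', r'_\1', name).lower()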

Next, the absence of optional fields means that field values cannot be omitted. Cap’n proto does not allow empty values to be used either. This means that you need to use some sentinel value to indicate emptiness: e.g., an empty string can be used for the string type and -1 for numeric types. But this adds extra complexity in understanding which value is real and which is just an indicator of emptiness. For example, -1 is not suitable for unsigned integers. And even if you find values that can serve as emptiness indicators, you will need to set them before serialization and check for them after deserialization.
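A sketch of that bookkeeping for the hypothetical starCount field (the sentinel value is an arbitrary choice; -1 would not fit into a UInt32):

MISSING_COUNT = 0xFFFFFFFF  # sentinel meaning "no value" for a UInt32 field

def set_star_count(message, count):
    # Substitute the sentinel for an absent value before serialization.
    message.starCount = MISSING_COUNT if count is None else count

def get_star_count(message):
    # Map the sentinel back to None after deserialization.
    count = message.starCount
    return None if count == MISSING_COUNT else count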

As for streaming, cap’n proto does not support it the way protobuf does. To save a batch of data, you have to initialize a list of the proper size and then fill it with records by iterating over your data; a record’s fields are filled by manually assigning values to them (see the sketch below). There is support for multi-message files, but the examples look suspicious to me.
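A minimal sketch of such batch writing, using the hypothetical RecordList wrapper from the schema above:

import capnp

schema = capnp.load('record.capnp')
rows = [('octocat', 'hello-world', 42), ('torvalds', 'linux', 7)]

batch = schema.RecordList.new_message()
# The list must be initialized with its final size up front.
records = batch.init('records', len(rows))
for record, (login, repo, count) in zip(records, rows):
    record.actorLogin = login
    record.repoName = repo
    record.starCount = count

with open('data.bin', 'wb') as f:
    batch.write(f)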

Cap’n proto allows data to be stored in packed and unpacked formats. The unpacked format is used by default, and it takes 38.4% more disk space.
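Switching between the two formats is just a matter of which write method you call (continuing the batch built in the previous sketch):

with open('data.bin', 'wb') as f:
    batch.write(f)           # unpacked, the default

with open('data.packed.bin', 'wb') as f:
    batch.write_packed(f)    # packed variant of the same message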

One more notable thing is the message size limit, controlled by the traversalLimitInWords parameter. By default it equals 8 * (2 ** 20), which, according to the comments, stands for 64 MiB (a word is 8 bytes). The comments also state that this limit was introduced for security reasons. To load bigger messages with pycapnp, the limit needs to be increased using the traversal_limit_in_words argument of the read() or read_packed() methods.
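For example, assuming the same record.capnp schema (the limit value here is an arbitrary illustration: 2 ** 27 words is 1 GiB):

import capnp

schema = capnp.load('record.capnp')
with open('data.bin', 'rb') as f:
    # Raise the traversal limit so a message larger than 64 MiB can be read.
    batch = schema.RecordList.read(f, traversal_limit_in_words=2 ** 27)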

Next, just as with protobuf, deserialized records are stored in data structures generated from the schema. It may be OK to use them as-is, but you might also need to convert them into dictionaries so that other libraries can process them.
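pycapnp readers expose a to_dict() method for this:

with open('data.bin', 'rb') as f:
    batch = schema.RecordList.read(f)
    # Convert each lazily loaded record into a plain dictionary
    # while the file is still open.
    rows = [record.to_dict() for record in batch.records]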

Finally, a couple of words about bugs. They are present, and everyone must be aware of them. It seems the claim that the cap’n proto library is infinity times faster than protobuf is true in a sense, because cap’n proto accesses data on demand: it does not load data until you need it. This means that if you “read” serialized data from a file, close the file descriptor and then try to access some record, you will get an error about a bad file descriptor being used. But even if you access all records while the file is open, so that all data is loaded explicitly, there is no guarantee that you will not run into an error. For example, my attempt to load packed data raised a file descriptor error similar to the one described in this issue.
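A sketch of the pitfall, matching the behaviour described above:

import capnp

schema = capnp.load('record.capnp')

f = open('data.bin', 'rb')
batch = schema.RecordList.read(f)
f.close()

# Nothing has actually been read yet; this access touches the
# closed file and fails with a bad file descriptor error.
print(batch.records[0].actorLogin)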

The table below shows the results of resource usage measurements for serialization and deserialization of unpacked data, deserialization of unpacked data with conversion of records to dicts, and serialization of packed data. Measurements for deserialization of packed data are not provided due to the bug described in the previous paragraph.

┌────────────────────────────────────────────────────────┐
│          Table 15 — Cap'n Proto resource usage         │
├────────────┬───────┬───────┬──────┬───────┬────────────┤
│            │ Save  │ Load  │ Load │ File  │ Compressed │
│            │ time, │ time, │ RSS, │ size, │ file size, │
│ Approach   │ s     │ s     │ MiB  │ MiB   │ MiB        │
├────────────┼───────┼───────┼──────┼───────┼────────────┤
│ default    │  30.3 │   1.4 │  231 │ 163.1 │       26.5 │
│ as dict    │   N/A │ 279.2 │ 2753 │   N/A │        N/A │
│ packed     │  31.2 │   N/A │  N/A │ 100.8 │       24.3 │
└────────────┴───────┴───────┴──────┴───────┴────────────┘

As can be seen, loading data into the internal structures takes only 1.4 seconds and 231 MiB of RAM. However, converting them to dictionaries increases load time to 279 seconds and memory consumption to 2.75 GiB of RAM, which is about 199 and 12 times more respectively.

The table below provides information about the compliance of cap’n proto with the other defined criteria.