Apache Avro is a well-known and widely recognized data serialization framework, already in official use in toolkits like Apache Hadoop.

Avro distinguishes itself from its competitors (like Google’s Protocol Buffers and Facebook’s Thrift) through its intrinsic i. simplicity, ii. speed and iii. serialization rate close to 1:1 (i.e. almost no metadata is needed). It offers a set of powerful APIs that make it very easy to serialize and deserialize data; the serialization scheme/algorithm is very effective, since it is fast and introduces almost no byte overhead (here, a comprehensive benchmark gives the big picture).

Around the web one can read that Avro embeds the schema (the abstract data representation) into the serialized byte stream, in order to make deserialization possible anywhere, at any time, by any Avro-powered application. Embedding the schema means that the footprint of Avro’s byte stream can be noticeably larger, on average, than the byte streams of its competitors.
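To see the embedded-schema behavior concretely, here is a minimal sketch using Avro’s standard container-file API (the `Message` record schema and the field values are hypothetical, chosen only for illustration):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class EmbeddedSchemaDemo {
    // Hypothetical schema: a record with a single string field.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Message\","
      + "\"fields\":[{\"name\":\"body\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord record = new GenericData.Record(schema);
        record.put("body", "hello");

        // DataFileWriter produces a container stream: the full JSON schema,
        // a sync marker and block metadata are embedded in the header.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DataFileWriter<GenericRecord> fileWriter =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            fileWriter.create(schema, out);
            fileWriter.append(record);
        }

        // The 5-character payload is dwarfed by the embedded metadata.
        System.out.println("container size: " + out.toByteArray().length + " bytes");
    }
}
```

The embedded schema makes the stream self-describing, at the cost of a header that can easily exceed the payload itself for small messages.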

In many contexts the flexibility given by Avro’s embedded schema is not needed: for example, applications that exchange messages (of a well-defined data type) over the network, using a binary channel, can avoid the size overhead by not embedding the schema.

Avro offers schema-less serialization capabilities by operating directly on its DatumReader and DatumWriter classes. With this kind of serialization, the data type specification is not written into the byte stream: it is assumed that the receiver, or whichever party performs the deserialization, knows the data schema a priori. This technique is used in HDFS (the Hadoop Distributed File System) for network communications.
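A minimal sketch of this schema-less round trip, assuming the standard Avro Java API and a hypothetical single-field record schema:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaLessDemo {
    // Hypothetical message schema: a single bytes payload field.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Message\",\"fields\":"
      + "[{\"name\":\"payload\",\"type\":\"bytes\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Build a record to serialize: a 64-byte payload.
        GenericRecord record = new GenericData.Record(schema);
        record.put("payload", ByteBuffer.wrap(new byte[64]));

        // Serialize: only the datum is written, the schema is NOT embedded.
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(record, encoder);
        encoder.flush();
        byte[] bytes = out.toByteArray();

        // Deserialize: the reader must already know the schema a priori.
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(bytes, null);
        GenericRecord decoded = reader.read(null, decoder);

        // Near-1:1 rate: the payload plus only a small varint length prefix.
        System.out.println("serialized size: " + bytes.length + " bytes");
    }
}
```

Both sides must agree on the exact schema (or use Avro’s schema-resolution rules with a compatible reader schema); the stream itself carries no clue about the data type.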

On my GitHub account, the code of a Data Serializer shows how to adopt schema-less serialization with very few API calls. Moreover, the test class provides a benchmark over multiple cases, from 64 to 8192 bytes per payload (be careful: a Long is written in the message header, so a further handful of bytes has to be taken into account), in order to measure performance. The outcome of the benchmark shows that, on average, a combined serialization-deserialization is performed in a few microseconds. Hereafter, an extract of one of the tests.

Benchmark Summary {
  #tests: 100  #repetitions: 5
  64 bytes payload:    mean[us]: 0.08    size[bytes]: 80.0
  128 bytes payload:   mean[us]: 0.144   size[bytes]: 144.0
  256 bytes payload:   mean[us]: 0.272   size[bytes]: 272.0
  512 bytes payload:   mean[us]: 0.528   size[bytes]: 528.0
  1024 bytes payload:  mean[us]: 1.04    size[bytes]: 1040.0
  2048 bytes payload:  mean[us]: 2.064   size[bytes]: 2064.0
  4096 bytes payload:  mean[us]: 4.112   size[bytes]: 4112.0
  8192 bytes payload:  mean[us]: 8.209   size[bytes]: 8209.0
}
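The shape of such a measurement can be sketched roughly as follows. This is a simplified, hypothetical version of a test harness, not the repository’s actual code: the schema (a Long header plus a bytes payload, mirroring the message layout described above), the field values and the loop counts are all assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SerializationBenchmark {
    // Hypothetical layout: a Long header followed by the bytes payload.
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"Message\",\"fields\":["
      + "{\"name\":\"header\",\"type\":\"long\"},"
      + "{\"name\":\"payload\",\"type\":\"bytes\"}]}";

    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        int tests = 100;                          // assumed repetition count
        GenericRecord record = new GenericData.Record(schema);
        record.put("header", 42L);                // arbitrary header value
        record.put("payload", ByteBuffer.wrap(new byte[1024]));

        long start = System.nanoTime();
        for (int i = 0; i < tests; i++) {
            // serialize
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(record, encoder);
            encoder.flush();
            // deserialize
            BinaryDecoder decoder =
                DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            reader.read(null, decoder);
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("mean round trip [us]: " + (elapsed / 1000.0 / tests));
    }
}
```

For publishable numbers one would also add warm-up iterations (or use a harness like JMH) so that JIT compilation does not pollute the mean.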