The data formats you know have flaws.

Okay — all data formats have flaws, nothing is perfect. But some are better suited for data streaming than others. If we take a brief look at commonly used data formats (CSV, XML, Relational Databases, JSON), here’s what we can find.

CSV — The Almighty

Probably the worst data format for streaming, and the all-time favorite of everyone who doesn't deal with data on a daily basis, CSV is something we all know and have to deal with one day or another.

1/3/06 18:46,6,6A,7000,38.53952458,-121.464
1/5/06 2:19,6,6A,1042,5012,38.53952458,-121.4647
1/8/06 12:58,6,6A,1081,5404,38.53182194,-121.464701

Pros:

Easy to parse… with Excel

Easy to read… with Excel

Easy to make sense of… with Excel

Cons:

The data types of elements have to be inferred and are never guaranteed

Parsing becomes tricky when data contains the delimiting character

Column names (header) may or may not be present in your data

Verdict: CSV creates more problems than it will ever solve. You may save on storage space with it, but you lose in safety. Don't ever use CSV for data streaming.
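The delimiter and typing problems above are easy to reproduce. A small sketch with Python's standard library (the quoted `"6A,revised"` value is made up to force a comma into a field):

```python
import csv
import io

# A row where one field contains the delimiter; naive splitting tears it apart.
raw = '1/8/06 12:58,6,"6A,revised",1081,5404,38.53182194,-121.464701\n'

naive = raw.strip().split(",")               # 8 pieces: the quoted field is split in two
proper = next(csv.reader(io.StringIO(raw)))  # 7 fields: quoting handled correctly

print(len(naive))   # 8
print(len(proper))  # 7

# Even with a proper parser, every field comes back as a string:
# the latitude still has to be converted (and validated) by hand.
print(type(proper[5]))  # <class 'str'>
```

Note that even the "proper" path only fixes the delimiter problem; nothing in CSV itself tells you that column 5 is a float.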

XML — The Dinosaur

XML is heavyweight, CPU-intensive to parse, and largely outdated, so don't use it for data streaming. Sure, it supports schemas, but unless you take pleasure in dealing with XSD files (please reach out), XML is not worth considering. Additionally, you would have to send the XML schema alongside each payload, which is very wasteful of resources. Don't use XML for data streaming!

A typical XML
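Since the figure isn't reproduced here, a small hand-written example gives the flavor (the element names are made up, mirroring the CSV rows above). Notice how much of the payload is markup rather than data, and that the values are still untyped text:

```python
import xml.etree.ElementTree as ET

# A hypothetical sensor reading expressed as XML: tags outweigh the data.
payload = """<reading>
  <timestamp>1/3/06 18:46</timestamp>
  <line>6</line>
  <stop>6A</stop>
  <latitude>38.53952458</latitude>
  <longitude>-121.464</longitude>
</reading>"""

root = ET.fromstring(payload)
print(root.find("stop").text)           # 6A
# Without an XSD in play, the latitude is just a string, as in CSV.
print(type(root.find("latitude").text))  # <class 'str'>
```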

The relational database format — not really a data format

CREATE TABLE distributors (
    did integer PRIMARY KEY,
    name varchar(40)
);

We’re getting somewhere though. It looks kind of nice, has schema support, and treats data validation as a first-class citizen. You can still get runtime parsing errors in your SQL statements if someone in your company drops a column, but hopefully that won’t happen very often.

Pros:

Data is fully typed

Data fits in a table format

Cons:

Data has to be flat

Data is stored in a database, and data definition, storage, and serialization will be different for each database technology.

No schema evolution protection mechanism. Evolving a table can break applications

Verdict: Relational databases have a lot of concepts we desire for our streaming needs, but the showstopper is that there’s no “common data serialization format” across databases. You will have to convert the data to another format (like JSON) before inserting it into Kafka. The concept of “Schema” is great though, so we’ll keep that in mind.
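That "convert to another format first" step can be sketched in a few lines. This uses an in-memory SQLite database and the distributors table from above (the sample row is made up); the JSON string is what would actually be produced to Kafka:

```python
import json
import sqlite3

# In-memory database standing in for "wherever your relational data lives".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE distributors (did integer PRIMARY KEY, name varchar(40))")
conn.execute("INSERT INTO distributors VALUES (1, 'Acme')")  # hypothetical row

# sqlite3.Row lets us turn a row into a column-name -> value mapping.
conn.row_factory = sqlite3.Row
row = conn.execute("SELECT did, name FROM distributors").fetchone()

# The row is fully typed inside the database, but the wire format is ours
# to choose -- here, JSON, losing the schema guarantees in the process.
message = json.dumps(dict(row))
print(message)  # {"did": 1, "name": "Acme"}
```

The conversion works, but note what is lost: `did integer PRIMARY KEY` becomes just a JSON number with no constraints attached.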

JSON — Everyone’s favorite

The JSON data format has grown tremendously in popularity. It is omnipresent in every language, and almost every modern application uses it.

JSON’s popularity is undeniable. https://trends.google.com/trends/explore?date=all&q=JSON

A typical JSON document

Pros:

Data can take any form (arrays, nested elements)

JSON is a widely accepted format on the web

JSON can be read by pretty much any language

JSON can be easily shared over a network

Cons:

JSON has no native schema support (JSON schema is not a spec of JSON)

JSON objects can be quite large because keys are repeated in every message

No comments, metadata, documentation

Verdict: JSON is a popular data choice in Kafka, but it is also the best illustration of how giving your producers too much flexibility and zero constraints lets them change data types and delete fields at will. If you have ever had parsing issues with JSON (the ones I mentioned at the beginning), you know what I mean.
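The flexibility problem is easy to demonstrate. Here are two hypothetical messages from the same producer; nothing in JSON itself stops the second one from changing a field's type or dropping a field, so every consumer has to defend itself:

```python
import json

# Message v1 and a later v2 from the same (made-up) producer.
v1 = json.loads('{"user_id": 123, "amount": "19.99"}')
v2 = json.loads('{"user_id": "123"}')  # type changed, "amount" silently gone

total = float(v1["amount"])            # works for v1
missing = v2.get("amount")             # None -- consumer must handle the gap

# The "same" id no longer compares equal: 123 != "123".
print(v1["user_id"] == v2["user_id"])  # False
```

Both payloads parse without any error; the breakage only surfaces downstream, in whichever consumer trusted the old shape.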

In Summary

As we have seen, all these data formats have advantages and flaws, and their usage may be justified in many cases, but they are not necessarily well suited for data streaming. We’ll see how Avro can do better. A big reason all these formats are popular is that they’re human-readable; as we’ll see, Avro isn’t, because it’s binary.