In my recent interviews, I’ve found that one challenge many companies are trying to get right is scalable data pipelining. I often get questions about how I would combine multiple dirty data sources or account for missing data. After giving this question a lot of thought, I think our quest for the latest, greatest technology stack often overcomplicates the issue: all data can be broken down into bits, and pipelining is just moving those bits around effectively. Instead of spending money and manpower building a data pipeline in a high-level language (Python/NodeJS/Java), why not use battle-tested UNIX utilities for the exact job they were created to perform?

Using UNIX utilities for data pipelining does introduce some constraints on how you handle the data. Here are a few off the top of my head; I’m sure there is a UNIXphilosophy.txt out there that outlines the principles more thoroughly and elegantly:

All information in the data must be expressed explicitly rather than implicitly. For instance, if you have a CSV file where each entry has an index, you must store that index in a field rather than rely on the order of the entries in the file, because many UNIX utilities make no guarantee about preserving order.
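A minimal sketch of making order explicit, using `nl` to prepend an index field to each record (the file path and sample data here are made up for the demo):

```shell
# Hypothetical sample data: two CSV records whose order is only implicit
printf 'alice,30\nbob,25\n' > /tmp/data.csv

# nl numbers every line (-ba), using a comma separator (-s,) and
# minimal width (-w1), so each record now carries its index explicitly
nl -ba -s, -w1 /tmp/data.csv
```

After this, the records can be sorted, merged, or shuffled by any downstream utility and still be restored to their original order by sorting on the first field.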

Each ‘unit’ (entry) of data must be uniquely identifiable and expressed on its own line. This especially applies to carriage returns and newlines. Take a large JSON file as an example: “pretty printing” would not be allowed. For instance:

{
  value: 'Hello'
},
{
  value: 'World'
},
...

would not be acceptable, as this would mess up line-oriented commands like diff. Rather, you would want to do something like:

{ value: 'Hello' },
{ value: 'World' },
...
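With one record per line, line-oriented tools work on the data directly. A small sketch, using two hypothetical snapshots of the same dataset:

```shell
# Two made-up snapshots, one record per line
printf '{ "value": "Hello" }\n{ "value": "World" }\n' > /tmp/a.jsonl
printf '{ "value": "Hello" }\n{ "value": "Earth" }\n' > /tmp/b.jsonl

# diff now reports exactly which records changed
# (|| true because diff exits non-zero when files differ)
diff /tmp/a.jsonl /tmp/b.jsonl || true
```

The same one-record-per-line property is what lets sort, uniq, grep, and friends treat each entry as an atomic unit.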

All of the information should be as compressed as possible. This is just a general rule of thumb that allows UNIX utilities to work optimally. UNIX is all about minimalism.
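Compression doesn’t have to break the pipeline, either: a quick sketch showing that compressed data streams through pipes with no temporary uncompressed file on disk.

```shell
# Compress on the way in, decompress on the way out, all in one stream
printf 'hello\nworld\n' | gzip -c | gunzip -c
```

Tools like zcat and zgrep take this further, letting you keep data compressed at rest and only expand it inside the pipe.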

Use UNIX utilities as microservices, like they were designed to be! This means taking full advantage of everything the modern shell offers (piping, redirection, etc.). Most of the functionality we need for data pipelining is already available in individual UNIX tools, and modern shells have been developed over decades to optimize data flow between them. Data pipelining is only as confusing as you let it be.
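As a sketch of the microservice idea, here is a complete aggregation pipeline with no custom code at all, each utility doing exactly one job (the log lines are invented for the demo):

```shell
# sort groups identical records together, uniq -c counts each group,
# and sort -rn ranks the groups by count: a full aggregation pipeline
printf 'error\ninfo\nerror\nwarn\nerror\n' | sort | uniq -c | sort -rn
```

Each stage could be swapped out or extended (a grep filter, an awk projection) without touching the others, which is the whole point of treating utilities as microservices.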

Trust the user. This is where many, MANY data pipelining solutions fall short: they try to do too much. Data pipelining is never going to be cookie-cutter, by the nature of the beast. You have to expose everything to the user of your solution rather than try to automate everything. In a way, this fits nicely within the UNIX philosophy of creating something that does one thing well. Don’t try to handle every scenario within your solution, or else things will get complicated and, shortly thereafter, go horribly wrong.
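One way to picture “trust the user” in shell terms: a hypothetical stage wrapper (the `run_stage` name is my own invention) that runs whatever command the user supplies instead of hard-coding each cleaning step.

```shell
# Hypothetical wrapper: the pipeline owns the plumbing,
# the user owns the logic of every stage
run_stage() {
  eval "$1"   # trust the user: run exactly what they asked for
}

# The caller decides what each stage does
printf 'b\na\nb\n' | run_stage 'sort' | run_stage 'uniq'
```

This is only a sketch (eval on untrusted input is obviously dangerous), but it illustrates the trade-off: expose the knobs rather than guess what every dataset needs.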

I would argue, however, that all of these constraints would have to be implemented in one way or another no matter what platform you build your data pipeline on top of (if you don’t have a well-structured format for your data, you have to address that with more code).

I have created a short proof of concept here. Please visit it and run the example to see how easy something like this can be.