Your command-line tools may be 235x faster but they don’t have the same features

An old article about Big Data by Adam Drake recently appeared on Reddit and HackerNews, generating some discussion.

It talks about outperforming a Hadoop cluster with plain old Unix shell utilities. It’s an interesting read and I agree with the points made, but in the end I feel it misses the point.

Big Data isn’t just about data. Comparing raw speed on a fixed dataset is like comparing a CPU and a GPU: both are used for computation, but they serve different purposes, so the comparison makes little sense.

So what are we missing here? Let’s look at the three key factors involved when choosing a Big Data solution.

Scalability

The concept of “scalability” refers to the ability to grow (or shrink) in terms of computational power. It has two different aspects: “horizontal scalability”, the ability to add more machines to a cluster, and “vertical scalability”, the ability to add more resources to a single machine.

Of course, traditional non-clustered deployments can only scale vertically, and this is the first consideration to keep in mind. Big Data isn’t about processing “one” batch as in the article; it makes no sense to consider it when you just need one computation, but it becomes relevant when you need to process information consistently.

As the article suggests, if your data fits on a single machine there’s no point deploying a cluster. But what happens when your data keeps growing? There will come a time when you need to scale, and if you can only scale vertically you’ll be limiting your options.

Vertical scaling becomes expensive pretty fast, even accounting for Moore’s law, because the amount of data we generate grows even faster.

Vertical scaling is also problematic because it obsoletes your current hardware. Nowadays, with so many cloud solutions, it has become easier to replace your infrastructure, but that’s not always an option, especially when considering the cost over time.

And finally there’s the concern of failures: on a cluster, if you lose a machine you can still work at a slower rate, but on a single instance, if your hardware breaks you’re out until you replace it, which in the case of physical hardware takes time.

Reliability

The article claims that:

creating a data pipeline out of shell commands means that all the processing steps can be done in parallel. This is basically like having your own Storm cluster on your local machine.

But that’s not even close to reality. A Hadoop or Storm cluster isn’t just “parallel computing”: both frameworks have been built with reliability in mind and have built-in mechanisms that let you deal with failures without stopping the computation and without losing data.

This is extremely important. Once again, if you’re doing a single computation you may tolerate restarting the process from the beginning when it crashes (which is exactly what happens if something in your shell pipeline fails: you’re left with partial results and no way to recover), but if you need to consistently process large volumes of data, such delays may become fatal.
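To make the difference concrete, here’s a minimal Python sketch of the kind of checkpointing a framework gives you for free. This is not how Hadoop or Storm actually implement fault tolerance, and the checkpoint file name and chunk layout are just assumptions for illustration; the point is only that progress is persisted after each chunk, so a crash means resuming, not starting over or trusting partial output.

```python
import json
import os

CHECKPOINT = "progress.json"  # hypothetical checkpoint file, for illustration only

def load_done():
    # Read back which chunks were already processed before a crash, if any.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def mark_done(done, chunk_id):
    # Persist progress after every chunk so a restart can skip finished work.
    done.add(chunk_id)
    with open(CHECKPOINT, "w") as f:
        json.dump(sorted(done), f)

def process(chunk_path):
    # Placeholder for the real computation (here: just count lines in the chunk).
    with open(chunk_path) as f:
        return sum(1 for _ in f)

def run(chunks):
    done = load_done()
    for chunk_id, path in enumerate(chunks):
        if chunk_id in done:
            continue  # already processed before the crash, skip it
        result = process(path)
        print(chunk_id, result)
        mark_done(done, chunk_id)
```

A shell pipeline has nothing equivalent: if one stage dies halfway through, the output already written gives you no reliable way to tell which part of the input was actually processed.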

Performance

So what about performance? Most people are attracted to these technologies by the promise of “blazing fast” results; the truth is a bit more complex.

adrianmonk from Reddit explains it with a beautiful analogy:

It’s a lot like starting up a train. If you just want to carry 50 tons of freight, a semi truck might be able to get it somewhere in 2 hours whereas a train might take 1 day. If you want to carry 5,000 tons of freight, the train can still do it in a day.

This is what happens with Big Data solutions, and in fact it involves not only the volume of data but also the specific calculation and the resources available to compute it; a more useful indicator would therefore be speed as a function of cost.

Given that there are at least three aspects to consider when evaluating speed (and each varies over time), it is impossible to set a fixed size above which something becomes a Big Data use case; it depends entirely on your requirements.
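As a back-of-the-envelope illustration of the train analogy (every number below is an invented assumption, not a measurement), here’s a tiny Python sketch comparing a single machine with no startup cost against a cluster that pays a fixed coordination overhead but scales its throughput:

```python
# Illustrative parameters only: throughput, node count and startup cost are made up.

def single_machine_time(gb, throughput_gb_per_min=2.0):
    # No startup overhead, but throughput is capped by one machine.
    return gb / throughput_gb_per_min

def cluster_time(gb, nodes=20, node_throughput_gb_per_min=1.0, startup_min=15.0):
    # Fixed coordination/startup cost, but throughput grows with the node count.
    return startup_min + gb / (nodes * node_throughput_gb_per_min)

for gb in (2, 50, 500, 5000):
    print(f"{gb:>5} GB  single: {single_machine_time(gb):7.1f} min"
          f"   cluster: {cluster_time(gb):7.1f} min")
```

With these invented parameters the single machine wins easily on a couple of gigabytes and loses badly at a few terabytes; change the numbers and the crossover point moves, which is exactly why no fixed threshold exists.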

Where to go from here

Big Data is a new and exciting field of technology, but as such, there’s still much to learn about it in order to use it properly.

When making a decision, people need to realize that there is no single criterion, and the reasons it may or may not be a good fit depend on your circumstances.

Understand your problem, and the advantages as well as the disadvantages of each approach, before getting caught up in trends. Every tool has its uses, but no tool fits them all.