The toolbox

Have you ever felt that there are too many options that could solve the same problem? This is what our engineering team has been experiencing for the last couple of months. We have been playing with various solutions that might work for our big data initiatives.

Exploring the toolbox

Looking for the right tool?

Taking stock of what we have available is the most important stage in our research. We also follow the same methodology when evaluating each of the options we have gathered: it is not valid to use tool A on one part of the problem and then test tool B on a different part. We need to make sure we are testing against the same problem and answering the same question, so that the results support a fair comparison.

Testing each tool

Do we really need to test every available tool out there? Sometimes we don't. Based on our use cases we have decided that some tools are better suited for the job than others, but there is still some functional overlap among them. For the tools that overlap, we now need to determine what performance advantages each of them can offer.

Apache Spark

We started with Apache Spark, which is one of the main components of our solution. Spark can be used in many different workflows because of its versatility. It also outperforms most other tools for certain kinds of jobs, such as parallel transformations. It works very smoothly in streaming scenarios, and its integration with other tools in the ecosystem is native in most cases. So far, we say yay for Spark.
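To give a feel for the parallel transformations mentioned above, here is a minimal sketch of the classic word-count pipeline in the shape of Spark's RDD API (flatMap, map, reduceByKey). The helper functions and sample data are ours, not Spark's: a real job would call these methods on a pyspark RDD, but plain Python stands in here so the sketch is self-contained.

```python
# Hypothetical mini-pipeline mirroring the shape of Spark's RDD API.
# In real Spark, each stage below runs in parallel across partitions.
from collections import defaultdict

def flat_map(func, data):
    # Analogue of RDD.flatMap: apply func to each element, flatten results.
    return [item for element in data for item in func(element)]

def reduce_by_key(func, pairs):
    # Analogue of RDD.reduceByKey: combine all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    result = {}
    for key, values in grouped.items():
        acc = values[0]
        for v in values[1:]:
            acc = func(acc, v)
        result[key] = acc
    return result

lines = ["big data tools", "big data pipelines"]        # sample input
words = flat_map(lambda line: line.split(), lines)      # tokenize
pairs = [(word, 1) for word in words]                   # map to (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)       # sum per word
print(counts["big"])  # 2
```

The point of the shape is that every stage is a pure transformation over independent records, which is exactly what Spark can distribute across a cluster.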

Apache Hive

And then we moved to Apache Hive. Hive lets us do analytics the way our database developers and ETL engineers are used to. HiveQL is the language used with this tool, and the beauty of it is that HiveQL is quite close to regular SQL, which means changes to our existing SQL code base are minimal. We have another yay for Hive, too.
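To illustrate how small the migration can be, here is a hypothetical example: the table and column names (sales, region, amount) are invented for this sketch, and we embed the queries as Python strings. For a common aggregate like this, the ANSI SQL text often runs in Hive unchanged; Hive-specific syntax mostly appears for features such as partitioned tables or LATERAL VIEW.

```python
# Illustrative only: an existing ANSI SQL aggregate from our code base.
# The sales/region/amount schema is made up for this example.
ansi_sql = (
    "SELECT region, SUM(amount) AS total "
    "FROM sales "
    "GROUP BY region"
)

# For this kind of query, the HiveQL we would submit is the same text:
# zero changes needed, which is the migration story described above.
hiveql = ansi_sql
print(hiveql == ansi_sql)  # True
```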

Apache Kafka

Apache Kafka was created at LinkedIn. It works as a message queue or distributed log system. It can be used to ingest huge data sets into our system, but also to move data out of it. Kafka handles millions of operations per second, making it the first choice for moving data between our systems. Here we have another yay!
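The "distributed log" idea is worth a small sketch. The toy class below is ours, not Kafka's API: producers append records to an ordered log and get back an offset, while consumers track their own offset and poll from it. Real code would use a client library such as kafka-python against a running broker; this stands in so the abstraction is self-contained.

```python
# Toy single-partition log illustrating Kafka's core abstraction.
# Class and method names are hypothetical, for illustration only.
class TopicLog:
    def __init__(self):
        self.records = []  # append-only, like one Kafka partition

    def produce(self, record):
        # Append a record and return its offset in the log.
        self.records.append(record)
        return len(self.records) - 1

    def consume(self, offset):
        # Return all records from offset onward, plus the next
        # offset the consumer should poll from. The broker keeps
        # the data; the consumer only keeps this number.
        return self.records[offset:], len(self.records)

log = TopicLog()
log.produce({"event": "page_view", "user": 1})
log.produce({"event": "click", "user": 2})

batch, next_offset = log.consume(0)  # consumer starts from the beginning
print(len(batch), next_offset)       # 2 2
```

Because consumers own their offsets, the same log can feed ingestion into our system and export out of it independently, which is why Kafka fits both directions of data movement.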

What about enterprise tools for Hadoop deployments?

Hadoop Distributions

In this area we have two main paths. One seems like the obvious choice, since the trend is very clear: everyone is moving to the cloud nowadays, so we should too, right? The other path is to use our current on-premises hardware to host our Hadoop ecosystem. After several discussion sessions, our company decided to take the second path.

Now that we have decided how we want to deploy our distributed systems, we need to pick a Hadoop distribution. A quick look online shows several options available, and each of them can add some value to our organization, so we invited two of the main Hadoop vendors to talk.

First came Hortonworks, a rapidly growing company that keeps its Hadoop platform completely open source and gets paid for support once an enterprise solution has been implemented. We really enjoyed talking to these folks. They are capable and knowledgeable, their platform has the components we want, and the price for their services was acceptable. On the other hand, they lack some of the capabilities we were interested in, but they are still a very good option.

Second came Cloudera, the giant of Hadoop. Their tools were the perfect match for our use cases, though more expensive than the others we tested. Their support just rocks, and their administration tools fit better with the way we do things in our organization.

To be fair, both platforms we evaluated are quite good, with enterprise-ready tools, and both fit most of our use cases. However, Cloudera brings more to the table for a negligibly higher price.

Wrapping up

At this point we have realized that there is no such thing as the right tool for every job. Each use case calls for a different approach, and each task has its own peculiarities, which means generalizations are bad in this context. We have been building on vendor-neutral platforms in order to avoid depending on any single vendor. Our tool set is based on Spark, Hive, and Kafka; each is used in a very different way, but sometimes they work together, increasing the value of each individual tool.

In the end, we will select one of the major Hadoop distributions, but no matter which we choose, our requirements will not be fulfilled by one of them alone. We might end up with a hybrid solution that maximizes the value we create for our customers. While there is no perfect tool, innovation is ultimately what makes business challenges easier to solve. Don’t look for the perfect tool, but the one that offers the most while keeping solutions clean and simple.