Wouldn’t it be nice if business people had the ability to explore their own data by themselves?

Now that we have explained why we need self-service BI environments, let’s go back to our design session.

How we give the business a self-service BI environment

At this point, we have hundreds of gigabytes written to HDFS, but we need to access them efficiently.

There are several tools we could use. Hortonworks provides Apache Hive out of the box. Hive works as a SQL engine that runs on top of HDFS. But wait, our business users don’t know SQL! Because of this, we put MicroStrategy at their disposal so they can drag and click while MicroStrategy generates SQL statements that are executed over our data through Hive on HDFS.

Because the business users will use MicroStrategy as a front-end tool, we could swap Hive (our back-end query engine) for any other processing engine, such as Spark SQL, and they would still perform the same tasks on their end without problems, hopefully.
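Conceptually, swapping the back end only means pointing the front-end tool at a different JDBC endpoint while the drag-and-click experience stays the same. A minimal sketch of that idea (host names are made up, and the ports are the usual defaults for HiveServer2, the Spark Thrift Server, and Impala, not values from our actual cluster):

```python
# Hypothetical back-end registry; hosts are placeholders, ports are
# the common defaults for each engine (assumptions, not our config).
ENGINES = {
    "hive":      "jdbc:hive2://hive-host:10000/default",
    # the Spark Thrift Server speaks the same HiveServer2 protocol
    "spark-sql": "jdbc:hive2://spark-thrift-host:10000/default",
    "impala":    "jdbc:impala://impala-host:21050/default",
}

def connection_url(engine: str) -> str:
    """Return the JDBC URL the front end should use for a given back end."""
    return ENGINES[engine]

print(connection_url("impala"))
```

Switching engines is then a one-line configuration change on the front-end side; the generated SQL stays the same as long as the new engine accepts the same dialect.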

Performance

With the Hortonworks platform we experienced a lot of performance issues when using Hive. That was one of the main arguments for moving to the next vendor we tried out.

Apache Spark worked well, but some team members were not happy about writing Scala code every time a new data feed was added (even though that whole process can be automated, which makes the argument no longer valid).

Apache Spark has become one of the most popular data processing frameworks out there; we definitely have to come back to it.

Cloudera, a Leading Hadoop Vendor

One day, the guys from Cloudera appeared at the office and we were introduced to their platform.

To keep the story of our experience with Cloudera short, I have to say that the platform is truly at the enterprise level, something we were looking for. All components were up to date, we got their customer support from day one, and they worked with us to set up the cluster right from the beginning. For us, that was just impressive after our experience with the previous vendor.

The management tools were easy to use and understand. The Cloudera tools allowed us to work fast and discover the advantages of their platform with ease.

Cloudera has the same components as Hortonworks and also adds a new one that we were quite interested in: Apache Impala.

Impala is the fastest SQL-compliant tool we have seen so far; it was quite a bit faster than Hive. Impala integrates with MicroStrategy just fine, so our front end does not have to change at all.

The Common Problem

On both platforms, Hortonworks and Cloudera, we faced one common challenge: how do we expose new data that we haven’t defined a schema for?

The process is the following:

Open the feed file (XML, JSON, CSV, …) manually.

Define a table that points to the location where the feed lives.

Tell our query engine (Hive, Impala) to use the defined table.

Start querying the new data.
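To make the manual part of these steps concrete, here is a hedged sketch of step two: generating a Hive-style external table definition from a CSV header. The table name, path, and all-STRING typing are illustrative simplifications; a real feed needs proper column types.

```python
import csv
import io

def external_table_ddl(table: str, location: str, header: str) -> str:
    """Build a Hive-style CREATE EXTERNAL TABLE statement from a CSV
    header line. Every column is declared STRING for simplicity."""
    columns = next(csv.reader(io.StringIO(header)))
    cols = ",\n  ".join(f"`{c.strip()}` STRING" for c in columns)
    return (
        f"CREATE EXTERNAL TABLE {table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}';"
    )

# hypothetical feed: a sales extract dropped under /data/feeds/sales
print(external_table_ddl("sales_feed", "/data/feeds/sales", "id,region,amount"))
```

Even with a generator like this, someone still has to open the file, eyeball the header, and run the DDL, which is exactly the manual loop described above.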

The question is: who is going to do this? No one wants to; it should be automated. But sometimes the business wants to look at files that are completely new to everyone (extractions from different sources that they generate and use manually), and there was no way to automate that. The business depends on developers following the previous process every time it wants to do these operations. Our self-service BI idea was not possible; it was incomplete, and these two platforms were not enough for us.

The simplicity of MapR

The truth is that we were skeptical when the guys from MapR showed up. However, by the end of their talk, I think we were kind of convinced that they had our solution.

Even though this vendor is not the most popular one in the Hadoop world, what they showed us matched our requirements perfectly, so we decided to give it a try.

Setting up the cluster was as easy as with Cloudera; I don’t think there were any problems. However, the management console was not as good as the one in the Cloudera cluster. It was a little more difficult to navigate and understand.

The Advantages of Being POSIX Compliant

MapR is POSIX compliant! That means our data path gets significantly simplified. Having a POSIX platform means we can create a volume inside the cluster and mount it in our Linux or Windows environment. The volume appears as any other drive on our computer, and we can read from and write to it as we do with any other attached storage, while in fact we are reading from and writing to distributed storage (our MapR cluster).

Now we can remove Kafka, and we can remove the Spark job that aggregates messages and writes them to HDFS. With that, we also remove the complexity of all those moving parts in the middle of our data path. Now we have fewer components to worry about and fewer hardware requirements.

Part of our initial design was streaming messages to Teradata. We can still do that by changing our Spark streaming source from Kafka to the file system, and everything else just works.

In an initial data movement test, we executed an rsync command between a network share and a mounted drive pointing to the MapR file system. We got a 100 Mb/s transfer rate between our environment and the MapR cluster. For us, that was more than enough.
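A quick back-of-envelope check on what that rate means in practice (assuming the 100 Mb/s figure is megabits per second, i.e. about 12.5 MB/s, and using a hypothetical 100 GB feed as the payload):

```python
# Back-of-envelope transfer time at the observed rsync rate.
rate_bytes_per_s = 100e6 / 8   # 100 megabits/s -> 12.5e6 bytes/s (assumption)
payload_bytes = 100e9          # a hypothetical 100 GB feed

seconds = payload_bytes / rate_bytes_per_s
print(f"{seconds:.0f} s (~{seconds / 3600:.1f} h)")  # 8000 s (~2.2 h)
```

Roughly two hours for a hundred gigabytes over a plain mounted drive, with no ingestion pipeline in between, which is why it felt like more than enough for our use case.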

How MapR solves our Self-Service BI Problems

MapR brings a new tool to the table in addition to the ones found in the other vendors: Apache Drill, a tool that auto-discovers schemas from different sources and exposes them without any previous registration. This is the tool we were looking for.

Yes, we had a lot of problems with it, mostly related to performance. Some team members dedicated hours to working them out, and they did. We can get results back as fast as in Impala, with the addition of auto-discovered schemas.

Drill exposes data to MicroStrategy in the same way Hive and Impala do, so it can be used in the same fashion.

Because we can mount different volumes, business people can now have their own space in the MapR cluster, where they can copy and paste the data they want to explore (streaming and automatic feeds are handled by automated processes such as Spark jobs). If someone creates an extracted file and wants to correlate it with another one that comes from an automatic feed, they only need to:

Copy the file to MapR using copy / paste via the mounted volume.

Open MicroStrategy and query the new data.

Behind the scenes,

Drill will pick the file up.

Drill will infer the schema so it can be exposed to MicroStrategy.

MicroStrategy will see this file, and the user can join it to the other ones already in the cluster.
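The inference step can be illustrated with a deliberately naive toy in plain Python: sample a few rows and guess one type per column. This is only in the spirit of what Drill does; the real engine handles nested JSON, schema changes mid-file, and much more.

```python
# Toy schema auto-discovery: guess BIGINT / DOUBLE / VARCHAR per column
# from sample values. Column names and data here are made up.
def infer_type(values):
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    if all(is_int(v) for v in values):
        return "BIGINT"
    if all(is_float(v) for v in values):
        return "DOUBLE"
    return "VARCHAR"

def infer_schema(header, rows):
    columns = list(zip(*rows))  # pivot sample rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema(["id", "region", "amount"],
                      [["1", "east", "10.5"], ["2", "west", "3"]])
print(schema)  # {'id': 'BIGINT', 'region': 'VARCHAR', 'amount': 'DOUBLE'}
```

The key difference from the manual process on the other platforms is that nobody has to run this by hand: the query engine does it on the fly, per file, at read time.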

Small Files Problem

The small files problem is a big deal in Hadoop. On Hortonworks and Cloudera it was something we really needed to take care of; it was the main reason behind the aggregation process we had. We aggregated many small files into one big file so we could avoid the small files problem by reducing the number of files written to HDFS.
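The aggregation idea itself is simple enough to sketch: merge many small message files into a single large file before it lands on HDFS. The file names and counts below are illustrative; our real job did this with Spark, not plain Python.

```python
# Minimal sketch of small-file aggregation: concatenate many small
# message files into one large batch file (illustrative names/sizes).
import os
import tempfile

def aggregate(small_files, out_path):
    """Concatenate many small files into a single large one."""
    with open(out_path, "wb") as out:
        for path in small_files:
            with open(path, "rb") as f:
                out.write(f.read())
    return os.path.getsize(out_path)

# demo with throwaway temp files standing in for streamed messages
tmp = tempfile.mkdtemp()
parts = []
for i in range(1000):
    p = os.path.join(tmp, f"msg-{i}.json")
    with open(p, "w") as f:
        f.write('{"event": %d}\n' % i)
    parts.append(p)

size = aggregate(parts, os.path.join(tmp, "batch-0001.json"))
print(f"1000 small files -> 1 file of {size} bytes")
```

One thousand tiny files become a single write to the distributed file system, which is exactly the pressure relief HDFS NameNodes need.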

MapR does not use HDFS; instead, they have implemented MapR-FS, a file system that exposes an HDFS interface so it can be used by any tool designed to work with HDFS. It can also be accessed through POSIX commands, which lets it work like any other file system compatible with most operating systems.

In MapR-FS, we can have millions of small files without problems. The internal specifications of this file system and how to work with it can be found here. MapR has done an incredible job in this area, and we can all benefit from it.

Apache Spark

MapR, in the same way as Cloudera, relies heavily on Spark as one of the main components of its solution. Spark is used extensively as a data processing framework, complementing interactive query engines such as Impala and Drill. Spark is by far one of the most used components in any big data solution because of its versatility across all kinds of tasks.

Ending thoughts about MapR

As we can see, MapR not only simplified our solution but also offered a different way to attack the problems we have. MapR can be seen as an extension of our environment rather than a separate environment that we need to architect in order to move and process data within it. Even though MapR’s management tools are a little behind the ones offered by Cloudera, they work as they should, keeping our environment stable and always working. MapR helps us eliminate unnecessary processes so we can focus on the data and how to work with it, and that is what we really need.