My company is pretty new to the Big Data world, so part of the learning process has been trying, failing, and retrying, which is fine as long as we keep an open mind about change.

One of the problems we have been facing is how to move an enormous amount of data from MSSQL Server into our Hadoop cluster. For reasons we won’t go into this time, we have a SQL table that stores raw XML as plain text inside a column. Let’s call this particular column “message”.

At some point we will be ready to switch our environment so that it writes to Hadoop only through technologies such as Apache Kafka, but when that happens we must migrate the historical data we have in SQL. For context: we generate 49 GB of pure XML every day, and each XML document is about 7 KB. As a reminder, all of these XML documents are written to SQL as plain text in the “message” column of a certain table.

My team has been working on moving the historical data into Hadoop in a reliable way. We consider ourselves a .NET shop, so I sat down and in a few hours I had an application that reads from SQL and writes to Kafka, from where another application, this time written in Scala and running on Apache Spark, reads the XML documents and writes them to the Hadoop Distributed File System (HDFS).
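
To make the pipeline concrete, here is a minimal sketch of the Spark side of that first design, assuming Spark Structured Streaming with the spark-sql-kafka connector; the topic name, broker address, and HDFS paths are made up and our real app differs in the details:

import org.apache.spark.sql.SparkSession

object KafkaXmlToHdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaXmlToHdfs").getOrCreate()

    // Each Kafka record's value is one raw XML document coming from the .NET reader.
    val xml = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092")
      .option("subscribe", "xml-messages")
      .load()
      .selectExpr("CAST(value AS STRING) AS value")

    // Write each document as one line of a text file in HDFS.
    xml.writeStream
      .format("text")
      .option("path", "hdfs:///data/xml")
      .option("checkpointLocation", "hdfs:///data/xml-checkpoints")
      .start()
      .awaitTermination()
  }
}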

Because we are very committed to Test Driven Development (TDD), I wrote the functional tests alongside the production code, then built, tested, and ran everything against a small, controlled data set. As expected, we ended up with a file in HDFS containing the same data as our original data set.

Next, we tested with a bigger data set, still a scenario quite far from our production data set, but still a valid one, since from here we can do some math and estimate how long it will take with our production data.

Disappointment! It took over an hour to move half a million records from SQL to Kafka. Some code changes followed to optimize the reads from SQL. Build, test, run. The result came back close to the first one; disappointed again.

There must be a better way to do this! So Sqoop came into the game.

Apache Sqoop is a tool built with data movement in mind, so let’s give it a try. I sat down, scanned the documentation, ssh’d into my Hadoop master node, and ran:

>sqoop import --connect "jdbc:sqlserver://10.202.6.181:1433;database=SomeDatabase" --username someUser --password somePass --table messages --hive-delims-replacement "" --target-dir /sql/imports

The command is quite simple: it connects using a JDBC driver for SQL Server, and everything else is quite self-explanatory except for the --hive-delims-replacement "" part. That option removes the newline ("\n") characters from the XML, since we want one XML document per line in the output file. This is the only way the file can later be used for map/reduce jobs in Hadoop. The import of the data set (500,000 records) finished in less than a minute. WOW! Quite impressive, but we still need to do some transformations to the data, since every message that comes in includes ids, timestamps, and other fields we don’t need.

After importing the data set, I wrote another Spark app (in Scala) to do the transformations and produce a single file, since Sqoop outputs one file per map task it runs; remember the work is distributed across the cluster, so many nodes are working and each of them has its own output. The transformations in Spark were quite fast (I am particularly impressed with Spark), but the aggregation step was not, since it has to bring all the records together in order to write them out as one file.
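
For the curious, this is roughly what that cleanup app looked like, as a minimal sketch; the input path, the comma delimiter, and the assumption of three leading columns before the XML payload are guesses at the Sqoop output layout rather than our exact schema:

import org.apache.spark.{SparkConf, SparkContext}

object SqoopOutputCleanup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SqoopOutputCleanup"))

    // Sqoop writes one part-m-* file per map task under the target dir.
    val raw = sc.textFile("/sql/imports")

    // Split at most 4 ways so commas inside the XML payload stay intact,
    // then keep only the last field (the message), dropping ids, timestamps, etc.
    val messages = raw.map(line => line.split(",", 4).last)

    // coalesce(1) is the slow part: every record must land in a single
    // partition so the result comes out as one file.
    messages.coalesce(1).saveAsTextFile("/sql/imports-clean")
  }
}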

I went back to the Sqoop docs and found some more options for the import command. Next, I ran this:

>sqoop import --connect "jdbc:sqlserver://10.202.6.181:1433;database=SomeDatabase" --username someUser --password somePass -m 1 --table messages --hive-delims-replacement "" --target-dir /sql/imports

The only change was the

-m 1

param, which tells Sqoop to use only one mapper, so the output is just one big file. We no longer need a second app to aggregate Sqoop’s outputs, but wait, we still need to do transformations on the data and extract the message part, which is the only field we need.

Back to the docs. There seems to be no way to do transformations in Sqoop while importing; however, we can import using a custom query, so I tried:

>sqoop import --connect "jdbc:sqlserver://10.202.6.181:1433;database=SomeDatabase" --username someUser --password somePass -m 1 --query 'select message from messages where $CONDITIONS' --hive-delims-replacement "" --target-dir /sql/imports

We are now selecting exactly what we want, so the output file will have one XML message per line. The $CONDITIONS token in the WHERE clause seems to be used by Sqoop to control the reading from SQL. Remember that the import runs in parallel across the cluster, so Sqoop needs a way to control which job is reading which part of the data set.
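
As an illustration (and an assumption on my part rather than something we ran), the same query import could be parallelized again by giving Sqoop a split column with --split-by; each mapper then gets $CONDITIONS replaced with a range filter over that column, something like id >= 0 AND id < 125000. Note that the split column has to appear in the query, so this variant would pull the id along with the message:

>sqoop import --connect "jdbc:sqlserver://10.202.6.181:1433;database=SomeDatabase" --username someUser --password somePass -m 4 --split-by id --query 'select id, message from messages where $CONDITIONS' --hive-delims-replacement "" --target-dir /sql/imports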

This time the command took longer, and by that I mean about 40 seconds longer. After some calculations, the team estimated that it would take, theoretically, about two hours to import 8 months of data (50 GB/day). We consider the process quite fast, especially if we compare it with our previous .NET solution.

We all agree that Sqoop just rocks! It is fast, easy to use, distributed, and fault tolerant. More than that? It simply fit our requirements.

Hopefully this post will be useful for those starting out in the Hadoop ecosystem. For us it has been a very intense learning experience, so here I am, giving back whatever I have taken in along the way.