In the last blog post, I discussed why there is a lot of experimentation in the big data world, and also why most big data experiments never make it into production. This was famously noted in a late 2016 Gartner press release which stated, “Only 15 percent of businesses reported deploying their big data project to production.”

In this blog post, I will walk through what you can do to use big data automation to overcome the top 5 technical challenges that block organizations from fully taking advantage of big data. The good news is that you no longer need an army of experts to make this all work. There has been a ton of investment in the space to automate away the complexity and make it possible to build end-to-end big data pipelines with little-to-no big data or Hadoop expertise.

If you haven’t read the previous post, there is no need to go back and read it. I have included the problem and the response all together in this post. Here are our top 5 technical reasons that big data projects don’t make it into production, and what you can do about each:

Challenge 1: Can’t load data fast enough to meet SLAs.

While tools like Sqoop support parallelization for data ingest to get data from legacy sources into a data lake, you need an expert to make it work. How do you partition the data? Do you need to run 10 mappers or 20? How do you know? If you can’t properly parallelize the ingest, tasks that could be done in an hour can take 10 to 20 times longer. The problem is that most people don’t know how to tune this properly.
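To see why this tuning takes expertise, here is a minimal Python sketch of the decision a parallel ingest tool has to make: how to carve a table's key range into one slice per mapper. It is an illustration, not Sqoop's actual implementation (Sqoop exposes the same two knobs as `--split-by` and `--num-mappers`). The function name `split_ranges` and the integer key are assumptions made for the example.

```python
def split_ranges(min_key: int, max_key: int, num_mappers: int) -> list[tuple[int, int]]:
    """Split the inclusive key range [min_key, max_key] into contiguous slices.

    Hypothetical helper: each slice would be handed to one parallel mapper.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_key < min_key:
        raise ValueError("max_key must be >= min_key")

    total = max_key - min_key + 1
    mappers = min(num_mappers, total)      # never create an empty slice
    base, extra = divmod(total, mappers)   # the first `extra` slices get one more key

    ranges = []
    start = min_key
    for i in range(mappers):
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Calling `split_ranges(1, 100, 4)` yields four slices of 25 keys each. An even split like this only balances the work when key values are spread uniformly. If most rows cluster in one slice, that one mapper becomes the bottleneck, which is exactly the kind of thing an expert, or an automated tool, has to detect.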

Solution: First of all, don’t hand-code a solution here. Many vendors tackle this problem on top of Hadoop, so you don’t have to write a bunch of code to solve the data ingest problem. You have plenty of choice in this area, so under no circumstances should you be hand-coding.

So when you are evaluating vendors, make sure to consider how much automation they provide to minimize or eliminate hand-coding altogether. Do they connect to all of the data sources you care about? If they don’t automate the entire process, how difficult are the areas they don’t automate? Does the vendor provide fast, parallel, native path access to your sources, or is it JDBC? Also watch out for pretty user interfaces that look like they don’t require coding but, when you double-click on an icon, reveal a bunch of ugly code that you will have to write yourself.

Challenge 2: Can’t incrementally load data to meet SLAs.

Most organizations aren’t moving their entire operations onto a big data environment. They move data there from existing operational systems to perform new kinds of analysis or machine learning. This means that they need to keep loading new data as it arrives. The problem is that these big data environments don’t support the concept of inserts, updates, or deletes. This means you either have to reload the entire dataset again (see Challenge 1 above) or code your way around this classic change data capture problem.

Solution: Once again, the solution is to automate the process, and many of the vendors who automate the ingest problem help with this as well. Note that you need to be able to deal with two challenges here. The first is change data capture on the source. You have to have a way to identify that new rows or columns have been added to the source system and then move just those changed rows or columns into your data lake. The second challenge is handling the merge and syncing of the new data into the target big data system, which, once again, doesn’t support the concept of inserts, updates, or deletes. That means that whichever ingest vendor you choose, it had better take care of this issue for you as well. Note that as of this writing there are some new open source technologies that offer support for these concepts, but they are not yet very mature and not very good.

That said, there are a lot of commercial tools that now address the incremental ingestion of data and fully automate this process. Make sure to look closely at the amount of effort that is required to configure these solutions. You should only need to choose the approach you want to take to monitor changing data on the source (e.g. log-based, query-based, or timestamp-based). In addition, the solution should be able to monitor for new rows being added. If a column is added to a table on the source you are ingesting, the solution should detect that too, automatically add the new column to the ingest process, and merge it properly into the data lake. This is typical of the kind of automation that is now available in the market.
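To make the two halves of the problem concrete, here is a minimal Python sketch of the timestamp-based approach: pull only the rows changed since the last load, then merge them into the target. The `updated_at` column, the `is_deleted` soft-delete flag, and all function names are hypothetical, and a plain dictionary stands in for the data lake table. A real tool would rewrite files or partitions rather than update a dict.

```python
def extract_changes(source_rows, last_watermark):
    """Change capture: return only the rows modified after the last load.

    Assumes every source row carries a comparable `updated_at` value
    (for example a datetime); that column name is an assumption.
    """
    return [row for row in source_rows if row["updated_at"] > last_watermark]


def merge_changes(target, changes, key="id"):
    """Merge: upsert changed rows and drop rows flagged as deleted.

    `target` maps primary key -> row. A new dict is returned and the
    original is left untouched.
    """
    merged = dict(target)
    for row in changes:
        if row.get("is_deleted"):
            merged.pop(row[key], None)   # delete (no error if already absent)
        else:
            merged[row[key]] = row       # insert or update
    return merged


def next_watermark(changes, last_watermark):
    """Advance the watermark to the newest change seen, if there was one."""
    return max((row["updated_at"] for row in changes), default=last_watermark)
```

Each run would call `extract_changes`, then `merge_changes`, then store the result of `next_watermark` for the following run. Note that a pure timestamp approach only sees deletes if the source marks them somehow, which is one reason log-based capture exists as an alternative.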

Challenge 3: Can’t provide reporting access to data interactively.

Imagine you have 1000 BI analysts, and none of them want to use your data models because they take too long to query. Actually, you only need one data analyst to make this unbearable. This is a classic problem with Hadoop and is the reason why lots of companies only use Hadoop for preprocessing and applying specific machine learning algorithms but then move the final data set back to a traditional data warehouse for use by a BI tool. Regardless, this adds yet one more step in the process that gets in the way of successfully completing a big data project.

Solution: Once again, lots of companies provide solutions that take data in HDFS or Hive and generate OLAP cubes, which can then be accessed from visualization tools like Tableau via JDBC/ODBC. All of these solutions operate on the same basic principle: they pre-calculate the cube and then leverage the distributed computing power of the cluster to present it to a BI visualization layer. The details may differ, but the basic concepts are the same, and they turn a Hadoop-based environment, which isn’t known for being good at interactive queries, into one that can serve them.
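The pre-calculation principle can be sketched in a few lines of Python. This toy version sums one measure for every combination of dimension values up front, so that each later query is a single lookup instead of a scan. The names `build_cube` and `query_cube` are made up for the example, and it runs on one machine, whereas the products described above distribute this work across the cluster.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate `measure` for every subset of `dimensions`.

    Keys are frozensets of (dimension, value) pairs, so the order in which
    a query names its filters does not matter. The empty key is the grand total.
    """
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = frozenset((d, row[d]) for d in dims)
                cube[key] += row[measure]
    return dict(cube)


def query_cube(cube, **filters):
    """Answer an aggregate query with one dictionary lookup (0.0 if no match)."""
    return cube.get(frozenset(filters.items()), 0.0)
```

With a cube built over `region` and `product`, `query_cube(cube, region="east")` returns the pre-computed total for that region without touching the raw rows. The trade-off is storage and build time: every row contributes to 2^n keys for n dimensions, which is why building the cube is the part that benefits from cluster-scale compute.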

Challenge 4: Can’t migrate from test to production.

Many organizations have been able to identify the potential for new insights from data scientists working within their sandbox environments. Once they have identified a new “recipe” for analytics, they need to move from an individual data scientist running this analysis in a sandbox to a production environment that can run it every day. Moving from dev to production is a complete lift-and-shift operation that is generally done manually. And while the data pipeline ran just fine on the dev cluster, it now has to be re-optimized on the production cluster. This tuning can often require significant rework to get the pipeline to perform efficiently. This is especially true if the dev environment is in any way different from the production environment.

Solution: The challenge in this case is that there isn’t a long list of vendors who actually tackle this problem. There are a lot of “data prep” applications out there that are great for data scientists who are basically mining the data and prototyping potential “recipes” that could be used for decision making. But once the recipes are discovered, these applications leave it as an exercise for the user to convert the query, analytic, or machine learning algorithm into a repeatable process that can be run continuously at scale.

The obvious answer to this challenge is, once again, automation, since this feature set has been available for a long time in "old-fashioned" data warehouse environments. You should look for capabilities that automate the process of promoting a project from dev to test to production. Along the way, the tooling should automatically adjust and optimize your data pipelines to take advantage of the size of the production cluster. No recoding or reimplementation of the pipeline should be required. This means that the same self-service that data scientists are taking advantage of for data discovery should be delivered all the way through to the push to full production. This is what the data warehousing world was used to, and there is no reason that you shouldn't expect the same capabilities in the big data world.

Challenge 5: Can’t manage end-to-end production workloads.

Most organizations have focused on tooling up so their data analysts and scientists can more easily identify new insights. They have not, however, invested in similar tooling for running data workflows in production, where you have to worry about starting, pausing, and restarting jobs. You also have to worry about ensuring fault tolerance of your jobs, handling notifications, and orchestrating multiple workflows to avoid “collisions.”

Solution: Here you could attempt to use Cloudera Navigator or Apache Atlas. They do some minimal tracking of data pipelines, but they really only report on lineage and don’t do anything to optimize your workloads for you. If pipeline A is dependent on pipeline B finishing before it can complete its run, this is something that you would have to figure out yourself. Navigator and Atlas won’t do it for you. The alternative is usually hand-coding scripts and manually checking dependencies. Another possibility is running traditional enterprise scheduling tools, which provide basic orchestration but are not built for managing hundreds of pipelines with different SLAs and dependencies. At the end of the day, you are either going to manage the pipelines mostly manually, using some of these tools as visual aids, or you will write a bunch of code yourself.
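To make the dependency problem concrete, here is roughly what the hand-coded route looks like for the simplest part of it, working out a safe run order. This minimal sketch uses Python's standard-library `graphlib` module (Python 3.9+); the pipeline names and the `run_order` helper are hypothetical. It covers ordering only, and says nothing about SLAs, retries, or notifications.

```python
from graphlib import CycleError, TopologicalSorter


def run_order(dependencies):
    """Return pipeline names in an order that respects their dependencies.

    `dependencies` maps each pipeline to the set of pipelines that must
    finish before it can run, e.g. {"A": {"B"}} means A waits for B.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle
        raise ValueError(f"circular pipeline dependency: {exc.args[1]}") from exc
```

For example, `run_order({"A": {"B"}, "B": {"C"}})` places C before B and B before A. Multiply this by hundreds of pipelines, each with its own schedule and failure handling, and it becomes clear why maintaining such scripts by hand does not scale.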

What you should look for here is that whatever tool or tools you use to create your ingest and transformation pipelines also feed automatically into the processes for orchestrating those pipelines in production. These capabilities are available in the market today from several companies.

The Bottom Line

The bottom line is that you don’t need nearly as much expertise as was required 5 years ago when Hadoop first started to get big. The first wave of automation came into existence about 3 years ago and automated individual slices of the big data pipeline from ingest to consumption. There is now a second wave of solution providers that don’t just automate an individual slice but automate the entire end-to-end data pipeline in a fully integrated fashion.

Regardless of whether you go with the first wave of automation, or what is now appearing as a second wave, you should not have to hand-code any of your big data pipelines either in development or in production. So if you find the majority of your big data effort turning into a coding effort in Python, Pig, Hive, Scala, etc., you are doing something wrong. Tools and platforms are now available that let your existing data and business analysts achieve a relatively high level of self-service without having to become big data experts.