As it stands today, the big data ecosystem is just too large, complex and redundant. It’s a confusing market for companies who have bought into the idea of big data, but then stumble when they are faced with too many decisions, at too many layers in the technology stack.

The big data ecosystem has too many standards. It has too many engines. It has too many vendors. The ecosystem, as it exists right now, alienates customers, inhibits funding of customer projects, and discourages political support for them within organizations. So what are you, the user, to do?

The Big Data Ecosystem: Too Many Choices

BI/Analytics: At the top of the stack, there are seemingly endless choices. Whether they are enterprise BI stalwarts, BI 2.0 challengers, or big data analytics players, the sheer number of vendors and their similar positioning make it really hard for customers to choose. It’s difficult to distinguish between solutions – even significantly different ones – when the messaging and imagery are so similar.

Distributions: Move down in the stack and there’s plenty to choose from at the Hadoop and Spark distribution layer. It’s difficult enough that the “big three” (Cloudera, Hortonworks and MapR) each offer their own distributions of Hadoop, with Spark integrated. But add in other offerings from IBM and the cloud players, large and small, and things get a little crazy. What’s difficult for the customer here is that the cores of these stacks differ in their makeup and/or include different versions of the very same components.

Execution Engines: And speaking of components, we have too many execution engines, too. Hadoop shifted from MapReduce to Tez. Then Spark established itself. And now, it seems, Apache Flink is waiting in the wings. On the streaming side, Apache Storm, NiFi, Spark and Kafka, in various combinations, vie for mindshare. And while big data machine learning started with Apache Mahout, it seems to be shifting to Spark MLlib and elsewhere. Then there are the permutations. For example, Spark can run on YARN, Hadoop 2.0’s resource manager. But it doesn’t have to. And when you use the cloud-based Spark offering from Databricks (the company founded by Spark’s creators), it doesn’t.

SQL, Datasets and Streams: And while SQL made its way into the big data conversation to make it all “easier” to use by leveraging existing skillsets, there are too many SQL-on-big-data solutions too. Should you use Hive, or Spark SQL? If you do use Hive, should you use it on MapReduce, or Tez? Plus, don’t forget Impala. Or HAWQ, Apache Drill, Presto and all the SQL-on-Hadoop bridges from the big database vendors, including Teradata, HP, Microsoft, Oracle and IBM. Let’s not even get into the fact that using SQL can be antithetical to Hadoop and its unique benefits. Yet another layer of confusion. Even within a well-defined stack with a small number of components, fragmentation can be rampant. In the Spark world, you can use Resilient Distributed Datasets (RDDs), DataFrames or Datasets. And Spark developers can use the new Spark Structured Streaming for data in motion. But what about Kafka Streams? Those are shiny and new too.

To Code or Not to Code: When it comes to programming languages, should you code in R or Python? What about Scala? And for that matter, why not throw enterprise developers a bone and let them use Java and even C# to write their big data code? There’s control to be had here but at the cost of self-service and enabling more people within your organization.

Best Practices in Moving Forward in the Big Data Ecosystem

Yes, things are in some disarray, but they are far from hopeless. We can clean up this mess, and we can let the significant value that the big data ecosystem has created stand out. At Hadoop Summit San Jose, I presented some ideas for how we, as vendors, analysts, venture capitalists, and everyone else who makes up this big data ecosystem, can make the situation better. But more importantly, I outlined some tips and tricks for customers who are currently attempting to navigate these murky waters.

Big Data Ecosystem Best Practice #1: Always Start With a Use Case

Don’t get sold on shiny tech. In a recent Gartner survey, by far the top big data challenge cited by respondents was “determining how to get value from big data” (58% of respondents). How do you remedy that? Always start by defining your use case, then work your way toward finding the technology that will support it.

Big Data Ecosystem Best Practice #2: Consider Control Vs. Democratization

As hinted at above, it may be tempting to give yourself and your team fine-grained control with tools that let you write code. But be wary of how much control you actually need. Is the greater good better served by getting data into the hands of more people in the organization with self-service tooling? Search for the right balance.

Big Data Ecosystem Best Practice #3: Think Future-Ready

We’ve already seen it. The industry is contracting, expanding, contracting, expanding. That’s why it’s incredibly important that as you evaluate your technology purchase, you look for signs the technology itself is “future-proof” or “future-ready” through modular, “pluggable” architecture. Because, while you may not want to leap on the next shiny new project or standard, you’ll want the option to migrate to it as it becomes prudent to do so.
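To make “pluggable” a little more concrete, here is a minimal Python sketch of the idea. Everything in it is hypothetical and for illustration only: the `Engine` contract, the two stand-in engines and the `run_job` function are not taken from any vendor’s product. The point it tries to show is that the application depends on a small contract rather than on a specific engine, so swapping engines comes down to changing one configuration value.

```python
from collections import Counter
from typing import Dict, Iterable, Protocol


class Engine(Protocol):
    """Hypothetical contract that every pluggable engine must satisfy."""

    def word_count(self, lines: Iterable[str]) -> Dict[str, int]:
        ...


class LoopEngine:
    """Stand-in for today's engine: a plain Python loop."""

    def word_count(self, lines: Iterable[str]) -> Dict[str, int]:
        counts: Dict[str, int] = {}
        for line in lines:
            for word in line.split():
                counts[word] = counts.get(word, 0) + 1
        return counts


class CounterEngine:
    """Stand-in for a newer engine: same contract, different internals."""

    def word_count(self, lines: Iterable[str]) -> Dict[str, int]:
        return dict(Counter(word for line in lines for word in line.split()))


# Registry keyed by a configuration string (an assumption of this sketch);
# adding an engine means adding one entry here.
ENGINES: Dict[str, Engine] = {
    "loop": LoopEngine(),
    "counter": CounterEngine(),
}


def run_job(engine_name: str, lines: Iterable[str]) -> Dict[str, int]:
    """Application code: depends only on the Engine contract, not on a vendor."""
    return ENGINES[engine_name].word_count(lines)


if __name__ == "__main__":
    sample = ["big data big choices", "data everywhere"]
    print(run_job("loop", sample))
    print(run_job("counter", sample))
```

Real platforms are far more involved than this, but the same question applies when you evaluate them: how much of your own code would have to change if the engine underneath were replaced?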

Click to view the slides from my Hadoop Summit session.