Spark and big data discovery: An evolutionary perspective

A lot has been written recently about the rapid change that is occurring within the big data analysis space. Biological evolution can be a useful metaphor in understanding this kind of sweeping technological change. For example, we can trace every mammal living today—from dogs, horses and lions to whales and everyone reading this blog post—back 160 million years to unimpressive creatures. Similarly, we can trace servers, laptops, tablets and smartphones back to the earliest computing machines from the mid-20th century. In those cases, great diversity arose from humble beginnings.

If we look at the recent evolution of business requirements related to big data, and the workflows that support those requirements, we see a similar progression. Organizations that once had a limited number of analytics users and use cases have grown significantly more data-driven and data-dependent in recent years. Today, they need to put the ability to analyze data into the hands of more business analysts, data scientists and other stakeholders and decision makers than ever before. Organizations need rapid access to all their data without intervening latencies related to data preparation. And they need rapid and highly iterative results.

Enabled by Apache Spark, big data discovery has emerged as a new approach that addresses these deep and diverse requirements. To understand the true significance of big data discovery, it may be helpful to backtrack through the evolutionary steps leading up to it.

The age of curated data

We begin with the era of the enterprise data warehouse (EDW). With a set of business intelligence (BI) tools to go with it, a conventional EDW is a curated environment containing a critical subset of the overall enterprise data set. It is a subset by definition, owing to original limitations of EDW infrastructure, modeling requirements of BI tools and the reality of server and storage costs. The model was developed to support highly structured data, and it requires implementing a complex schema to make sense of that data.

As a result, analysts have traditionally needed a fairly clear idea of what question or questions they wanted to ask, and have had to define them up front. Beginning with a new data set, finding the answer to even a single question required the participation of an entire team and could take months from inception to first results. The supporting workflow required extract, transform and load (ETL) programmers, data warehouse architects and administrators, and BI architects and administrators. After an extensive ETL process, the sought-after subset (perhaps only 10 percent of the organization's total data) would be available within the data warehouse and BI tools. Because of the complexity of the process and the long time frames involved, iterating with follow-up questions could prove difficult and time-consuming.

The age of full data sets

More recently, we have seen the emergence of the data lake, which is, in evolutionary terms, a very different beast. First enabled by Apache Hadoop and the dramatic storage cost savings it provides (up to 1,000 times in some cases), the data lake serves as a common repository for all of an organization's data, not a subset. Between Hadoop and the availability of inexpensive servers and affordable cloud deployments, the data lake solves the storage limitations that plagued the EDW era. But on its own, it does little to address the complexity and latency issues encountered in the data warehouse model.

Without the predefined schema and careful curation of data that the EDW model provided, Hadoop data requires additional preparation before it can be analyzed. To perform SQL queries on Hadoop data, organizations first need data administrators to organize that data in SQL-on-Hadoop systems. This approach works well for producing traditional BI reports, but it adds time and complexity to the process and doesn't support discovery or iteration.

The age of big data discovery

To enable a workflow that truly leverages the advantages of a Hadoop-based data lake, businesses need a set of tools that can open up all the assets in the data lake to everyone in the organization who needs them. They need to make analysis accessible and iterative. And they need a workflow that reduces the need for many specialized resources, placing core analytical capability into the hands of power business analysts. Businesses also need to empower these analysts to be citizen data scientists, who can free actual data scientists to pursue complex analysis rather than spending their time performing data preparation.

With Spark, a data lake can become a true big data discovery environment. Spark's emergence as a processing framework for big data is a game changer because its in-memory engine and built-in libraries for SQL, machine learning and streaming enable large-scale, iterative analysis across the enterprise. Spark aligns the data lake model with these rapidly evolving business requirements. It also reduces or eliminates the need for complex and time-consuming MapReduce programming, allowing organizations to draw on more common and readily available skill sets and resources when deploying analytics on Hadoop. Spark helps eliminate data silos while simplifying data preparation and enabling accelerated, iterative analytics for data scientists. It also enables data transformation capabilities that make it possible to prepare petabyte-scale data within an integrated workflow.

The evolution continues

Evolution shows us how we came to have a planet populated by a huge variety of mammals and a computing device landscape with incredible diversity, richness and capability. Big data discovery is just beginning to open up a similarly diverse and powerful set of capabilities for organizations looking to make the most of the assets in their data lakes. Learn more about how Platfora is defining the new era of Spark-enabled big data discovery and the new opportunities big data discovery represents.

Spark your way to amazing discoveries
