For nearly a decade, Hadoop was the poster child for “big data.” It was new, it was open source, it launched an entire market of products and vendors, and it was inspired by — and in many cases, was — the technology behind the world’s largest websites. However, looking back with 20/20 hindsight, it seems clear that Hadoop was never going to live up to its lofty expectations. It was big and new, but released into a world that would soon come to value speed, flexibility, smaller building blocks, and even known quantities.

Hadoop’s path to ubiquity intersected a host of other technology shifts that as a whole would prove to be more impactful in the long run, in part by peeling off the most valuable promises of big data and making them more consumable. The story of Hadoop can help us understand why the world of data looks how it does today. It also should be a valuable lesson for anybody trying to make sense of the next big thing in enterprise IT, and the next one after that.

With so much money flying around and so much pressure for enterprises to become tech-savvy (aka digital transformation), there’s a tendency to view every new thing as the thing that’s going to make all the difference. It’s a tall order, but organizations should try to see through the hype, discern the line between what’s promised and what they actually want, and then identify the technologies that will best help them get there.

There’s always room for more …

For sure, everybody wanted (and still wants) the abilities that Hadoop initially promised. They wanted to collect lots of unstructured data from web logs, weather records, and other relatively novel sources, and analyze it to find new trends or unique business insights. Many executives wanted to become data-driven, unencumbered by those pesky gut feelings and qualitative evidence. During the infancy of big data, those were the battle cries.

It was “The Unreasonable Effectiveness of Data” brought to life and presented as enterprise IT. This wasn’t just Google’s game anymore.

But something happened within the big data world to erode Hadoop’s foundation: a distributed file system (HDFS) coupled with a compute engine for running jobs written in MapReduce, the original Hadoop programming model. Well, many things happened. Here’s an abbreviated version:

- Mobile phones became smartphones and began generating streams of real-time data.

- Social networks took off and began generating streams of real-time data.

- Cheap sensors and the Internet of Things took off and began generating streams of real-time data. “MapReduce” and “real-time” are not words often associated with each other.

- Companies were reminded that they had already invested untold billions in relational database and data warehouse technologies that actually worked pretty well. And everyone already knew SQL.

- Competitive or, at least, alternative projects such as Apache Spark began to spring up from companies, universities, and web companies trying to push Hadoop, and the whole idea of big data, beyond its early limitations.

- Venture capital flowed into big data startups. This included the startups that built themselves around Hadoop, but also flowed into new projects and related technologies. Although everyone was theoretically headed in the same direction, their incentives were sometimes at odds when it came time to grow their businesses.

The response from the Hadoop community, understandably, was to integrate with as many technologies as made sense, and to build an orchestration layer to schedule jobs from these various pieces across a large shared infrastructure. So we got Pig and Hive and SQL-on-Hadoop and YARN, and integrations with Storm and Kafka and Spark and so on. Hadoop became a true data platform, albeit notoriously complex and difficult to operate.
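To make that complexity concrete, here is a minimal, illustrative sketch of the gap those layers were bridging: a word count written in the MapReduce style as a Hadoop Streaming-style mapper and reducer in Python. The file name, the local dry-run pipeline, and the Hive query in the closing comment are illustrative assumptions rather than details from any particular deployment; SQL-on-Hadoop layers like Hive existed precisely so the same aggregation could be a one-line query instead of a custom job.

```python
#!/usr/bin/env python3
# wordcount_streaming.py -- illustrative Hadoop Streaming-style job (assumed name).
# Hadoop Streaming runs any executables that read stdin and write stdout, so the
# same pipeline can be dry-run locally:
#   cat input.txt | python3 wordcount_streaming.py map | sort | python3 wordcount_streaming.py reduce
import sys
from itertools import groupby

def mapper():
    # Emit "word\t1" for every word read from stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so all counts for a word are contiguous.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()

# The rough Hive / SQL-on-Hadoop equivalent of all of the above:
#   SELECT word, COUNT(*) FROM words GROUP BY word;
```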

No technology exists in a vacuum

Simultaneously, other things were happening, sometimes independent of the big data space, but definitely within the realm of influence for the Hadoop and overall big-data community. Essentially, our collective understanding of how technology is acquired and how applications are designed underwent several major shifts:

- Open source, now very much in the mainstream of enterprise IT, was getting better. As a result, expectations around ease of use and out-of-the-box functionality began to increase, and popular projects began to thrive outside the walls of the Apache Software Foundation and other traditional open source gatekeepers.

- Cloud computing took over the world, making it easier not just to virtually provision servers, but also to store data cheaply and to use managed services that tackle specific use cases — for example, data processing via MapReduce.

- Docker and Kubernetes were born. Together, they opened people’s eyes to a new way of packaging and managing applications and infrastructure. Better yet, they are designed to be modular and attracted huge communities, meaning that users are free to swap in different pieces or use specific features as they please — and new features come online fast.

- Deep learning brought artificial intelligence and machine learning into the spotlight. Rather than focusing on infrastructure, the discussion around deep learning was (and is) all about how models and algorithms can do complex pattern recognition without all the pesky work of hand-tuning everything. No, doing AI in production is not as easy as “just add data,” but anything that gets people talking about business opportunities rather than cluster sizes is going to be appealing.

- Microservices became the de facto architecture for modern applications, followed by the emergence of “serverless” computing and functions. Both lend themselves to the idea of event-driven architecture — that is (in a vastly oversimplified explanation), Thing A happens and Service B does something in response; a minimal sketch of that pattern follows this list.
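As a rough illustration of that “Thing A happens, Service B responds” idea, here is a minimal sketch of an event-driven, function-style handler. The event shape, the field names, and the notify_fraud_team helper are hypothetical, and real serverless platforms differ in how handlers are declared and wired up.

```python
# Minimal sketch of an event-driven handler in the style of a serverless function.
# The event schema and notify_fraud_team are hypothetical; platforms such as
# AWS Lambda use their own handler signatures and routing.

def handle_order_event(event: dict) -> dict:
    """Thing A: an 'order.created' event arrives. Service B: react to it."""
    if event.get("type") != "order.created":
        return {"status": "ignored"}

    order = event["payload"]
    if order.get("amount", 0) > 10_000:
        notify_fraud_team(order)  # hypothetical downstream reaction
        return {"status": "flagged", "order_id": order["id"]}
    return {"status": "accepted", "order_id": order["id"]}

def notify_fraud_team(order: dict) -> None:
    # Placeholder: in practice this might publish another event or call a service.
    print(f"review order {order['id']} for amount {order['amount']}")

if __name__ == "__main__":
    # Example invocation, roughly what a platform's event router would do.
    print(handle_order_event({
        "type": "order.created",
        "payload": {"id": "o-123", "amount": 25_000},
    }))
```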

Today: Events, AI, and “as-a-service”

This confluence of factors led us to where we are today. Yes, Hadoop still exists and is still evolving, but it’s nowhere near the foundational technology many predicted it would become. Rather, much of the world has embraced a faster, more modular, often simpler collection of tools and platforms, and a focus on data as a component of application architecture rather than just as something to be analyzed.

Specifically, we’re now seeing the following trends:

- Streaming data and event-driven architectures are rising in popularity. The ideas have been around for a while, but technological and architectural advances have made capabilities like stream processing and even function-based (aka “serverless”) computing a reality. In many cases, the ability to act on data quickly is more valuable than a new method for batch processing or historical data analysis.

- Apache Kafka is becoming the nervous system for more data architectures. Not only does Kafka facilitate many of the above-mentioned capabilities, but its popularity resulted in a priority shift for other projects and technology vendors. It is the data platform that everything else needs to integrate with (and once Kafka is up and running, it can handle getting data into Hadoop and other batch systems). A minimal producer/consumer sketch follows this list.

- Cloud computing dominates infrastructure, storage, and data-analysis and AI services. It’s easier and often cheaper to store data with a service like Amazon S3 than it is to manage a complex file system. And cloud providers offer a vast number of ways to analyze and model all that data, including via artificial intelligence and machine learning services. For many (but not all) companies, the downsides of managing their own data infrastructure and applications greatly outweigh the benefits.

- Relational databases — including data warehouses — are not going anywhere, and some (like Postgres) are even growing in popularity. The ease of operating them (or, in fact, not operating them) as cloud services is definitely a factor in their renaissance, as are the new capabilities influenced by Hadoop, NoSQL, and other data technologies introduced in the past decade.

- Kubernetes is becoming the default orchestration layer for everything, including data systems and AI. This reduces the need for a Hadoop-based data orchestration platform like YARN, and also encourages adoption of technologies more aligned with the cloud-native worldview (in a nutshell: microservices over monoliths, and many small clusters over one large shared cluster).
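To ground the Kafka point above, here is a minimal producer/consumer sketch. It assumes the kafka-python client, a broker at localhost:9092, and a topic named "orders"; all of those are illustrative assumptions rather than details from the article. The idea is simply that one service publishes events while any number of downstream consumers (analytics, fraud checks, or a loader feeding Hadoop or another batch system) read the same stream independently.

```python
# Minimal Kafka producer/consumer sketch using the kafka-python package.
# Broker address and topic name are illustrative assumptions.
import json
from kafka import KafkaProducer, KafkaConsumer

def publish_order():
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda obj: json.dumps(obj).encode("utf-8"),
    )
    producer.send("orders", {"id": "o-123", "amount": 42.50})
    producer.flush()  # block until the event is actually written to the broker

def consume_orders():
    # Each downstream system can be its own consumer group reading the same
    # stream independently, at its own pace.
    consumer = KafkaConsumer(
        "orders",
        bootstrap_servers="localhost:9092",
        group_id="analytics",
        auto_offset_reset="earliest",
        consumer_timeout_ms=10_000,  # stop iterating after 10s of silence so the sketch terminates
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)

if __name__ == "__main__":
    publish_order()
    consume_orders()
```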

Developing a data architecture spanning numerous services and technologies is not necessarily easy, but today’s landscape of tools does offer a host of benefits that no single, all-encompassing platform can provide. Engineers can use the tools they know and like; experiment easily enough with new things as they emerge; and feel pretty confident it’ll all work together in the end. More importantly, they can hopefully let business requirements drive adoption of new technologies, rather than letting technology decisions limit the options of what the business can do.

Hadoop opened people’s eyes to what was possible with big data, but it’s also a reminder that no single technology is going to remake the world of enterprise IT — at least not anymore.