Believe it or not, it’s been over a decade since Hadoop entered the data management world. While there are still a few organizations running Proofs of Concept (PoCs) to evaluate whether it’s a fit for their enterprise, I would say that, at least in the developed world, Hadoop has largely become an integral part of analytical eco-systems.

After all, a decade in the field of IT is a lifetime, and I guess it’s also a good point to look back and analyze what Hadoop has given to the data management world, and what lies ahead.

What has Hadoop achieved?

The Data Lake Vision

I believe the most lasting achievement of Hadoop is enabling the data lake vision. One can argue that the data lake is a utopian concept, and that what we actually have are data reservoirs, but that’s just semantics. The key point is that it has broadened the boundaries of the analytical eco-system. Had it not been for Hadoop, social media and IoT data, along with decades-old historical data, might never have made it into the analytical eco-system. Data which was considered useless is now considered useful, or at least it’s thought that it will become useful someday.

In other words, it has changed the way we look at data.

A new Mindset – The Hadoop Mindset

Hadoop’s second key achievement, again not directly related to technology, is what I call the Hadoop Mindset. The traditional relational database management space was quite rigid. If, let’s say, we need new functionality in IBM DB2, we have to wait until IBM sees ROI in adding that functionality, and there is always a possibility that it might never happen.

The Hadoop Mindset is that if it’s not there, let’s add it, now! It doesn’t rely on one company. If one company doesn’t see ROI in doing it, perhaps another will. If nobody does, there is always, at least theoretically, the possibility of adding that functionality yourself. This has taken the pace of development to a whole new level. In the Hadoop mindset, everything is possible; it doesn’t take no for an answer.

The reasons behind Hadoop's success

No, it's not Hadoop's technology

Hadoop, like any other data processing system, consists of a file system (HDFS), a processing engine (MapReduce, or more recently Spark, if you want to call that Hadoop too) and a collection of data governance and data ingestion software.
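To make that division of labour concrete, here is a minimal word-count sketch in PySpark: the file system (HDFS) holds the raw data, while the processing engine distributes the map and reduce steps across the cluster. This is just an illustrative sketch; the application name and HDFS paths are hypothetical placeholders, not taken from any particular deployment.

```python
# Minimal MapReduce-style word count in PySpark.
# The HDFS paths below are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.textFile("hdfs:///data/sample.txt")
counts = (
    lines.flatMap(lambda line: line.split())   # map: emit each word
         .map(lambda word: (word, 1))          # map: pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # reduce: sum the counts per word
)
counts.saveAsTextFile("hdfs:///data/wordcount_out")
spark.stop()
```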

Now, if we look beyond the hype: is HDFS technically superior to all other file systems ever produced? Is storing large amounts of data a new thing? Is storing binary data or semi-structured data like HTML a new thing? Are the MapReduce/Spark processing frameworks faster than IBM mainframes or Teradata MPP platforms? In other words, is Hadoop technology better than what the database world, with the likes of IBM, HP, Teradata and Oracle, has produced over decades?

The answer, perhaps surprisingly, is invariably no.

Contrary to general perception, Hadoop is not, as such, a groundbreaking technology.

Even prior to Hadoop we had petabyte-scale data stores. In fact, IBM GPFS, which was created in 1998, can store up to 8 yottabytes. In terms of processing speed, I guess it’s even more obvious that Hadoop is nowhere close to the MPP engines we’ve had for decades. Oracle, IBM, SAP, HPE, Teradata and Microsoft would all have hundreds of use cases where they have outperformed Hadoop in terms of data processing.

We have to realize, and I think this is largely overlooked, that the success of Hadoop is not down to its technology. It’s not a revolutionary technology: MapReduce has been around for years, and in-memory processing like Spark (again, if you consider that Hadoop too) has been around for a while as well. There hasn’t been any radical development in terms of parallel data processing or disk storage.

The two key factors responsible for Hadoop's success are:

Cost

It’s the low cost of data storage that has enabled the data lake vision. Even prior to Hadoop, it was technically possible to store and process huge amounts of data; in fact, as far back as a decade ago we had petabyte-scale data stores operating pretty efficiently. But for most use cases the ROI wasn’t good enough to justify storing such huge amounts of data.

Hadoop has lowered the cost of storing data significantly, which in turn is responsible for this influx of data into analytical systems. What didn’t make economic sense before now makes perfect business sense. And that, I would say, is one of Hadoop’s most important success factors. The timing is also perfect: right when we have a proliferation of data, we have a data store which is cost effective.

Open Source

Prior to Hadoop, all major data management solutions were proprietary and usually owned by one organization. Even for a minor change in functionality, you were dependent on that organization’s R&D. If they saw the ROI in adding the functionality, it would typically still take quite some time; if they didn’t, you just had to live with what you had. This made progress in the data management world slow, and made it less flexible and robust.

Hadoop is not owned as such by anyone, and at least theoretically just about anyone can add new functionality to the Hadoop eco-system. This gave new momentum to the data management world. We saw an unprecedented amount of development being done in the data world, and for the most part it was driven by the open source aspect of Hadoop. The likes of Oracle, IBM, Teradata and EMC all worked together on improving one solution: Hadoop. And the number of start-ups which contributed to making Hadoop successful was incredible.

The Future ...

That all said, if we look at Hadoop usage trends to date, it’s still largely not used as an end-to-end analytical platform. The exceptions are the likes of Facebook, Yahoo and LinkedIn, who have an army of developers to customize any solution. For traditional enterprise customers, e.g. in the financial, retail and telecom sectors, it’s almost always used in conjunction with a traditional, time-tested database management solution, the latter being used for the more time-critical or stricter Service Level Agreement (SLA) based analytics requirements.

In the foreseeable future, I don’t see that changing.

For Hadoop to cross that bar and become a complete analytical platform, it would need considerable investment. And that investment, in all likelihood, would not keep the Hadoop eco-system free and open source. In fact, even now, in most enterprises the Hadoop eco-system contains a fair amount of licensed proprietary components.

The Hadoop eco-system would become more proprietary

To begin with, the core Hadoop distribution providers like Cloudera, IBM, Oracle and MapR already have different license structures for different features. Most enterprise features do not come under a free software license like the Apache License v2 (ALv2). And where we see a real proliferation of licensed proprietary products is in the areas of data ingestion and data governance around Hadoop solutions; arguably it’s this area which will truly make Hadoop enterprise ready. Wandisco, Palantir, Podium, Alation, Datameer, Teradata QueryGrid, IBM BigSQL, Trifacta and Snaplogic are just a few examples.

We’ll see this trend grow, with a lot more licensed solutions coming into the Hadoop world. From a totally free, open source solution, it will transform into a hybrid eco-system where a lot of products are licensed and proprietary.

Hadoop would become more expensive

Contrary to general perception, the Total Cost of Ownership (TCO) of Hadoop has never been low. Initially, when Hadoop was used only as a sandbox environment, enterprise customers saved on license and platform costs but paid heavily for Hadoop skills. At the end of the day, it was an expensive sandbox environment.

The demand and supply of Hadoop skills have since become more balanced, and services costs have gone down considerably since the early days. However, Hadoop use cases are now evolving from sandboxing to core analytics. This shift will require more robust enterprise features and support contracts, which in all probability will result in an increase in TCO. Just to give an example of how this shift in use cases increases cost: five years back nobody even talked about a Disaster Recovery (DR) solution for Hadoop; now it is becoming an important requirement.

Lastly, as discussed above, more and more proprietary components are being added to the Hadoop eco-system. Most of them have a different cost model than the Apache-licensed products and will push up the overall cost.

In short, the cost of owning a Hadoop solution in a large enterprise would grow significantly. In the future, we’ll see fewer “free” and “open source” products within the Hadoop eco-system, the two key aspects which made this eco-system popular.

Cost of traditional RDBMS would go down

On the other hand, the cost of traditional RDBMS solutions would go down. We’ll also see more openness in the proprietary world. That doesn’t mean traditional RDBMS products would become open source, but their vendors will try to make them more extensible and provide more flexibility for customization.

The core products would also start incorporating many of the “big data” features at a much faster pace, e.g. supporting new real-time data ingestion protocols and different semi-structured data formats.
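As a small illustration of that direction, here is a hedged sketch using Python’s built-in sqlite3 module: a JSON payload stored in an ordinary relational table and queried with SQL JSON functions. It assumes the bundled SQLite build includes the JSON1 functions (true for most recent Python distributions); the table and field names are made up for the example.

```python
import sqlite3

# Semi-structured JSON stored and queried inside a plain relational table.
# Assumes the SQLite build shipped with Python includes the JSON1 functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"user": "alice", "device": "mobile", "clicks": 3}',),
)
conn.commit()

# Pull individual fields out of the JSON payload as if they were columns.
for row in conn.execute(
    "SELECT json_extract(payload, '$.user'), json_extract(payload, '$.clicks') FROM events"
):
    print(row)  # ('alice', 3)
```

Commercial and open source engines such as Oracle, SQL Server and PostgreSQL offer comparable, and richer, JSON support; the point is simply that semi-structured data is no longer exclusive to Hadoop-style stores.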

This, together with more competitive cost, would make the RDBMS a direct competitor of Hadoop. The gap in Total Cost of Ownership between a traditional RDBMS product and a Hadoop solution would become significantly smaller. The competition would then not be on non-functional aspects like cost and open source but solely on technology. It would be interesting to see whether Hadoop would still be able to make headway then.

The innovation would slow down

The technical definition of Hadoop has evolved significantly over the years.

Even of the two founding technologies which constituted Hadoop (MapReduce and HDFS), the original MapReduce has almost totally been replaced by improvements like Tez or new technologies like Spark. What we called Hadoop ten years back and what we consider Hadoop today are two different things. We can safely say that Hadoop has developed faster than any other data product in history.

That said, innovation in the Hadoop core product has slowed down significantly in the recent past. We still see a lot of development activity, but it’s mostly focused on improvement, not innovation. This trend, I would say, is in line with the development lifecycle of most products: once a product is established, the focus is more on optimization than innovation. In other words, we will perhaps see things like headphone jacks getting replaced by earbuds, but a groundbreaking technology in the core product is now less likely.

#hadoop #bigdata #databases #rdbms #nosql
