Recently, I have been talking to a number of companies about their interest in Big Data. Virtually always the conversation revolves around implementing Hadoop, with its HDFS and MapReduce technologies at the core of the solution. Many vendors have caught the Hadoop bug and released distributions of the software, such as Cloudera, Hortonworks and Microsoft with HDInsight, among others. However, when making the decision to use Hadoop, there does seem to be a bit of an elephant in the room: Google’s Dremel and BigQuery products. Google markets these as a different and more comprehensive way to distribute, parse and analyse data than Hadoop. BigQuery is based on a technology Google invented a few years ago called Dremel, which is designed to query extremely large datasets in real time. Sounds great – but I find it interesting that the industry is carrying on with Hadoop when Google itself has gone another way. Some historical perspective would be useful here.

Hadoop’s origins lie in whitepapers Google released in 2003 and 2004, which introduced the Google File System (GFS) and MapReduce. These papers showed the world technology that Google had been working on for a number of years and that was already well established inside the company. As ever, Google didn’t want to release the actual software it used for its core products, so what the papers described was a watered-down version of what Google itself used to drive its business.

In 2006, while at Yahoo, Doug Cutting (now chairman of the Apache Software Foundation) and Mike Cafarella used Google’s whitepapers to create Hadoop, based on MapReduce and HDFS (their open-source counterpart to GFS). The Apache Software Foundation adopted Hadoop and released it to the world. Hadoop took quite a while to gain real traction in the industry (Yahoo itself only moved Hadoop into large-scale production in 2008), but it has since developed into a really strong product with a massive ecosystem of supporting tools (see Cloudera’s Hadoop schematic diagram below).

This is all great, and lots of companies and software vendors have invested a great deal of time and effort in developing the Hadoop platform into a viable commercial product that is strongly in demand. But what about the company that started this distributed computing revolution in the first place, Google?

Simply put, they went another way. Google decided that, although they were the ones who invented MapReduce and GFS, there are flaws in this approach to distributed computing that needed improving on. The biggest flaw is that Hadoop processes data in batch, not in real time. While batch processing might work for some companies, or companies can adjust their Big Data analytics to fit a batch processing model, some need real-time processing. The nature and speed of the online world – where customer trend analysis or real-time recommendation engines need to analyse data and produce instant results – means that Hadoop’s batch approach isn’t going to be fast enough. If you want to process massive amounts of data immediately and use the results in real time to affect your business decisions, you will struggle with Hadoop. So around 2009 Google moved beyond MapReduce and GFS in favour of building better, faster and more scalable technology to handle vast amounts of data at the speed modern internet-savvy businesses need. This turned out to be Dremel.

Dremel has been designed to process data at incredible speed: petabyte-scale datasets can be queried in a matter of seconds, which MapReduce just can’t do. Incorporated into a commercial product, Dremel now sits at the heart of Google’s BigQuery, and with BigQuery Google has produced a framework and API that lets users upload their data and then run very sophisticated queries, all in real time, in one tool. This presentation by Google on YouTube shows what a comprehensive tool BigQuery is: https://www.youtube.com/watch?v=QI8623HlYd4
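To give a flavour of the kind of interactive query BigQuery is built for, here is a sketch of a simple aggregation over one of Google’s public sample tables (table and column names are taken from BigQuery’s public sample data; treat the specifics as illustrative rather than a tested query):

```sql
-- Sketch: rank Wikipedia contributors by number of revisions,
-- scanning a multi-gigabyte sample table interactively.
-- Uses BigQuery's bracketed [project:dataset.table] legacy syntax.
SELECT contributor_username,
       COUNT(*) AS revision_count
FROM [publicdata:samples.wikipedia]
WHERE contributor_username IS NOT NULL
GROUP BY contributor_username
ORDER BY revision_count DESC
LIMIT 10;
```

The point is less the SQL itself than the experience: you type a familiar aggregation query against billions of rows and get an answer back in seconds, with no cluster to provision and no batch job to schedule.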

BigQuery can actually be used in conjunction with Hadoop, to query the processed datasets you might produce from MapReduce jobs. The Google whitepaper on Dremel mentions that Dremel can be used to complement batch-processed MapReduce jobs – and for the many companies that already have a Hadoop infrastructure in place, using BigQuery instead of Pig or Hive might be a useful way to go. However, if you are coming to Big Data for the first time, why would you use Hadoop as opposed to Dremel/BigQuery?

To me, the popularity of Hadoop over Dremel/BigQuery seems a bit like the VHS/Betamax argument of the 1980s (for those old enough to remember it). Betamax was recognised as the better product and a superior way to watch videos, but consumers and the market liked the VHS format, and despite its flaws, VHS won the war. On reflection, though, maybe the real argument is this: those using Hadoop are stuck on VHS, while Dremel is more akin to watching things on DVD or even MPEG-4 – a completely new generation of Big Data technology.

When even Mike Olson, CEO of Cloudera, the most popular distributor of Hadoop, said in 2009, “If you want to know what the large-scale, high-performance data processing infrastructure of the future looks like, my advice would be to read the Google research papers that are coming out right now,” then perhaps you should re-evaluate an automatic adoption of Hadoop and see whether Dremel is indeed a better way to solve your Big Data puzzle.

From a UK recruitment perspective

At the moment a large part of the IT industry hasn’t got a clue about Big Data – and those that do tend to use Hadoop. I think Google has to do a lot more work to convince the industry as a whole to use Dremel and BigQuery. However, bearing in mind the place Google has in the world today, I would expect innovative companies to check out this technology and eventually use it. Why be stuck on VHS? BigQuery only became commercially available in 2012, so it is still new technology, and adoption might start to happen in 2013.

For those wanting to get involved in Big Data and futureproof their skills, learning about Dremel and becoming a BigQuery expert might be a good investment over the next couple of years. If the industry wakes up to Dremel, which it should, then the candidates who invested the time to learn it will find their skills in extreme demand.

By the way, I wasn’t sponsored by Google to write this blog; I just think, from all the research I have done, that there might be another, and maybe better, solution to Big Data than Hadoop. You don’t have to adopt Hadoop blindly just because it is the current flavour of the month. I would love to hear your feedback.

Please follow me on Twitter: @Gavin_Badcock