BLOG@CACM Possible Hadoop Trajectories

Hadoop has spread rapidly in the last few years as a platform for parallel computation in Java. As such, it has achieved the goal of bringing parallel processing to the millions of programmers in the Java community. Prior attempts to do this (Java Grande, JavaHPC) have not succeeded, and we applaud Hadoop for its success in this area, which we believe is due largely to the simplicity and accessibility of its environment.

However, we see lots of improvement that will be required for serious production use, at least in the science domain of a large national laboratory, such as Lincoln Lab, where one of us works. Briefly, the dominant use cases for Hadoop in our environment are parallel computation, mostly scientific analytical codes, and information aggregation and rollups of data sets.

We will discuss these two use cases in turn.

Hadoop Computation

Many of our scientific codes organize a collection of nodes into a 2-D (or 3-D or N-D) rectangular partitioned grid. Then, each node executes the following template;

Until end-condition {

Local computation on local data partition

Output state

Send/receive data to/from a subset of nodes holding other data partitions

}

This template describes most computational fluid dynamics (CFD) codes, essentially all atmospheric and ocean simulation models, linear algebra operations, sparse graph operations, image processing, and signal processing. When considering this class of problems in Hadoop, the following issues arise:

The local computation is very stateful from one iteration step to the next. Preserving state across successive MapReduce steps requires writing it to the file system, an expensive alternative. Also, the above codes require direct node-to-node communication, which is not supported in the MapReduce framework. Third, these codes bind a particular computation to the same node across iterations. Again, this is not supported in the MapReduce model.

We estimate that the MapReduce model works well for about 5% of Lincoln Lab users. The other 95% must shoehorn their computation into this model, and pay 1-2 orders of magnitude in performance as a result. Few scientists are willing to take this hit, except on very small data sets.

Many argue that performance does not matter. That may be true at the low end; however, for the loads we see at Lincoln Labs as well as the data centers we are familiar with, performance matters a lot, and there are never enough computing resources. Our institution is leading a collaborative investment of $100 million to place our next generation supercomputing center near a hydroelectric dam to reduce its carbon footprint by an order of magnitude. The performance loss inherent in Hadoop is an impossible-to-absorb penalty.

Even at lower scale, it is extremely eco-unfriendly to waste power using an inefficient system like Hadoop.

In summary, we see the following steps in Hadoop adoption in the computation arena.

Step 1: Adopt Hadoop for pilot projects.

Step 2: Scale Hadoop to production use.

Step 3: Hit the wall, as the above problems become big issues.

Step 4: Morph to something that deals with our issues.

At Lincoln Labs we have projects at all four steps. Survival of Hadoop in our environment will require major surgery to the parallel computation model, complementing the current Hadoop work on the task scheduler. Our expectation is that solving these issues will make current Hadoop unrecognizable in future systems.

It is possible that other shops have a job mix that is more aligned with the current MapReduce framework. However, our expectation is that we are more the norm than the exception. The evolution of Google away from MapReduce to other models lends credence to this supposition. Hence, we fully expect a dramatic future evolution of the Hadoop computation framework.

Hadoop Data Management

Forty years of DBMS research and enterprise best practices has confirmed the position advocated by Ted Codd in 1970: Programmer and system efficiency are maximized when data management operations are coded in a high-level language and not a record-at-a-time language. Although the MapReduce model is higher level than a record-at-a-time system, it is obviously way easier to code queries in Hive than by using MapReduce directly. Hence, we see essentially all Hadoop data management moving to higher level languages, i.e., to SQL and SQL-like languages.

For example, according to David Dewitt [1], the Facebook Hadoop cluster is programmed almost exclusively in Hive, a high-level data access language that looks very much like SQL. At Lincoln Labs a similar trajectory is occurring, although the higher-level language of choice is not Hive, but is instead a high level sparse linear algebraic interface to the data [2,3].

As such, the current MapReduce model becomes an interface that is internal to the DBMS. In other words, Hive users don’t care what is underneath the Hive query language, and the current MapReduce interface disappears into the innards of a DBMS. After all, how many of you actually care what wire protocol is used by a parallel DBMS to communicate with worker code at individual nodes?

One of us has written five parallel DBMSs, and is obviously quite familiar with the protocol required to communicate between a query coordinator and multiple workers at various local nodes. Moreover, worker nodes must communicate with each other to pass intermediate data around. The following characteristics are required for a high performance system:

Stateful worker nodes, which can retain state across successive steps in a distributed query plan

Point-to-point communication

Binding query processing to data that is local to the node.

Roughly speaking, a DBMS wants the same kinds of features as the scientific codes noted above. In summary, MapReduce is an internal interface in a parallel DBMS, and one that is not well suited to the needs of a DBMS.

Some of us wrote a paper in 2009 comparing parallel DBMS technology with Hadoop [4]. In round numbers DBMSs are faster by 1-2 orders of magnitude. This performance advantage comes from indexing the data, making sure that queries are always sent to the nodes where data resides and not the other way around, superior compression, and superior protocols between worker nodes. As near as we can tell, the situation in 2012 is about the same as 2009; Hadoop is still 1-2 orders of magnitude off the mark. Anecdotal evidence abounds. For example, one large Web property has a 5 Pbyte Hadoop cluster deployed on 2700 nodes; a second has a 5 Pbyte instance supported by a commercial DBMS. It uses 200 nodes, a factor of 13 less.

Therefore, we see the following trajectory in Hadoop data management:

Step 1: Adopt Hadoop for pilot projects.

Step 2: Scale Hadoop to production use.

Step 3: Observe an unacceptable performance penalty.

Step 4: Morph to a real parallel DBMS.

Most Hadoop sites are somewhere between steps 2 and 3, and “hitting the wall” is still in their future. Either Hadoop will grow into a real parallel DBMS in the necessary time frame or users will move to other solutions, built with replacement parts inserted into Hadoop or with interfaces to Hadoop to ingest data, or in some other way. Given the slow progress we have observed in the last three years, our money would be on the second outcome.

In summary, the Gartner Group has formulated the well-known “hype cycle” [5] to describe the evolution of a new technology from inception onward. Current Hadoop is promised as the “best thing since sliced bread” by its advocates. We hope that its shortcomings can be fixed in time for it to live up to its promise.

References

[1] http://searchsqlserver.techtarget.com/news/2240126683/Cloaked-in-secrecy-Microsoft-project-aims-to-wed-SQL-NoSQL-databases

[2] Jeremy Kepner, et. al., “Dynamic Distributed Dimensional Data Model (D4M) Database and Computation System,” ICASSP, March 25-30, 2012

[3] http://accumulo.apache.org

[4] Andy Pavlo, et. al., “A Comparison of Approaches to Large Scale Data Analysis,” Proc. SIGMOD 2009, Providence, RI, June 2009.

[5] http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp

Disclosure

In addition to being an adjunct professor at the Massachusetts Institute of Technology, Michael Stonebraker is associated with four startups that are either producers or consumers of database technology. Jeremy Kepner is a senior technical staff member at MIT, and a research affiliate at the MIT Mathematics Department and the MIT Computer Science and AI Lab.

Comments

Anonymous

"First they ignore you, then they laugh at you, then they fight you, then you win."

Agreed #Hadoop = eco-unfriendly, but will all respect you didn't get what #Hadoop is all about. Hint is not Hive.

Anonymous

The 2700 node Hadoop cluster was assembled when? 4 years ago? You can put together a 5PB Hadoop cluster using 200 24TB nodes using today's hardware, which is the same hardware that the anecdotal 200 node parallel DBMS uses.

Anonymous

surprised by a lack of a SciDB plug.

Anonymous

Have you considered the BSP model (implemented by Apache Hama and others) which addresses most of the mentioned gaps? Preserving state across steps, node-to-node communication, binding computation to a particular node across iterations.

Anonymous

Hi I'm Thomas from the Apache Hama team, looks like MapReduce does not suit your problem well.

As the poster above me already told you, BSP does suit your needs a bit better. You have the messaging between the tasks and a synchronization primitive.

We are constantly improving our framework and plan to graduate within the next month.

If you're and the Lincoln Labs are interested I'd be glad to invite you to our mailing lists:

http://incubator.apache.org/hama/mail-lists.html

And/or have a look if the programming model suits your needs:

http://incubator.apache.org/hama/hama_bsp_tutorial.html

Thanks for that article!

Displaying all 5 comments