I have been developing applications on top of Hadoop for a while, using all the components usually provided by Hadoop distributions: YARN, MapReduce, Spark, ZooKeeper, HBase, HDFS, Solr. Most of the time people use Hadoop clusters for running batch or real-time data-intensive jobs, which is completely okay, but what I really like about Hadoop is the concept of the distributed platform that sits behind it. With all these platform components it becomes much easier to develop fully distributed applications, whatever their nature. In addition, given the mainstream adoption of Hadoop, deploying them is just as easy: distributions like Cloudera or Hortonworks make installing a new cluster-wide application a piece of cake. What previously took days to deploy can now be done in minutes.

All this made me think several times about how powerful this platform can be, not just for crunching data but for any sort of distributed application. If you think about it, the platform provides all sorts of facilities to simplify the life of a developer, and the best part is that your application can be installed in a snap with little or no configuration. Liking this idea, I started using Hadoop not just for data applications, but for any distributed software that needs to be installed in an enterprise environment. I can use YARN directly to host the execution of my code (with long-running containers or containers allocated ad hoc, similar to what you can do with AWS Lambda), ZooKeeper to discover and synchronize my nodes, HDFS to share files across the containers, and so on.

This worked very well, but I noticed the lack of an application layer that could simplify and speed up my development work. Whenever I wrote distributed code (on Hadoop or not), I always ended up reinventing the wheel for discovering and interacting with services across the cluster. Since I mostly develop on top of the JVM, I wanted an application layer that would take that repetitive work off my hands. In short, a Spring-like framework specifically designed for writing Hadoop-based applications.

My first question was: what are the basic services that any distributed application needs? This is the first list I compiled:

Service definition and service discovery: I want to be able to define and expose services, no matter where they reside. They could be sitting in the same JVM instance as the client, or on a remote instance, but I want to be able to look up and use a service seamlessly, independently of its location. Since I mostly work with JVM-based languages, the most natural way of defining services for me is using Java interfaces.

Synchronous and asynchronous service communication: Once I have discovered a service I want to be able to use it. This seems simple, but there are many factors to consider. Communication can be synchronous or asynchronous. In the latter case it may or may not be persistent (backed by a persistent queue). To make life more complex, different policies might be required when invoking a service. Some services are okay with a simple round-robin invocation policy, while others require more complex policies such as sharding or a failover mechanism with leader election.
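To make these two requirements a bit more concrete, here is a minimal sketch in plain Java. It is not the BeansZoo API: the names `ServiceSketch`, `Greeter` and `Registry` are made up for illustration, and the in-memory registry only stands in for what would really be cluster-wide discovery (for example backed by ZooKeeper). It shows a service defined as a Java interface, a lookup that hides where the instances live, a round-robin invocation policy, and the same service called both synchronously and asynchronously.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicInteger;

public class ServiceSketch {

    /** A service is defined as a plain Java interface (hypothetical example). */
    public interface Greeter {
        String greet(String name);
    }

    /** Hypothetical in-memory registry; a real one would discover remote instances. */
    public static final class Registry {
        private final Map<Class<?>, List<Object>> instances = new ConcurrentHashMap<>();

        /** Exposes an implementation under its service interface. */
        public <T> void register(Class<T> type, T instance) {
            instances.computeIfAbsent(type, k -> new CopyOnWriteArrayList<>()).add(instance);
        }

        /** Returns a proxy that spreads calls over all registered instances, round-robin. */
        public <T> T lookup(Class<T> type) {
            AtomicInteger next = new AtomicInteger();
            InvocationHandler handler = (proxy, method, args) -> {
                List<Object> targets = instances.get(type);
                if (targets == null || targets.isEmpty()) {
                    throw new IllegalStateException("No instance of " + type.getName());
                }
                int index = Math.floorMod(next.getAndIncrement(), targets.size());
                try {
                    return method.invoke(targets.get(index), args);
                } catch (InvocationTargetException e) {
                    throw e.getCause(); // rethrow the service's own exception
                }
            };
            Object proxy = Proxy.newProxyInstance(
                    type.getClassLoader(), new Class<?>[] {type}, handler);
            return type.cast(proxy);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        // Two "nodes" exposing the same service.
        registry.register(Greeter.class, name -> "node-1 says hello to " + name);
        registry.register(Greeter.class, name -> "node-2 says hello to " + name);

        // The client only sees the interface, not the location of the instances.
        Greeter greeter = registry.lookup(Greeter.class);

        // Synchronous calls, alternating between the two instances.
        System.out.println(greeter.greet("Ann"));
        System.out.println(greeter.greet("Bob"));

        // Asynchronous (non-persistent) call through the same proxy.
        CompletableFuture<String> reply =
                CompletableFuture.supplyAsync(() -> greeter.greet("Cid"));
        System.out.println(reply.join());
    }
}
```

The client code never changes when instances are added or moved; only the registry and the invocation policy behind the proxy do. Sharding or leader-based failover would be alternative policies plugged into the same handler, and a persistent asynchronous call would put a durable queue between the proxy and the target instead of a `CompletableFuture`.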

If you think about it, these are all generic concepts that you will encounter in any distributed application, but implementing and testing them every time takes a lot of effort. This has been my main motivation for developing BeansZoo, a library that I keep reusing whenever I develop distributed applications. The whole idea behind it is that once you cover the scenarios above, you have covered most of the cases you will encounter when developing and running a distributed application. The library is still in its infancy, but it is the result of many years of writing distributed code and reinventing the wheel all the time.