Let’s build a modern Hadoop

Not just another Hadoop rant — I promise

If you’ve been around the big data block, you’ve probably felt the pain of Hadoop, but we all still use it because we tell ourselves, “that’s just the way infrastructure software is.” However, in the past decade, infrastructure tools ranging from NoSQL databases, to distributed deployment, to cloud computing have all advanced by orders of magnitude. Why have large-scale data analytics tools lagged behind? What makes projects like Redis, Docker and CoreOS feel modern and awesome while Hadoop feels ancient?

Hadoop has an irreparably fractured ecosystem.

Modern open source projects espouse the Unix philosophy of “Do one thing. Do it really well. Work together with everything around you.” Every single one of the projects mentioned above has had a clear creator behind it from day one, cultivating a healthy ecosystem and giving the project direction and purpose. In a flourishing ecosystem, everything integrates together smoothly to offer a cohesive and flexible stack to developers.

Hadoop never had any of this. It was released into a landscape with no cluster management tools and no single entity guiding its direction. Every major Hadoop user had to build the missing pieces internally. Some were contributed back to the ecosystem, but many weren’t. Facebook, which probably runs the biggest Hadoop deployment in the world, forked Hadoop six years ago and has kept its fork closed source.

This is not how modern open source is supposed to work. I think it’s time to create a modern Hadoop, and that’s exactly what we’re trying to do at Pachyderm. Pachyderm is a completely new storage and analytics engine built on top of modern tools. The biggest benefit of starting from scratch is that we get to leverage amazing advances in open source infrastructure, such as Docker and Kubernetes.

This is why we can build something an order of magnitude better than Hadoop. Pachyderm can focus on just the analytics platform and use powerful off-the-shelf tools for everything else. When Hadoop was at this stage, its developers had to build everything themselves, but we don’t. The rest of this essay is our blueprint for a modern data analytics stack. Pachyderm is still really young, and open source projects need healthy discussion to continue improving. Please share your opinions and help us build Pachyderm!