Introduction

There is a reason why analytics, big data & data science are considered hot fields today: the landscape of tools changes every 6 months. If we were to ask you: can you think of a complete replacement for the Hadoop ecosystem in the big data industry? Think about it before you proceed!

Pachyderm is one of the big data / analytics startups incubated in Y Combinator’s Winter 2015 batch. They claim to offer a complete replacement for the Hadoop ecosystem. They provide an open-source analytics engine that uses Docker containers for distributed computations. Their product promises the power of MapReduce without the complexity of Hadoop. This very idea caught our attention, and we decided to catch up with the management of Pachyderm to learn more about this idea and how they work.

Despite his extremely busy schedule, Mr. Joe Doliner, Co-Founder & CEO of Pachyderm, agreed to do an exclusive interview with us.

This conversation was aimed at knowing more about this exciting startup and how it is using innovation to simplify handling big data. We were enthralled by his knowledge of the big data / analytics domain. Talking with Joe gave us a lot of insight into how creativity can be applied to analytics, which we’ll share with you!

Below is a brief transcript of this conversation:

AV (Analytics Vidhya): Thanks, Joe, for taking time out of your busy schedule. Doing analytics using Docker is a new concept. Tell us more about this concept. How did you get this idea?

Joe: I’ve been interested in building a better Hadoop for a while. It always seemed needlessly inaccessible to me, which is a shame because it’s such a powerful technology if you can harness it. However, it’s also a very hard problem, and I didn’t see a way to solve it until Solomon showed me Docker.

I got an early peek at Docker when I was working at RethinkDB. This was really early on; in fact, Docker was just a file that faked the interface, but I could already tell it was going to be cool. I really wanted to play with it and had a hunch that it might be the missing piece of the puzzle, so I set to work hacking with it. That code eventually turned into Pachyderm.

AV: Tell us about your products & services. How do you plan to position them in the market?

Joe: Pachyderm is a set of tools for doing analytics with Docker. Docker’s a platform on which a lot of great things are going to be built. It’s very important that we position ourselves as part of that movement. Already we’re finding that our earliest adopters are people and companies that have heavily invested in the Docker ecosystem.

AV: I am sure you have faced many hurdles before reaching this stage. What were these hurdles? How did you tackle them?

Joe: Geez, there have been a bunch. Getting people, especially investors, to even consider that Hadoop could be replaced has been a big challenge. I wouldn’t say that one’s fully tackled yet, but the only solution is to ship products that get users excited.

AV: What would have been your plan B if this hadn’t worked out for you?

Joe: This was the plan B. I wanted to be an astronaut.

I’m actually not sure what plan B would have been. Pachyderm really just started as me hacking on a side project because it seemed interesting. Even if we hadn’t gotten into YC or gotten investments from anyone, I’m pretty sure I’d still be hacking on it, although I’d certainly have less time to do so. That’s the nice thing about personal projects: the only way they don’t work out is if you give up on them.

AV: According to your website, ‘Pachyderm will eventually be a complete replacement for the Hadoop ecosystem’. How do you plan to achieve that?

Joe: By empowering open-source developers. That’s how Hadoop got to where it is today. Hadoop actually started out as a component of another project, Nutch, but it was very hackable and people wound up repurposing it to their needs. We’ve built Pachyderm from the ground up to be easily hackable because that’s the only way we can succeed. Open-source, when done correctly, lets thousands of developers all work together to create something great. That’s really the only way to create an ecosystem: you can’t just hire that many people, and even if you could, good luck managing that team.

AV: What will be the impact of Docker on the current set of software available in the analytics industry? Do you think Docker can change the way the analytics industry currently operates?

Joe: The first-order effect is that a lot of things should get less annoying. Things like setting up standard environments so that results can easily be reproduced are low-hanging fruit that Docker can already solve well.

The second-order effect, though, is that Docker, and the ecosystem surrounding it, are making distributed systems an order of magnitude easier. That’s going to change the way people handle big data sets. A few years from now, data scientists will be able to click a few buttons and be running R in parallel over petabytes of data. Bringing about this second-order effect is Pachyderm’s reason for existing.

AV: Where do you see the big data industry heading 3 years down the line? 5 years down the line?

Joe: I expect we’ll see it pervading companies more. Over the last 5 years we’ve seen big data explode on the back of a few well-understood, important use cases. Things like optimizing user revenue and churn, those were the low-hanging fruit. But over the next 5 years we’re going to see the long tail, in which companies discover all the other ways data can inform their decisions. This is also going to turn employees who can do a bit of data science on top of their normal responsibilities into really valuable assets, since they’ll be able to see the ways that data can help the team. I’m probably preaching to the choir on this one, but if you’ve been thinking about brushing up on your data skills, now is a good time to do it.

AV: What challenges are on your mind at present, and what are your plans for the next 3 years?

Joe: As I mentioned before, we live and die by our community, so nailing that experience is our biggest challenge and probably will be for the next several years. Jump-starting those developer communities is always hard for small companies. That’s one of the biggest things we’ll be grading ourselves on in the next few years.

Hiring is also a big immediate challenge that we face. But that’s every tech company’s biggest problem.

AV: I see a lot of people who are interested in making a shift to big data analytics. What guidance would you offer them?

Joe: First off, I’d 100% recommend that they take the plunge. Second, I’d recommend thinking in terms of projects rather than abstract goals like “learn tool X.” I’ve never been able to learn things when I think about it like that. You’ll fare a lot better if you set out to learn something cool from a data set and use that as an excuse to learn tool X. Actually, early on, Pachyderm was an excuse for me to learn Docker and Go. The third thing is to look out for the good communities. Long term, the most relevant tools always wind up being the ones with the best communities around them.

P.S. Our efforts don’t end here. We plan to cover more exciting startups in the coming days. Do subscribe!

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on Twitter, or like our Facebook page.