Yahoo! is creating a new company with its core Hadoop engineering team, seeking to rapidly expand the scope of the open source distributed number-crunching platform and ultimately bring it to a much wider audience. In growing the Hadoop "ecosystem" through increased work on the core Apache-based open source project, the company hopes to eventually make its money by providing training and support for the platform.

"We believe that you should be able to get a fully-working version of Hadoop from Apache. There should not be any missing functionality," says Yahoo! vice president of engineering Eric Baldeschwieler, who will become the new company's CEO. "So, anything that's necessary to making Hadoop a complete, horizontal offering, we intend on building it in open source."

It's a commercial open source pitch of the purest kind. But it will be years before we can judge whether such an idealistic plan will actually work – and there's no guarantee the company will stick to the pitch.

The new company will be known as Hortonworks, a reference to the titular elephant from Dr. Seuss's Horton Hears a Who. Hadoop is named for a yellow stuffed elephant that once belonged to the son of project founder Doug Cutting.

Bearden hears a Hadoop

In late April, The Wall Street Journal reported that Yahoo! was "weighing" a Hadoop spinoff, and that it was discussing the possibility with Silicon Valley venture capital firm Benchmark Capital. At the time, Yahoo! would neither confirm nor deny the possibility to The Register. But earlier this week, GigaOM revealed that a Benchmark-backed venture was indeed on the way.

After hiring Doug Cutting in January 2006, Yahoo! bootstrapped the Hadoop project at Apache, and it is still the project's largest contributor. The platform has long underpinned Yahoo!'s online infrastructure, and for a while, the company offered its own Hadoop distro, based on the version of the software it ran internally. But in February, it discontinued this offering, choosing to put its weight behind the core Apache project. Somewhere along the way, Benchmark Capital approached the company about building a new startup around the project.

Benchmark was previously involved in such open source outfits as Red Hat, JBoss, SpringSource, and MySQL. Benchmark's Rob Bearden – who will serve as the chief operating officer of Hortonworks – played the same role at SpringSource before the Java framework house was sold to VMware. In the wake of the VMware acquisition, Bearden tells The Register, he and his colleagues began looking for the "biggest opportunity" in today's enterprise market, and they eventually settled on Hadoop.

"We looked at a lot of things, around social media and things like that," he says. "But it was very obvious, very quickly that being able to manage 'Big Data' is the biggest problem that CIOs have to solve, and they are looking for a new platform to do that with, as opposed to their existing relational [database] and [business intelligence] technologies. It was clear that Hadoop was the way they wanted to solve the problem."

The core Hadoop project is essentially a means of processing large amounts of data across clusters of low-cost machines. Consisting of the HDFS distributed file system and the Hadoop MapReduce platform that operates atop HDFS, it "maps" data-crunching tasks across a collection of distributed machines, splitting them into tiny sub-tasks, before "reducing" the results into one master calculation.
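For readers who want to see the map-then-reduce idea in miniature, here is a rough single-machine sketch in Python. It is not Hadoop code and uses none of Hadoop's APIs. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-counting task are assumptions chosen purely for illustration. In a real cluster, each chunk would live in HDFS and each map call would run on a different machine.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map step: turn one chunk of text into (word, 1) pairs.

    In a real cluster each chunk would be handled by a separate machine;
    here the chunks are simply processed one after another.
    """
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the intermediate pairs by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    """Run map, shuffle, and reduce over a list of text chunks."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Two hypothetical input chunks standing in for blocks of a much larger file.
chunks = ["horton hears a who", "hadoop hears a horton"]
counts = word_count(chunks)
print(counts)
```

The point of the split is that the map calls never need to talk to one another, which is what lets the work spread across many cheap machines before the results are combined.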

Benchmark considered investing in Cloudera, a Northern California startup that has already commercialized Hadoop. But Bearden and Benchmark didn't agree with the Cloudera business model. Cloudera uses what's sometimes called an "open core" model, offering its own open source Hadoop distro as well as a for-pay enterprise version of the platform that includes some additional proprietary tools.

"Our experience is that you have to have a pure-play model," Bearden says. "You have to be a packager or a distributor or you have to be an owner-creator. And to be that owner-creator, you have to have a majority of the committers under your company umbrella, and you have to embrace the open source methodology and the open source community."

There's a bit of a contradiction there. But the aim is to take hold of a majority of the open source project's core committers and expand the project as quickly as possible. Yahoo! had provided about 70 per cent of the Hadoop commits, and Benchmark felt this was the place to make things happen. It approached Yahoo! with the idea, and eventually, Yahoo! bit.

"It was a [pitch] well received," Bearden says. "A lot of the same thoughts were being explored at Yahoo!" Roughly twenty-five of Yahoo!'s Hadoop engineers will move to Hortonworks, including Baldeschwieler. Yahoo! will invest in the new company, which is expected to launch in July, and naturally, it will be a close partner. Baldeschwieler tells us that Hortonworks is getting Yahoo!'s "core expertise", but that some engineers on the fringes of Yahoo!'s Hadoop work will remain at the company.

Whose project is it, anyway?

Bearden insists that Hortonworks will not be a Hadoop consultant. It will provide Hadoop training and high-level support. But at least in the beginning, he says, the company's primary concern will be expanding the Apache Hadoop project. "As we make Hadoop more consumable as a platform, we create a vast ecosystem of companies and individuals that can build applications on it. Initially, we are going to be focused on the ease-of-consumption and productization of Hadoop for both the enterprise and the ecosystem in general."

Nonetheless, this puts Hortonworks in competition with Cloudera – an outfit founded by an all-star lineup of former Yahoo!, Google, Oracle, and Facebook employees – and EMC, which recently announced a for-pay Hadoop offering based on technology from Valley startup MapR. Currently, Cloudera provides support, services, and software for about 90 customers running the platform. EMC has yet to actually ship its Hadoop product, but thanks to MapR, it will provide key improvements to the Hadoop platform that are sure to please enterprise customers. The rub is that these improvements are closed source.

Despite Yahoo!'s claim to 70 per cent of Apache Hadoop commits, the open source project isn't necessarily centered on Yahoo!. In 2009, Doug Cutting left Yahoo! for Cloudera, where he's still on staff, and the startup also employs project cofounder Mike Cafarella. Facebook is another heavy contributor, and the platform is widely used by many other big web names.

Hadoop is based on research papers describing two of Google's proprietary back-end software platforms: GFS, its distributed file system, and MapReduce, the number-crunching piece. Cutting started the project for use with Nutch, his open source web crawler, but it grew into a much larger project when he joined Yahoo!. It now underpins Twitter and eBay as well as Facebook and Yahoo!.

Since the project was founded, it has been joined by myriad sister projects, including HBase (a real-time database based on Google BigTable), Hive (a SQL-like query language developed at Facebook), Sqoop (a MySQL connector built by Cloudera), Hue (a graphical user interface), and Zookeeper (a means of juggling distributed services from a central location that's based on Google's Chubby platform). ®