HortonWorks is the Yahoo spin-off led by the development team responsible for much of the core architecture of Hadoop. The company has officially launched its first enterprise distribution of the open source distributed processing platform. Called the HortonWorks Data Platform (HDP), the software is already being embraced by partners who are eager to take "big data" mainstream. HortonWorks also announced a high-availability version of HDP, developed in partnership with VMware, that addresses some of the biggest issues in keeping Hadoop clusters up and running.

VMware isn’t the only company trying to hitch its cloud efforts to HortonWorks and Hadoop. The release comes as a host of enterprise and cloud companies are developing their own strategies with Hadoop in mind, including Hewlett-Packard. HP sees the open-source tool as another way to fight Oracle, the company's adversary in legal battles concerning Oracle's discontinued support for the Itanium microprocessor used in HP's servers.

Hadoop is a framework for running large-scale analytical queries, built from MapReduce jobs and other large-scale parallel tasks, against unstructured data. It was originally developed by Doug Cutting (now at HortonWorks competitor Cloudera) and others associated with Yahoo as the basis for processing crawled Web pages to index them for search. Hadoop has now found favor in a broad range of analytical processing tasks, largely because it can handle huge amounts of data inexpensively.
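For readers unfamiliar with the MapReduce model, the idea can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's actual API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    # Map step: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle step: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce step: collapse the values for one key into a single result.
    return key, sum(values)

def word_count(documents):
    # Run map over every document, then shuffle, then reduce each group.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

counts = word_count(["big data big clusters", "data at scale"])
```

Because each document is mapped independently and each key is reduced independently, both steps can in principle be spread over as many machines as are available, which is what lets the approach scale on inexpensive hardware.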

While Hadoop itself isn’t a database per se, it has been at the center of much of the "NoSQL" database movement. A number of companies have already jumped on the HortonWorks Data Platform, either to build it into their own "big data" analytical systems, or to integrate with it. "NoSQL" database players such as 10Gen (the developers of MongoDB) and analytical database vendors such as Teradata and Microsoft are working to integrate their tools into HDP to hand off complex queries to Hadoop. And HDP is the basis of Microsoft’s own plans, both for Azure and Windows servers—Microsoft has partnered with HortonWorks to build its Hadoop-based products.

The big problem with Hadoop has generally been the complexity and difficulty of managing it. Early versions of Hadoop had single points of failure that could lead to the loss of all the data dumped into a Hadoop cluster, or to the failure of complex queries. And because of the batch-oriented approach of Hadoop jobs, clusters and code generally require a good deal of tuning to get results out at anything resembling transactional speed.

HortonWorks’ high-availability system attempts to address those issues by using VMware’s vSphere virtualization platform to automate the failover and restart of all the "master services" in a Hadoop cluster. This includes NameNodes, which track the location of data within Hadoop’s storage nodes; and JobTrackers, which manage the distribution of computational tasks to computing nodes. It also uses vSphere to detect underlying operating system failures and perform server restarts.
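The failover idea described above can be illustrated with a toy model in Python. It does not use any vSphere or Hadoop API, and everything in it (the `MasterService` class, the `fail_over` function, the host names) is a hypothetical stand-in: it simply shows a monitor moving an unhealthy master service to a standby host and marking it healthy again.

```python
from dataclasses import dataclass

@dataclass
class MasterService:
    name: str            # e.g. "NameNode" or "JobTracker"
    host: str            # host the service currently runs on
    healthy: bool = True

def fail_over(services, standby_hosts):
    """Restart each unhealthy master service on the next free standby host."""
    actions = []
    spare = list(standby_hosts)  # copy so the caller's list is not modified
    for service in services:
        if service.healthy:
            continue
        if not spare:
            actions.append(f"{service.name}: no standby host available")
            continue
        new_host = spare.pop(0)
        actions.append(f"{service.name}: {service.host} -> {new_host}")
        service.host, service.healthy = new_host, True
    return actions

cluster = [
    MasterService("NameNode", "host-a"),
    MasterService("JobTracker", "host-b", healthy=False),
]
log = fail_over(cluster, ["host-c"])
```

In this sketch only the failed JobTracker is moved; the healthy NameNode is left where it is. A production system would also need to detect the failure in the first place and preserve the service's state across the restart, which the toy model leaves out.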

VMware isn’t the only virtualization player trying to make HortonWorks’ platform—and other Hadoop distributions—more manageable. Hewlett-Packard announced earlier in June that the company was integrating Hadoop into its own Converged Cloud virtualization platform in a cloud appliance called the HP AppSystem for Apache Hadoop. HP is partnered with Cloudera, but HP chief Hadoop architect Steve Watts told Ars that the company’s AppSystem was built to support HortonWorks and MapR distributions as well. HP is also providing reference configurations to customers who want to build their own optimized Hadoop implementations.

HP’s leadership sees Hadoop as a tool to sell more hardware and consulting services, to be sure. But it’s also a weapon for going after the customers of relational database vendors—especially Oracle. The benchmarks that HP used to talk about the performance of the AppSystem were aimed squarely at Oracle’s "big data" performance, claiming three to eight times better performance on the TeraSort benchmark than what Oracle has published.

"We think Hadoop is a general cross-cutting platform," Watts said. "People are using it as a low-cost data warehouse scale and cheaply store huge amounts of data." The most common adopters, he said, are "folks who are storing data in relational systems, and it’s getting prohibitively expensive."
