Over the last decade, Yahoo has been a pioneer in the data infrastructure space and a staunch supporter of the open source developer community. It has been incredible for us to witness the growth of the “big data” space and the technologies that have evolved in the ecosystem. We are especially proud of the growth of Hadoop, a project that was first developed and open sourced at Yahoo. To this day, we still run some of the world’s largest Hadoop clusters, and we use them for everything from clickstream analysis to image processing and business intelligence analytics. Additionally, our developers continue to act as good open source citizens and contribute all of our Hadoop developments back to the community. While Hadoop still solves many critical problems in our business, as our needs have grown we’ve come to realize that Hadoop is not the be-all and end-all solution to our data problems.

The Need for Interactivity

Yahoo initially built Hadoop as an answer to a very acute pain around efficiently storing and processing large volumes of data. Ever since Yahoo open sourced Hadoop, it has become widely adopted in the technology world. However, time has taught us that when a system becomes extremely popular for solving one class of problems, its limitations in solving other problems become more apparent.

While MapReduce is a great general-purpose solution for distributed batch computing, it is not optimized for low-latency workloads: MapReduce-style queries can take minutes or even hours to complete. As our data volumes grew, we faced increasing demand to make our data more accessible, both to internal users and to our customers. Not all of our end users were back-end analysts, and many had no prior experience with traditional analytic tools, so we wanted to build simple, interactive data applications that anyone could use to derive insights from their data.

Initially, we attempted to power the data applications we wanted to build using both traditional and contemporary infrastructure choices, including Hadoop/Hive, relational databases, key/value stores, Spark/Shark, Impala, and many others. Each of these solutions has its strengths, but none of them supported the full set of requirements we had, including:

Ad-hoc slice-and-dice queries

Scaling to tens of billions of events a day

Ingesting data in real-time



It was only after some time that we stumbled across a new, then relatively unknown project called Druid.

Druid is a column-oriented, distributed, streaming analytics database designed for OLAP queries. The architecture blends traditional search infrastructure with database technologies and has parallels to closed-source systems like Google’s Dremel, PowerDrill, and Mesa. Druid excels at scanning exactly the data a query needs, and was built for fast aggregations over arbitrarily sliced and diced data. Combined with its high-availability characteristics and support for multi-tenant query workloads, Druid is ideal for powering interactive, user-facing analytic applications.
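To give an intuition for why a column-oriented layout speeds up slice-and-dice aggregations, here is a minimal Python sketch (not Druid’s actual storage format, which adds compression, indexes, and much more; the event fields are hypothetical):

```python
# Toy comparison of row-oriented vs. column-oriented scans.
# Hypothetical event data with three fields.
rows = [
    {"country": "US", "clicks": 3, "latency_ms": 120},
    {"country": "JP", "clicks": 1, "latency_ms": 95},
    {"country": "US", "clicks": 5, "latency_ms": 110},
]

# Row-oriented: every query touches whole records, even fields it never uses.
row_total = sum(r["clicks"] for r in rows if r["country"] == "US")

# Column-oriented: each field is a separate array, so a query scans only
# the columns it actually references ("country" and "clicks" here), which
# keeps aggregations over a few columns of a wide table fast.
columns = {
    "country": ["US", "JP", "US"],
    "clicks": [3, 1, 5],
    "latency_ms": [120, 95, 110],
}
col_total = sum(
    clicks
    for clicks, country in zip(columns["clicks"], columns["country"])
    if country == "US"
)

assert row_total == col_total == 8
```

The savings compound with column width: a filter-and-sum over 2 of 100 columns reads roughly 2% of the data in a columnar layout.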

Another key property we really like about Druid is its lock-free, streaming ingestion capability, which is essential when ingesting tens of billions of events a day. Moreover, Druid’s extensions allow it to ingest data directly not only from open source systems like Kafka and Storm but also from internal, proprietary systems, which means the technology fits nicely into our stack. Events can be explored milliseconds after they occur, while a single consolidated view covers both real-time events and historical events from years in the past.
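For a flavor of what querying this consolidated view looks like, a Druid native query is a JSON document posted to the broker. Here is a hedged sketch of a timeseries query; the field names follow Druid’s native query API, but the datasource name and interval are made up for illustration:

```python
import json

# Hypothetical Druid timeseries query. The structure (queryType, dataSource,
# granularity, aggregations, intervals) follows Druid's native JSON query
# API; "page_events" and the date range are invented for this example.
query = {
    "queryType": "timeseries",
    "dataSource": "page_events",
    "granularity": "hour",
    "aggregations": [
        {"type": "longSum", "name": "clicks", "fieldName": "clicks"}
    ],
    # A single interval can span both historical segments and events
    # ingested moments ago -- Druid merges the two views transparently.
    "intervals": ["2014-01-01/2015-01-01"],
}

# In practice this document would be POSTed to the broker's query endpoint
# (e.g. http://<broker>:8082/druid/v2) with Content-Type: application/json.
print(json.dumps(query, indent=2))
```

The same query shape works whether the interval ends years ago or one second ago, which is what makes the consolidated real-time/historical view usable from a single application.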

Lastly, Druid enables us to natively integrate sketches and other algorithms that we have developed. This integration allows us to maximally leverage the distributed, shared-nothing architecture that Druid provides for handling large amounts of data.
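Sketches fit a shared-nothing architecture because they are mergeable: each node can summarize its own shard of the data, and a coordinator combines the summaries. As a toy illustration in the spirit of the theta-sketch family (this is not the production algorithm), here is a minimal K-Minimum-Values distinct-count sketch:

```python
import hashlib

# Toy K-Minimum-Values (KMV) sketch for approximate distinct counting.
# Illustrative only; real sketch libraries add careful error bounds,
# compact serialization, and set operations beyond union.

K = 64  # sketch size; larger K gives better accuracy


def h(value):
    """Hash a value to a float in [0, 1)."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def sketch(values, k=K):
    """Summarize a stream by keeping only the k smallest hash values."""
    return sorted({h(v) for v in values})[:k]


def merge(a, b, k=K):
    """Merging is a set union -- each node sketches its own shard and a
    coordinator combines the results without reshuffling raw data."""
    return sorted(set(a) | set(b))[:k]


def estimate(s, k=K):
    """Estimate the distinct count as (k - 1) / (k-th smallest hash)."""
    if len(s) < k:
        return len(s)  # fewer than k distinct values seen: count is exact
    return int((k - 1) / s[k - 1])


left = sketch(range(0, 3000))       # one node's shard
right = sketch(range(2000, 5000))   # another node's shard, overlapping
approx = estimate(merge(left, right))  # true distinct count is 5000
```

Because each per-shard sketch is tiny and mergeable, the aggregation cost at query time is nearly independent of the raw data volume.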

This feature set has allowed Druid to find a home in a number of areas in and around Yahoo, from executive-level dashboards to customer-facing analytics and even some analytically-powered products. Historically, when Yahoo has found value in an open source project, we have chosen to invest resources back into the project and are currently working with the community to help push Druid’s feature development forward.

We invite you to read about some of the great conversations we’ve had around the developments in our Hadoop infrastructure. If you want to learn more about Druid specifically, check out www.druid.io.