By Himanshu Gupta



Millions of users around the world interact with Yahoo through their web browsers and mobile devices, generating billions of events every day (e.g. clicking on ads, clicking on various pages of interest, and logging in). As Yahoo’s data grows larger and more complex, we are investing in new ways to better manage and make sense of it. Behavioral analytics is one important branch of analytics in which we are making significant advancements, and is helping us accomplish these tasks.

Beyond simply measuring how many times a user has performed a certain action, we also try to understand patterns in their actions. We do this in order to help us decide which of our features are impactful and might grow our user base, and to understand responses to ads that might help us improve users’ future experiences.

One example of behavioral analytics is measuring user retention rates for Yahoo properties such as Mail, News, and Finance, and breaking down these rates by different user demographics. Another example is to determine which ads perform well for various types of users (as measured by various signals), and to serve ads appropriately based on that implicit or explicit feedback.

The challenges we face in answering these questions mainly concern storing and interactively querying our user-generated events at massive scale. We heavily make use of distributed systems, and Druid is at the forefront of powering most of our real-time analytics at scale.

One of the features that makes Druid very useful is the ability to summarize data at storage time. This leads to greatly-reduced storage requirements, and hence, faster queries. For example, consider the dataset below:



This data represents ad clicks for different website domains. We can see that there are many repeated attributes, which we call “dimensions,” in our data across different timestamps. Now, most of the time we don’t care that a certain ad was clicked at a precise millisecond in time. What is a lot more interesting to us, is how many times an ad was clicked over the period of an hour. Thus, we can truncate the raw event timestamps and group all events with the same set of dimensions. When we group the dimensions, we also aggregate the raw event values for the “clicked” column.



This method is known as summarization, and in practice, we see summarization significantly reduce the amount of raw data we have to store. We’ve chosen to lose some information about the time an event occurred, but there is no loss of fidelity for the “clicked” metric that we really care about.



Let’s consider the same dataset again, but now with information about which user performed the click. When we go to summarize our data, the highly cardinal and unique “user-id” column prevents our data from compacting very well.



The number of unique user-ids could be very high due to the number of users visiting Yahoo everyday. So, in our “user-id” column, we end up effectively storing our raw data. Given that we are mostly interested in how many unique users performed certain actions, and we don’t really care about precisely which users did those actions, it would be nice if we could somehow lose some information about the individual users so that our data could still be summarized.



One approach to solving this problem is to create a “sketch” of the user-id dimension. Instead of storing every single unique user-id, we instead maintain a hash-based data structure – also known as a sketch – which has smaller storage requirements and gives estimates of user-id dimension cardinality with predictable accuracy.

Leveraging sketches, our summarized data for the user dimension looks something like this:



Sketch algorithms are highly desirable because they are very scalable, use predictable storage, work with real-time streams of data, and provide predictable estimates. There are many different algorithms to construct different type of sketches, and a lot of fancy mathematics goes into detail about how sketch algorithms work and why we can get very good estimations of results.



At Yahoo, we recently developed an open source library called DataSketches. DataSketches provides implementations of various approximate sketch-based algorithms that enable faster, cheaper analytics on large datasets. By combining DataSketches with an extremely low-latency data store, such as Druid, you bring sketches into practical use in a big data store. Embedding sketch algorithms in a data store and persisting the actual sketches is relatively novel in the industry, and is the future structure of big data analytics systems.

Druid’s flexible plugin architecture allows us to integrate it with DataSketches; as such, we’ve developed and open sourced an extension to Druid that allows DataSketches to be used as a Druid aggregation function. Druid applies the aggregation function on selected columns and stores aggregated values instead of raw data.

By leveraging the fast, approximate calculations of DataSketches, complex analytic queries such as cardinality estimation and retention analysis can be completed in less than one second in Druid. This allows developers to visualize the results in real-time, and to be able to slice and dice results across a variety of different filters. For example, we can quickly determine how many users visited our core products, including Yahoo News, Sports, and Finance, as well as see how many of those users returned some time later. We can also break down our results in real-time based on user demographics such as age and location.

If you have similar use cases to ours, we invite you to try out DataSketches and Druid for behavioral analytics. For more information about DataSketches, please visit the DataSketches website. For more information about Druid, please visit the project webpage. And finally, documents for the DataSketches and Druid integration can be found in the Druid docs.

