Combining two Elasticsearch features, the sampler aggregation and random scoring, helps create efficient estimated facets and insights while significantly reducing the cost of heavy, slow aggregations.

Random Sampling

Consider the following scenario: you have an Elasticsearch-based search engine service that returns not only regular search results but also facets and insights, built on Elasticsearch aggregations at a very large scale. You want to provide those insights on every query, without knowing in advance how many results a query will return.

While a home page or a dashboard can be handled with some form of pre-fetching and caching, you cannot tell in advance which query a user will enter. Nor can you predict how heavy the aggregations will be, or how much they will slow down the whole query. A user's filter may exclude only a small percentage of the total results, making the query almost as heavy as a match-all query that aggregates over every document in the index.

These heavy queries are not only very slow but also very CPU-intensive, and they carry risks such as node crashes due to heap out-of-memory errors.

Yet in most cases a highly accurate aggregation result is not that important, and many Elasticsearch aggregations are approximate by design anyway.

For example, when estimating the number of documents in a bucket, does it really matter whether you report exactly 58,730,244 documents rather than an estimate of around 60,000,000?

Moreover, the most relevant question is typically something like "What are the top X buckets for some field?", together with each bucket's overall percentage. If a bucket holds 1,200 of 10,000 sampled documents, for instance, you can report it as roughly 12% of all matching documents.

Motivation: execute the heavy aggregations on a sample of the data and get approximate results that closely represent the distribution of the real data.
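
To make this concrete, here is a minimal sketch of the pattern combining the two features. A function_score query with random_score gives every document a pseudo-random score, so the sampler aggregation, which normally keeps the top-scoring documents per shard, effectively collects a random sample. The index name "donors", the local cluster URL, and the seed are assumptions for illustration; the donor_state field comes from the data set introduced below, and the snippet assumes the elasticsearch-py client with its 7.x-style body parameter.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

body = {
    "size": 0,
    # random_score assigns each document a pseudo-random score,
    # so "top scoring" becomes "randomly chosen".
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "random_score": {"seed": 42, "field": "_seq_no"},
        }
    },
    "aggs": {
        "sample": {
            # the sampler keeps at most shard_size docs per shard...
            "sampler": {"shard_size": 10000},
            "aggs": {
                # ...and the heavy aggregation runs only on that sample
                # (assumes donor_state is mapped as a keyword field)
                "top_states": {"terms": {"field": "donor_state"}}
            },
        }
    },
}

resp = es.search(index="donors", body=body)  # "donors" is a placeholder index name

# report each bucket as a share of the sample, not as an absolute count
sample = resp["aggregations"]["sample"]
for bucket in sample["top_states"]["buckets"]:
    share = 100.0 * bucket["doc_count"] / sample["doc_count"]
    print(f'{bucket["key"]}: ~{share:.1f}% of sampled documents')

Because each shard contributes at most shard_size documents, the cost of the terms aggregation stays bounded no matter how many documents match the query.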

All examples in this article rely on data from the DonorsChoose.org data set. The original data set is used, with one small manipulation:

# sort by state so documents are indexed grouped by donor_state
data = data.sort_values("donor_state")

This gets the data indexed in a non-random order, making it resemble a real-life system where data is not evenly distributed across timestamps or sources.
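
For completeness, here is a minimal sketch of how the sorted data might be indexed. The CSV file name is a placeholder, and the index name "donors" and local cluster URL are the same assumptions as above; adapt them to your own setup.

import pandas as pd
from elasticsearch import Elasticsearch, helpers

# "Donors.csv" is a placeholder; use the DonorsChoose.org file
# that contains the donor_state column.
data = pd.read_csv("Donors.csv")
data = data.sort_values("donor_state")

es = Elasticsearch("http://localhost:9200")

# Bulk-index in the sorted order, so documents of the same state are
# stored next to each other, like data arriving in batches in a real system.
helpers.bulk(
    es,
    (
        {"_index": "donors", "_source": row.dropna().to_dict()}
        for _, row in data.iterrows()
    ),
)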