In big data analysis, some queries do not scale because computing exact results requires enormous compute resources and time. Examples include count distinct, quantiles, most frequent items, joins, matrix computations, and graph analysis.

If approximate results are acceptable, there is a class of specialized algorithms, called streaming algorithms or sketches, that can produce results orders of magnitude faster and with mathematically proven error bounds. For interactive queries there may be no other viable alternative, and for real-time analysis, sketches are the only known solution.
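To make the idea concrete, here is a minimal sketch of one classic approach to approximate count distinct, the K-Minimum Values (KMV) technique. It is an illustration written for this post, not the DataSketches implementation: the class name `KMVSketch`, the parameter `k`, and the choice of SHA-256 as the hash function are all assumptions. The sketch keeps only the `k` smallest hash values it has seen, so memory stays fixed no matter how many items stream through, and the relative error shrinks roughly as 1/sqrt(k).

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum Values sketch for approximate distinct counting."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest retained hash is at index 0
        self._members = set()  # retained hashes, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1] (SHA-256 is an assumption).
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return  # already retained; duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the largest retained one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: count is exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest


if __name__ == "__main__":
    sketch = KMVSketch(k=1024)
    for i in range(200_000):
        sketch.update(f"user-{i % 50_000}")  # 50,000 distinct values, each seen 4 times
    print(f"estimated distinct count: {sketch.estimate():.0f}")
```

The trade-off is the one described above: the exact answer would require remembering every distinct item, while this structure holds at most `k` values and returns an estimate whose error is governed by `k` alone.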

For any system that needs to extract useful information from big data, these sketches are a required toolkit that should be tightly integrated into its analysis capabilities. This technology has helped Yahoo reduce data processing times from days to hours or minutes on a number of its internal platforms.

DataSketches has been accepted into the Apache Incubator, and I will be a mentor of the Apache DataSketches project. I will guide the DataSketches community in aligning with the Apache Way.

[1] https://datasketches.github.io/

[2] https://medium.com/@furkankamaci/open-source-software-development-and-apache-incubator-372cc90081ae