Histograms are an important part of metrics and monitoring. Unfortunately, it is computationally expensive to calculate a histogram or quantile value exactly. A common workaround is to use an approximation, and t-digest is a great one. Unfortunately again, no approximation is perfectly accurate. There are various t-digest implementations in Go, but they do things slightly differently, and there is no way to track how accurate their approximations are over time, nor a great way for consumers of quantile-approximation libraries to benchmark them against each other.

Using Benchmark.ReportMetric

Luckily, Go 1.13’s new ReportMetric API allows us to write traditional Go benchmarks against custom metrics, such as quantile accuracy. The API is simple and intuitive, taking a value and a unit.

Values that scale linearly with the number of operations, like execution time or memory allocation, should be divided by b.N when reported. Since correctness isn’t a value that scales linearly, I calculate it once for a specific number of values and report each combination as an individual benchmark.
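As a sketch of what reporting a non-scaling metric looks like (the percentDifference helper and the stand-in quantile values here are my own illustrations, not the post’s actual code), testing.Benchmark lets us exercise ReportMetric outside of go test:

```go
package main

import (
	"fmt"
	"math"
	"testing"
)

// percentDifference is a hypothetical correctness metric: how far the
// approximate quantile is from the exact one, as a percentage.
func percentDifference(exact, approx float64) float64 {
	return math.Abs(approx-exact) / math.Abs(exact) * 100
}

func main() {
	// testing.Benchmark runs a benchmark function outside `go test`.
	res := testing.Benchmark(func(b *testing.B) {
		exact, approx := 100.0, 118.0 // stand-in quantile values
		// Correctness does not scale with b.N, so report it directly
		// instead of dividing by the iteration count.
		b.ReportMetric(percentDifference(exact, approx), "%difference")
	})
	fmt.Println(res.Extra["%difference"])
}
```

Custom metrics reported this way show up in the BenchmarkResult’s Extra map and in the benchmark output line.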

Creating benchmark dimensions

The core problem is “How correct are t-digest implementations in various scenarios?” To quantify this, I break the problem down into a set of dimensions and report data along each one.

Dimension set: Implementation

I benchmark the following t-digest implementations for correctness:

To create a dimension out of implementations, I convert each implementation to a common base interface.

The type digestRun will be my dimension for the digest I’m testing against.
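As a sketch of this shape (the interface’s methods, the digestRun fields, and the exactDigest stand-in are my assumptions, not the post’s actual code), each library gets wrapped to a small common interface and paired with a name:

```go
package main

import (
	"fmt"
	"sort"
)

// tDigest is a minimal common interface each t-digest implementation
// can be wrapped to satisfy (method names are illustrative).
type tDigest interface {
	Add(v float64)
	Quantile(q float64) float64
}

// digestRun is one point along the "implementation" dimension: a
// human-readable name plus a constructor for a fresh digest.
type digestRun struct {
	name   string
	digest func() tDigest
}

// exactDigest is a stand-in implementation that stores every value,
// so its quantiles are exact (useful as a correctness baseline).
type exactDigest struct{ vals []float64 }

func (e *exactDigest) Add(v float64) { e.vals = append(e.vals, v) }

func (e *exactDigest) Quantile(q float64) float64 {
	sort.Float64s(e.vals)
	return e.vals[int(q*float64(len(e.vals)-1))]
}

func main() {
	runs := []digestRun{
		{name: "exact", digest: func() tDigest { return &exactDigest{} }},
	}
	d := runs[0].digest()
	for i := 1; i <= 100; i++ {
		d.Add(float64(i))
	}
	fmt.Println(runs[0].name, d.Quantile(0.5))
}
```

The constructor field matters: each benchmark run should start from a fresh digest so earlier runs can’t pollute later ones.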

Dimension set: numeric series

How quantiles perform depends a lot on the order and nature of the input data. This makes the number series another dimension of my benchmark.

I decided to benchmark against the following patterns of numbers:

linearly growing sequence

random values

a repeating sequence

an exponentially distributed sequence

tail spikes: mostly small values and a few large ones

The type sourceRun becomes my dimension for the source of the data.

Dimension set: size and quantile

The last and simplest dimensions are how many values I add before I evaluate correctness and which quantile I evaluate correctness against. These become my dimensions size and quantile.
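These two dimensions are just slices of values. The specific numbers below are my guesses (size=1000000 and quantile=0.999 do appear in the sample benchmark output later in the post):

```go
package main

import "fmt"

// Hypothetical dimension values for the size and quantile dimensions.
var (
	sizes     = []int{1_000, 1_000_000}
	quantiles = []float64{0.5, 0.99, 0.999}
)

func main() {
	fmt.Println(len(sizes)*len(quantiles), "size/quantile combinations")
}
```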

Combining the dimensions together

From all of these dimensions I can use sub-benchmarks to create my final benchmark. Notice how each dimension becomes its own sub-benchmark.

As I add dimensions to my benchmarks, I create another b.Run. I use the pattern key=value, as recommended in the Go benchmark data format proposal. In the innermost loop of my sub-benchmarks, I compare the actual quantile result, calculated with allwaysCorrectQuantile, to the quantile result for the given digest. To hide the default ns/op metric from my benchmark results, I report a zero value with b.ReportMetric(0, "ns/op").
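As a sketch of that nesting (the helper names, the stand-in “digest”, and the dimension values are my own illustrations, and only two dimensions are shown; each extra dimension nests another b.Run the same way):

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"testing"
)

// exactQuantile computes the true quantile by sorting a copy of vals.
func exactQuantile(vals []float64, q float64) float64 {
	sorted := append([]float64(nil), vals...)
	sort.Float64s(sorted)
	return sorted[int(q*float64(len(sorted)-1))]
}

// percentDifference reports how far approx is from exact, in percent.
func percentDifference(exact, approx float64) float64 {
	return math.Abs(approx-exact) / math.Abs(exact) * 100
}

func BenchmarkCorrectness(b *testing.B) {
	for _, size := range []int{1000, 10000} {
		b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
			// A linearly growing series; the real benchmark also loops
			// over source and digest dimensions here.
			vals := make([]float64, size)
			for i := range vals {
				vals[i] = float64(i)
			}
			for _, q := range []float64{0.5, 0.99, 0.999} {
				b.Run(fmt.Sprintf("quantile=%f", q), func(b *testing.B) {
					exact := exactQuantile(vals, q)
					approx := exact * 1.001 // stand-in for a digest's answer
					b.ReportMetric(percentDifference(exact, approx), "%difference")
					b.ReportMetric(0, "ns/op") // suppress the timing column
				})
			}
		})
	}
}

func main() {
	// testing.Benchmark lets the sketch run outside `go test`.
	testing.Benchmark(BenchmarkCorrectness)
	fmt.Println("benchmarks ran")
}
```

The Sprintf("quantile=%f", q) naming is what produces segments like quantile=0.999000 in the output line shown later.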

Running the benchmarks

Since these are Go benchmarks, I can run them just like any other. One thing I do for my benchmark runs is filter out all the tests so that only the benchmarks run. Here is what my Makefile has.
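A sketch of such a Makefile target (the target name and exact flags are my assumptions):

```make
bench:
	go test -run='^$$' -bench=. ./... > benchresult.txt
```

The -run='^$$' regex matches no test names ($$ escapes $ inside make), while -bench=. matches every benchmark, and the raw output is saved to a file for graphing later.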

Here is what one line of the benchmark output looks like:

BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata/quantile=0.999000-8 1000000000 0.118 %difference

This line of benchmark output gives us one datapoint:

Value=0.118 %difference

benchmark=BenchmarkCorrectness

size=1000000

source=exponential

digest=influxdata

quantile=0.999
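To see how such a benchmark name decomposes into that datapoint, here is a quick parsing sketch (my own illustration; benchdraw has its own parser):

```go
package main

import (
	"fmt"
	"strings"
)

// parseBenchName splits a Go benchmark name into its key=value parts,
// dropping the trailing -N GOMAXPROCS suffix.
func parseBenchName(name string) map[string]string {
	name = name[:strings.LastIndex(name, "-")]
	parts := strings.Split(name, "/")
	out := map[string]string{"benchmark": parts[0]}
	for _, p := range parts[1:] {
		kv := strings.SplitN(p, "=", 2)
		out[kv[0]] = kv[1]
	}
	return out
}

func main() {
	point := parseBenchName("BenchmarkCorrectness/size=1000000/source=exponential/digest=influxdata/quantile=0.999000-8")
	fmt.Println(point["benchmark"], point["digest"], point["quantile"])
}
```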

Using benchdraw

There is a large amount of benchmark output. To visualize it I use benchdraw. Benchdraw is a simple CLI built for Go’s benchmark output format that draws “good enough” graphs and bar charts from dimensional data.

Here is an example benchdraw command that generates the ns/op graph below. It plots the source dimension on the X axis and ns/op on the Y axis, leaving the digest dimension as the bars.

benchdraw --filter="BenchmarkTdigest_Add" --x=source < benchresult.txt > pics/add_nsop.svg

ns/add operation

You can find lots more documentation on the benchdraw GitHub page.

Correctness results

The correctness of an implementation’s approximation algorithm depends a lot on the nature of the data and the quantile we’re looking at. Plotting all the data at once, without axis modifiers, shows clear outliers in the segmentio and influxdata implementations that make them difficult to compare. For all correctness results, lower (a smaller % difference) is better.