Why we don’t sample or aggregate our metrics

Some practices persist simply because things have always been done that way, or because a standard emerged and we grew accustomed to it. Sampling and aggregating metrics has become the de facto standard in traditional monitoring tools. Even as other observability tools move away from write-time aggregation, they continue to operate on samples.

At IOpipe, we have chosen to implement Whole Event Observability: data collection that is complete, unsampled, and non-aggregated.

Along with real-time metrics and mobile-first design, we believe this is a core tenet of serverless observability and should be part of any monitoring and debugging solution for serverless.

The legacy of sampling

Typically, sampling refers to collecting or storing only a subset of data, often aggregating it during transport or storage. Sometimes the data is left non-aggregated but is still only a subset of events or metrics matching a rule.

Organizations have sampled for two main reasons:

Complexities, limitations, and costs of ingestion

Cost and efficiency of storage and retrieval

Serverless, streaming, and database technologies have minimized these problems overall. Whole Event Observability still costs more for data storage and retrieval than legacy approaches, but these technologies have made that cost manageable.

Don’t discard your metrics

Most traditional monitoring solutions index metrics as time-series data, storing aggregated samples as time slices. Data in such systems has cardinality based on time, not on whole events.

These systems discard information for efficiency, but what happens when you are trying to understand which users were impacted during a performance issue with a serverless function? Time-series data is not enough.

At IOpipe, we do not discard this information; we collect and store events as individual documents.

CloudWatch metrics: a traditional time-series view
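
To make the contrast concrete, here is a minimal sketch in Python (the field names are illustrative, not IOpipe's actual schema) of what an aggregated time-series sample loses compared to a whole-event document:

```python
# A traditional time-series datapoint: one aggregated sample per time slice.
# Everything about the individual invocations is already gone.
aggregated_sample = {
    "metric": "lambda.duration",
    "timestamp": "2019-06-01T12:00:00Z",  # one-minute slice
    "avg_ms": 212.4,
    "p99_ms": 1890.0,
    "count": 431,
}

# A whole-event document: one record per invocation, with full context.
# Cardinality comes from the event itself, not from the time bucket.
invocation_event = {
    "timestamp": "2019-06-01T12:00:03.117Z",
    "function": "checkout-handler",
    "duration_ms": 1874.2,
    "cold_start": True,
    "trigger": {"type": "apigateway", "method": "POST", "resource": "/orders"},
    "custom_metrics": {"order_id": "A-10293", "customer_id": "cust-4471"},
}
```

The sample can tell you that p99 latency spiked at noon; the event documents can tell you which orders and customers were affected.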

Reading whole event data

Even with whole-event collection, we still aggregate data to provide useful histograms, averaging many events into what might be considered samples. However, this aggregation happens at read time (or against a secondary datastore), not at write time.
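
As a minimal sketch of read-time aggregation, assuming events stored as dictionaries with a duration_ms field:

```python
from collections import Counter

def duration_histogram(events, bucket_ms=100):
    """Aggregate raw invocation events into a duration histogram at read time.

    The stored events stay intact; only this query result is aggregated.
    """
    buckets = Counter()
    for event in events:
        bucket = int(event["duration_ms"] // bucket_ms) * bucket_ms
        buckets[bucket] += 1
    return dict(sorted(buckets.items()))

# Three raw events, aggregated on demand rather than at write time.
events = [{"duration_ms": 84.0}, {"duration_ms": 131.5}, {"duration_ms": 1874.2}]
print(duration_histogram(events))  # {0: 1, 100: 1, 1800: 1}
```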

High-cardinality, event-based data collection indexes many data points without sampling, keyed both on time and on their relation to a single event. Databases well suited to storing documents are a good fit for such data. At IOpipe, we use Elasticsearch, but S3, PostgreSQL, and other databases are also valid choices, offering different pricing and availability models.

Invocation record, an event used for generating time-series data
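
As a minimal sketch of document-per-event storage, using the elasticsearch-py 7.x client (the index name and fields are hypothetical, not IOpipe's actual schema):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

invocation_event = {
    "timestamp": "2019-06-01T12:00:03.117Z",
    "function": "checkout-handler",
    "duration_ms": 1874.2,
    "custom_metrics": {"order_id": "A-10293"},
}

# Each invocation is stored as its own document, so every field stays
# queryable at full cardinality; nothing is pre-aggregated on the write path.
es.index(index="invocations", body=invocation_event)
```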

Collecting high-cardinality events

The challenge with collecting so much data is always in understanding it.

If, however, you set goals for how the data will be used before deciding to collect it, the problem becomes far more manageable. At IOpipe, we only collect data that we have already written valuable queries against. Even so, everything we collect is written in a non-aggregated, unsampled way with whole-event collection.

Kinesis and Kafka are readily available solutions for ingesting and streaming data into databases in efficient batches. Coupled with serverless compute, real-time streams of billions of records per month can be handled for a bill of only hundreds of dollars, as sketched below.

From Erica’s talk: “Ingesting Billions of Events without Breaking a Sweat!”
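
As a rough sketch of the consumer side of such a pipeline, assuming JSON events on the stream (bulk_write is a hypothetical stand-in for your datastore's bulk API):

```python
import base64
import json

def bulk_write(documents):
    """Hypothetical helper: write the whole batch in one round trip,
    e.g. via the Elasticsearch _bulk endpoint or a batched S3 put."""
    print(f"writing {len(documents)} documents")

def handler(event, context):
    """AWS Lambda handler invoked with a batch of Kinesis records."""
    documents = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        documents.append(json.loads(payload))

    # One inexpensive invocation moves the entire batch at once.
    bulk_write(documents)
    return {"ingested": len(documents)}
```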

Perhaps unsurprisingly, the storage and retrieval of data remains the hardest cost to wrangle. Yet, with the growth of big data, solutions have arisen. Whether you choose S3 with S3 Select and Athena, Elasticsearch, or another database, I believe these costs are manageable.
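
For example, a rough sketch of querying events at rest with S3 Select via boto3 (the bucket, key layout, and fields are hypothetical):

```python
import boto3

s3 = boto3.client("s3")

# Assumed layout: one newline-delimited JSON file of invocation events
# per hour. S3 Select filters server-side, so only matching events
# cross the wire.
response = s3.select_object_content(
    Bucket="my-events-bucket",
    Key="invocations/2019/06/01/12.ndjson",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.duration_ms > 1000",
    InputSerialization={"JSON": {"Type": "LINES"}},
    OutputSerialization={"JSON": {}},
)

# The response is an event stream; Records events carry the matches.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```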

Extracting value from Whole Event Observability

Knowing your “unknown unknowns” is easiest when you do not need to sample and throw away data. Having metadata on every HTTP request processed (or rejected) by your application, plus the context of that request within your application as marked by structured logs, is far superior to not having that data because it was lost to sampling or aggregation.

Tracing provides architectural context. We automatically collect metadata about input events (parameters) to AWS Lambda functions, such as HTTP methods and resources, or, when triggered by S3, bucket and object information. We do this for dozens of event sources on AWS. Tracing is useful and important, but on its own it tells an incomplete story focused primarily on architecture. For the most value, it needs to be tied to application-specific context.
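
A simplified sketch of that idea, detecting two common trigger shapes from a Lambda input event (real agents, IOpipe's included, cover many more sources and fields):

```python
def extract_trigger_metadata(event):
    """Best-effort classification of a Lambda input event's source."""
    if "httpMethod" in event:  # API Gateway proxy event
        return {
            "type": "apigateway",
            "method": event["httpMethod"],
            "resource": event.get("resource"),
        }
    records = event.get("Records", [])
    if records and "s3" in records[0]:  # S3 notification event
        return {
            "type": "s3",
            "bucket": records[0]["s3"]["bucket"]["name"],
            "key": records[0]["s3"]["object"]["key"],
        }
    return {"type": "unknown"}
```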

Structured logs provide application-specific context. This is where Whole Event Observability really shines. We highly recommend using structured logs (or Custom Metrics) to connect events to order numbers, customers, or other application-specific context.
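
A minimal sketch of emitting such a structured log line from a handler (the field names are just examples; use whatever identifiers matter in your domain):

```python
import json
import sys

def log_event(message, **context):
    """Emit one structured log line carrying application-specific context."""
    sys.stdout.write(json.dumps({"message": message, **context}) + "\n")

# Inside a Lambda handler, tie this invocation to business context:
log_event("order processed", order_id="A-10293", customer_id="cust-4471")
```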

Solving problems with high-cardinality data

We will follow up with other content and specific examples, but users of IOpipe have been impressed by their ability to solve problems by correlating traces, traditional logs, and structured logs through a tool that collects data on every HTTP request and serverless event processed by their application.

Until then, check out our Webinar Series and Case Studies, where we provide real-world examples that go deeper into how high-cardinality, whole-event observability has transformed the developer experience.