Imagine if alerting was what you wanted it to be:

Every alert you received was actionable, and there were few false alerts

Notifications were actually informative

You received alerts in time to fix problems before they impacted your users

This isn’t the world we live in…

We accept lots of notifications from our alerting system that are not actionable

The notifications don’t tell us about the problem

We get paged when stuff is dead and not when it is sick

In order to resolve the dissonance between reality and what alerting should be, we need:

An expressive way to evaluate alert conditions that isn’t a 1:1 mapping to the metrics

Alerts backed by time-series and not just recent values

A way to make rich notifications that include useful information

A way to iterate fast with alert design so that our alerts are continuously improved

A little less than a year ago, Matt Jibson and Kyle Brandt set out to create a system to solve this and other problems in monitoring; we call it Bosun. Our belief is that achieving excellence in alerting is a complex problem and requires a powerful and flexible platform to design alerts. Therefore, Bosun’s strategy is to provide a framework that enables the operator to create intelligent and informative alerts. We believe that you are smarter and more creative than any monitoring system can be when it comes to your environment.

In order to achieve that, at the highest level Bosun provides:

The Expression Language

We believe that every alert requires action. An alert asks for your attention, and human attention and time are valuable assets. So alerting is about owning the operator’s attention. Taking action on an alert practically means one of two things. If the alert was accurate, then you fix the issue that triggered it. If the alert was a false positive, then the alert should be tuned so that the false positive won’t trigger it again. This is where things tend to fall down, because alert evaluations are not powerful enough to be tuned. With Bosun’s expression language, you can tune alerts in the following ways:

Alert thresholds based on history vs static thresholds (or both combined)

Statistics functions: Min, Percentile, Median, Deviations, Forecasting. You can change the duration that these evaluate over (e.g. 5 minutes, 1 hour, 1 week)

Scope-aware: How should components in your environment be grouped? By host, subsystem, cluster, or a combination of those things

Boolean conditions: The interaction of multiple components

These possibilities, when applied selectively by a skilled operator, provide ample ways to reduce alerting noise.
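To make the idea of combining history-based and static thresholds concrete, here is a minimal sketch in Go. It is not Bosun's expression language or its implementation; the function names (`median`, `stdDev`, `shouldAlert`) and the sample numbers are assumptions made up for illustration. The rule fires only when the recent median is above a static floor and also above the historical median plus `k` standard deviations.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// median returns the median of xs without modifying the input slice.
func median(xs []float64) float64 {
	n := len(xs)
	if n == 0 {
		return math.NaN()
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	if n%2 == 1 {
		return s[n/2]
	}
	return (s[n/2-1] + s[n/2]) / 2
}

// stdDev returns the population standard deviation of xs.
func stdDev(xs []float64) float64 {
	if len(xs) == 0 {
		return math.NaN()
	}
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var sq float64
	for _, x := range xs {
		sq += (x - mean) * (x - mean)
	}
	return math.Sqrt(sq / float64(len(xs)))
}

// shouldAlert is a hypothetical rule: a boolean combination of a static
// threshold and a threshold derived from history.
func shouldAlert(history, recent []float64, staticFloor, k float64) bool {
	dynamic := median(history) + k*stdDev(history)
	cur := median(recent)
	return cur > staticFloor && cur > dynamic
}

func main() {
	history := []float64{40, 42, 41, 39, 43, 40, 41} // made-up CPU % from a past period
	recent := []float64{88, 91, 90}                  // made-up CPU % from the last few minutes
	fmt.Println(shouldAlert(history, recent, 80, 3))
}
```

Tuning then becomes a matter of changing the window lengths, the statistic, or `k`, rather than accepting a single fixed threshold.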

Notification Templates

Once you have someone’s attention with a valid alert, you need to direct them to the problem as accurately as possible. Our notification templates use the Go template language, which means they can be quite flexible. Notifications in Bosun allow you to:

Include breakdowns of information related to your alert as embedded graphs, html tables, or whatever else you think makes sense

Include information that wasn’t directly related to the alert: e.g. CPU of a host even though it was a memory alert

Generate links to your dashboards or other sources of information

Include notes about why you created that alert, caveats, and other information the person being notified should be aware of
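As a rough illustration of what the Go template language makes possible, the sketch below renders a plain-text notification with the standard library's `text/template` package. The `Alert` struct, its fields, and the dashboard URL are hypothetical and are not Bosun's own template context; only the template syntax itself is standard Go.

```go
package main

import (
	"fmt"
	"os"
	"strings"
	"text/template"
)

// Alert is a hypothetical data shape for this sketch, not a Bosun type.
type Alert struct {
	Name      string
	Host      string
	Value     float64
	Related   map[string]float64 // metrics not directly tied to the alert
	Dashboard string             // link to further information
	Notes     string             // why the alert exists, caveats
}

// body is the notification template. Ranging over a map in a Go template
// visits keys in sorted order, so the output is stable.
const body = `{{.Name}} on {{.Host}}: {{printf "%.1f" .Value}}
Related metrics:
{{range $k, $v := .Related}}  {{$k}} = {{printf "%.1f" $v}}
{{end}}Dashboard: {{.Dashboard}}
Notes: {{.Notes}}
`

// render executes the template against a and returns the resulting text.
func render(a Alert) (string, error) {
	t, err := template.New("notification").Parse(body)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, a); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := render(Alert{
		Name:      "High memory",
		Host:      "web01",
		Value:     93.5,
		Related:   map[string]float64{"cpu.percent": 12, "swap.percent": 40.3},
		Dashboard: "https://dashboards.example.com/web01", // placeholder URL
		Notes:     "Memory above 90% has preceded crashes on this tier.",
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

The same approach extends to HTML output (tables, embedded images) by switching to `html/template`.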

The Workflow

One of the main issues with alerting is that there is so much friction to tuning alerts that it doesn’t get done. One of Bosun’s goals was to provide a faster iteration cycle for creating and tuning alerts by making the web interface an alerting IDE: Graphs in Bosun’s interface link to expressions, which then link to alert rules and templates. You can then test alerts before implementing; the results of a rule and template can be tested in the interface. You can test how they will behave currently, how they might have behaved at a past time, or generate a timeline of how they might have behaved over a range of time.

This means that your alert tuning doesn’t need to be totally reactionary. You can test alert changes and see how and when they would have triggered over the past weeks (or longer, if you are patient). This results in less alert noise being sent to operators.

But wait! There’s more!

Bosun has also attempted to make some problems in monitoring easier:

Getting data into the system: our agent (called “scollector”) runs on Windows and Linux and starts sending data to Bosun

Applications can push metrics to the system via JSON API calls

Human maintenance: Properly designed alerts will apply to new systems, and services are auto-discovered by scollector. This means that most of the time you don’t have to remember to update your monitoring when new services and hosts are deployed (as long as scollector is pushed out via your build or configuration management process)
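For the JSON push mentioned above, an application mostly needs to serialize its data points. The Go sketch below shows one way to build such a payload. The `DataPoint` shape (metric, timestamp, value, tags) is an assumption modeled on OpenTSDB-style data points, and the metric and tag names are made up; check the Bosun documentation for the exact endpoint and field names before relying on it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// DataPoint is an assumed, OpenTSDB-style shape for one sample.
// It is not taken from Bosun's source.
type DataPoint struct {
	Metric    string            `json:"metric"`
	Timestamp int64             `json:"timestamp"` // Unix seconds
	Value     float64           `json:"value"`
	Tags      map[string]string `json:"tags"`
}

// encodeBatch serializes a batch of data points as a JSON array,
// ready to be sent as the body of an HTTP POST.
func encodeBatch(points []DataPoint) ([]byte, error) {
	return json.Marshal(points)
}

func main() {
	body, err := encodeBatch([]DataPoint{{
		Metric:    "myapp.requests", // hypothetical metric name
		Timestamp: 1400000000,
		Value:     42,
		Tags:      map[string]string{"host": "web01"},
	}})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(string(body))
}
```

Batching several points into one request keeps the overhead per metric low.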

We hope you go try this out. We have a docker image that has everything you need—just follow the getting started guide. We hope Bosun is useful to the community. We need your creativity and ideas to continue to grow it (and some contributors would be nice too!). We owe a special thanks to everyone else at Stack Exchange for: