This report describes my experience designing, implementing, and deploying a Go API for instrumenting programs with metrics for production monitoring. The C++ and Java APIs used templates and generics; since Go lacks those features, we developed an API that requires some runtime checks. Those runtime checks eventually caused failures in production, requiring us to replace them with logging and monitoring instead. Those failures could have been avoided altogether with support for generics in Go.

What we wanted to do

One of the first APIs I developed for Go inside Google was a client for Monarch, Google’s monitoring system. Monarch collects metrics from Google’s production jobs and provides infrastructure for querying, plotting, and alerting based on these metrics and derived values.

For example, an HTTP server might export a metric that counts the number of requests it has served. The metric may include fields, like the HTTP method string and response code, in which case the metric has a separate count for each combination of field settings: for example, separate counts for (“GET”, 404), (“GET”, 200), and (“PUT”, 200). Monarch collects this metric from all the HTTP servers and provides ways to calculate derived values, like the error rate, and to escalate to the responsible teams if this rate gets too high.

Instrumenting the HTTP server requires adding code to define the metric and update it as requests are served. The library handles aggregating the metric in memory and transmitting it to Monarch. Each language used at Google needs this library, so I implemented the one for Go. Monarch already had C++ and Java APIs, so I followed their example.

A metric in Monarch has a value and zero or more fields. The supported value types are boolean, integer, float, string, and distribution (a histogram of floating-point samples). The supported field types are boolean, integer, and string.

In C++, the API for defining metrics uses templates to create an object that has type-safe methods to update the value associated with a given setting of fields.

A Counter is an integer-valued metric. The C++ Counter template is variadic in the field types:

template <typename... Fields>
class Counter {

This C++ code defines a metric to count HTTP requests, as described above:

auto* http_requests = Counter<string, int64>::New(
    "/myserver/http/requests",
    "method", "code",
    STREAMZ_METADATA("HTTP request count for MyServer, broken out by HTTP method and response code."));

Code that updates the metric specifies the field settings. The counter created above has two fields, of type string and int64, so the Increment method’s arguments are string and int64:

http_requests->Increment("GET", 404);

The Java API uses generics, but since Java does not have variadic templates, the user must specify the number of fields in the type, in this case Counter2:

static Counter2<String, Long> httpRequests = MetricFactory.getDefault().newCounter(
    "/myserver/http/requests",
    new Metadata("HTTP request count for MyServer, broken out by HTTP method and response code."),
    Field.ofString("method"),
    Field.ofInteger("code"));

The increment method’s arguments are String and Long:

httpRequests.increment("GET", 404L);

Go does not have generics, so we needed a different approach.

What we actually did

The Go metrics API stays close to the C++ and Java APIs, to ease the transition to Go for Googlers already using those languages. The metric.NewCounter function accepts ...field.Field to specify the field types of the metric:

httpRequests := metric.NewCounter(
    "/myserver/http/requests",
    "HTTP request count for MyServer, broken out by HTTP method and response code.",
    nil, // optional metadata
    field.String("method"),
    field.Int("code"))

Go provides an Add method instead of Increment, but otherwise the call looks similar:

httpRequests.Add(1, "GET", 404)

The signature of this call is Add(n int64, fieldVals ...interface{}). Unlike the C++ and Java APIs, this Go API does not ensure at build time that the fields have the correct types; they are checked at run time instead. To compensate for this lack of type safety, we encourage Go users to define closures that enforce the right types:

addHTTPReq := func(method string, code int) {
    httpRequests.Add(1, method, code)
}
addHTTPReq("GET", 404)
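To make the run-time check concrete, here is a minimal sketch of how a variadic Add might validate its arguments. The names fieldKind, counter, and newCounter are hypothetical stand-ins, not the Monarch library's types, and this Add returns an error rather than reproducing the library's actual failure behavior.

```go
package main

import "fmt"

// fieldKind is a hypothetical enumeration of the supported field types.
type fieldKind int

const (
	kindBool fieldKind = iota
	kindInt
	kindString
)

// counter is a hypothetical stand-in for the library's counter type.
type counter struct {
	kinds  []fieldKind      // declared field types, in order
	counts map[string]int64 // one cell per combination of field values
}

func newCounter(kinds ...fieldKind) *counter {
	return &counter{kinds: kinds, counts: make(map[string]int64)}
}

// Add checks the number and dynamic types of fieldVals at run time,
// then updates the matching cell. Nothing here is checked by the compiler.
func (c *counter) Add(n int64, fieldVals ...interface{}) error {
	if len(fieldVals) != len(c.kinds) {
		return fmt.Errorf("got %d field values, want %d", len(fieldVals), len(c.kinds))
	}
	for i, v := range fieldVals {
		var ok bool
		switch c.kinds[i] {
		case kindBool:
			_, ok = v.(bool)
		case kindInt:
			_, ok = v.(int)
		case kindString:
			_, ok = v.(string)
		}
		if !ok {
			return fmt.Errorf("field %d: unexpected type %T", i, v)
		}
	}
	c.counts[fmt.Sprintf("%#v", fieldVals)] += n
	return nil
}

func main() {
	httpRequests := newCounter(kindString, kindInt)

	// The typed closure lets the compiler check this metric's call sites.
	addHTTPReq := func(method string, code int) {
		if err := httpRequests.Add(1, method, code); err != nil {
			panic(err)
		}
	}
	addHTTPReq("GET", 404)

	// A direct call with swapped arguments compiles but fails the run-time check.
	fmt.Println(httpRequests.Add(1, 404, "GET"))
}
```

The closure only protects callers that go through it; any code that calls Add directly still depends on the run-time check.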

Next, we’ll consider some alternative APIs, then we’ll discuss the consequences of this API design.

Alternative: Generated code

One alternative would be to generate type-safe stubs for all likely combinations of field and value types. As we saw earlier, Java defines separate generic classes Counter0, Counter1, Counter2, etc. for counters with zero, one, two, or more fields, respectively. The current Java code has generated support for metrics with up to 10 fields.

Let’s calculate the number of types we’d have to generate for all combinations of field and value types. There are 3 supported field types and 5 supported value types, so we need 5 types for zero fields, 3¹ * 5 = 15 types for one field, 3² * 5 = 45 types for two fields, and so on up to 3¹⁰ * 5 = 295,245 types for 10 fields. The 10-field case alone accounts for ~300K types of metrics; summing over all field counts from 0 to 10 gives 442,865.

A counter is a convenience form for a cumulative, integer-valued metric. We’d need to generate types for all its variants, too: another 3¹⁰ ≈ 60K for the 10-field case alone.

In addition to metrics that store their own value (as I’ve described so far), there are also callback metrics, which read from other state already stored in the program. This doubles the number of types we need to over 700K.
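The arithmetic above can be checked with a short program. This is a sketch under the stated assumptions (3 field types, 5 value types, up to 10 fields); typeCount is a hypothetical helper. The rounded figures in the text correspond to the 10-field term, so the exact sums over every field count come out larger, which only strengthens the argument.

```go
package main

import "fmt"

// typeCount returns how many generated types are needed to cover metrics
// with 0..maxFields fields, given the number of field and value types.
func typeCount(fieldTypes, valueTypes, maxFields int) int {
	total := 0
	combos := 1 // fieldTypes^k for the current field count k
	for k := 0; k <= maxFields; k++ {
		total += combos * valueTypes
		combos *= fieldTypes
	}
	return total
}

func main() {
	metrics := typeCount(3, 5, 10)  // every value type
	counters := typeCount(3, 1, 10) // the counter convenience form
	// Callback metrics double the total.
	fmt.Println(metrics, counters, 2*(metrics+counters))
}
```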

Only a tiny fraction of these types would actually be used in a program; generating them all would waste build time and generate names like NewStringIntBoolIntCounter (or NewSIBICounter, I suppose). Instead we’d want a system that generates (or instantiates) only the types that are used in the program.

Alternative: Struct key

Another variant of our Go API is to replace the ...interface{} for fields with a single interface{}, and have users define a key struct for each metric. The key struct would contain a member for each metric field. For example:

type httpReqKey struct {
    Method string
    Code   int
}
…
httpRequests.Add(1, httpReqKey{"GET", 404})

This provides a little more type safety than our current API, but it’s more verbose at the call site, it requires the user to define a new key type for each metric, and it’s still possible to write Add calls that provide an argument of the wrong overall type (using the wrong struct). While we liked this API, we decided that the small added type safety does not outweigh the added burden on the user.
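A minimal sketch of the struct-key variant shows where the remaining hole is. The counter type, newCounter, and rpcKey below are hypothetical; the sketch assumes the library would remember the key type at construction and compare it with reflection on each Add.

```go
package main

import (
	"fmt"
	"reflect"
)

type httpReqKey struct {
	Method string
	Code   int
}

// rpcKey is a hypothetical key type belonging to some other metric.
type rpcKey struct {
	Service string
}

// counter is a hypothetical struct-keyed counter.
type counter struct {
	keyType reflect.Type
	cells   map[interface{}]int64
}

// newCounter records the key type from an example value.
func newCounter(keyExample interface{}) *counter {
	return &counter{
		keyType: reflect.TypeOf(keyExample),
		cells:   make(map[interface{}]int64),
	}
}

// Add accepts any value as the key, so the key's type is still checked at run time.
func (c *counter) Add(n int64, key interface{}) error {
	if reflect.TypeOf(key) != c.keyType {
		return fmt.Errorf("got key of type %T, want %v", key, c.keyType)
	}
	c.cells[key] += n
	return nil
}

func main() {
	httpRequests := newCounter(httpReqKey{})

	// Within the struct literal, the compiler checks member types and count.
	fmt.Println(httpRequests.Add(1, httpReqKey{"GET", 404}))

	// Passing the wrong struct still compiles and is caught only at run time.
	fmt.Println(httpRequests.Add(1, rpcKey{"Search"}))
}
```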

Alternative: Labels (Prometheus)

The Go client API for Prometheus is similar to the one I developed for Monarch. But where Monarch metrics support typed fields, Prometheus metrics use string Labels, which can be specified as a list of settings (here called “label values”) or via a map:

httpRequests := prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "How many HTTP requests processed, partitioned by status code and HTTP method.",
    },
    []string{"code", "method"},
)
prometheus.MustRegister(httpRequests)
…
httpRequests.WithLabelValues("404", "POST").Add(1)
httpRequests.With(prometheus.Labels{"code": "404", "method": "GET"}).Add(1)

The Labels API only accepts fields of type string, so it cannot have the same kinds of type errors as the Monarch API. However, this API still has potential ordering problems that richer types made impossible: both WithLabelValues(“404”, “POST”) and WithLabelValues(“POST”, “404”) compile, but only one is correct. The Labels map form eliminates the ordering problem but introduces the potential for having incorrect map keys. The Labels map form also incurs the overhead of creating a map just to specify a pair of strings in order to increment a value. As with the Monarch API, the Labels API still cannot guarantee at compile time that each Add call has the correct kinds of arguments.
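The ordering and map-key problems can be reproduced without any external package. The sketch below is not the Prometheus client; labelCounter, addValues, and addLabels are hypothetical names that imitate the positional form and the map form of a string-label API.

```go
package main

import (
	"fmt"
	"strings"
)

// labelCounter is a hypothetical counter with string-only labels.
type labelCounter struct {
	names []string
	cells map[string]int64
}

func newLabelCounter(names ...string) *labelCounter {
	return &labelCounter{names: names, cells: make(map[string]int64)}
}

// addValues takes label values by position. Only the count can be checked;
// a swapped order is indistinguishable from a valid call.
func (c *labelCounter) addValues(n int64, vals ...string) error {
	if len(vals) != len(c.names) {
		return fmt.Errorf("got %d label values, want %d", len(vals), len(c.names))
	}
	c.cells[strings.Join(vals, "\x00")] += n
	return nil
}

// addLabels takes label values by name. Order no longer matters,
// but a misspelled key is found only at run time.
func (c *labelCounter) addLabels(n int64, labels map[string]string) error {
	if len(labels) != len(c.names) {
		return fmt.Errorf("got %d labels, want %d", len(labels), len(c.names))
	}
	vals := make([]string, len(c.names))
	for i, name := range c.names {
		v, ok := labels[name]
		if !ok {
			return fmt.Errorf("missing label %q", name)
		}
		vals[i] = v
	}
	return c.addValues(n, vals...)
}

func main() {
	httpRequests := newLabelCounter("code", "method")
	fmt.Println(httpRequests.addValues(1, "404", "POST")) // intended order
	fmt.Println(httpRequests.addValues(1, "POST", "404")) // swapped, still accepted
	fmt.Println(httpRequests.addLabels(1, map[string]string{"code": "404", "methd": "GET"}))
}
```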

What didn’t work well

There are several things that can go wrong in this call to a Go metric:

httpRequests.Add(1, method, code)

1. method might not be a string
2. code might not be an integer
3. There might be too few fields
4. There might be too many fields
5. The fields might be in the wrong order

The compiler will catch none of these errors. Instead, the library does these checks during the Add call and crashes the program if they fail.

This worked fine for three years. We launched the Go Monarch API inside Google in 2012, and it wasn’t until 2015 that the first production failure was attributed to this run-time checking. Nonetheless, this outage was serious enough to warrant revisiting our decision.

We replaced all the log.Fatal calls in the library with log.Print, so the library now crashes only if the program is a test binary. If the program is running in production, we instead increment a metric that counts these errors, so that users can monitor for them.
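The change can be sketched roughly as follows. Everything here is an assumption made for illustration: the strict flag stands in for however the library detects a test binary, fieldErrors stands in for the exported error-count metric, and dropping the bad update is one plausible policy that the text does not confirm.

```go
package main

import (
	"fmt"
	"log"
	"sync/atomic"
)

// fieldErrors is a hypothetical stand-in for the metric that counts bad Add calls.
var fieldErrors int64

// strict selects crash-on-error behavior, as intended for test binaries.
// How a test binary is detected is not described in the text.
var strict = false

// reportFieldError crashes in strict mode; otherwise it logs and counts the error.
func reportFieldError(err error) {
	if strict {
		log.Fatalf("metric: %v", err)
	}
	atomic.AddInt64(&fieldErrors, 1)
	log.Printf("metric: %v", err)
}

// checkFields validates the (string, int) fields of the example metric.
func checkFields(fieldVals []interface{}) error {
	if len(fieldVals) != 2 {
		return fmt.Errorf("got %d field values, want 2", len(fieldVals))
	}
	if _, ok := fieldVals[0].(string); !ok {
		return fmt.Errorf("field 0: unexpected type %T", fieldVals[0])
	}
	if _, ok := fieldVals[1].(int); !ok {
		return fmt.Errorf("field 1: unexpected type %T", fieldVals[1])
	}
	return nil
}

// counter is a hypothetical counter that tracks only a total.
type counter struct {
	total int64
}

// Add reports invalid calls instead of crashing. Dropping the update is an assumption.
func (c *counter) Add(n int64, fieldVals ...interface{}) {
	if err := checkFields(fieldVals); err != nil {
		reportFieldError(err)
		return
	}
	c.total += n
}

func main() {
	var httpRequests counter
	httpRequests.Add(1, "GET", 404)
	httpRequests.Add(1, 404, "GET") // logged and counted, not fatal
	fmt.Println(httpRequests.total, atomic.LoadInt64(&fieldErrors))
}
```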

None of this would be needed if Go had a richer type system, as in most implementations of generics. As we saw with the C++ and Java APIs, errors #1–4 above can be caught at build time, and #5 is caught when the fields have different types.
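For illustration, here is how a Java-style Counter2 might look with type parameters, which were added to the language in Go 1.18, after this report was written. The names Counter2, NewCounter2, key2, and Get are hypothetical, and using comparable as the constraint is a simplification of the three supported field types.

```go
package main

import "fmt"

// key2 is the map key for one combination of two field values.
type key2[F1, F2 comparable] struct {
	f1 F1
	f2 F2
}

// Counter2 is a hypothetical two-field counter, following the Java naming.
type Counter2[F1, F2 comparable] struct {
	name  string
	cells map[key2[F1, F2]]int64
}

func NewCounter2[F1, F2 comparable](name string) *Counter2[F1, F2] {
	return &Counter2[F1, F2]{name: name, cells: make(map[key2[F1, F2]]int64)}
}

// Add has exactly one parameter per field, each with the declared type.
func (c *Counter2[F1, F2]) Add(n int64, f1 F1, f2 F2) {
	c.cells[key2[F1, F2]{f1, f2}] += n
}

// Get returns the current count for one combination of field values.
func (c *Counter2[F1, F2]) Get(f1 F1, f2 F2) int64 {
	return c.cells[key2[F1, F2]{f1, f2}]
}

func main() {
	httpRequests := NewCounter2[string, int]("/myserver/http/requests")
	httpRequests.Add(1, "GET", 404)

	// Each of these would be rejected by the compiler:
	// httpRequests.Add(1, 404, "GET")       // wrong types, wrong order
	// httpRequests.Add(1, "GET")            // too few fields
	// httpRequests.Add(1, "GET", 404, true) // too many fields

	fmt.Println(httpRequests.Get("GET", 404))
}
```

As with the C++ and Java APIs, a swapped order is still accepted when both fields have the same type, for example in a Counter2[string, string].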

While I’ve rarely needed generics in Go, I needed them here, and their absence made the code less scalable (since errors in these calls are more likely in a large codebase that changes frequently) and made Go production systems less reliable.