Who am I to tell you the future of the Prometheus exposition format? Nobody!

I was at PromCon in Munich in August 2018, and I found the conference great! A lot of use cases about metrics, monitoring, and Prometheus itself. I work at InfluxData and we were there as a sponsor, but I followed a lot of talks and I had the chance to attend the developer summit the next day with a lot of Prometheus maintainers. Really good conversations!

To be honest, my scope a few years ago was very different: I was working in PHP, writing a web application that, yes, I was deploying, but I wasn't digging too much into operations, and I was not smart enough to understand that the whole pull vs. push debate was just garbage. Smoke in the eyes that I luckily left behind pretty soon, because I had the chance to meet smart people who drove me out of it.

Providing a comfortable way for me to expose and store metrics is a vital requirement, and the library needs to expose the RIGHT data; it doesn't matter whether it is pushed or pulled.

RIGHT means the best I can get to have more observability from an ops point of view, but also from a business-intelligence perspective, probably just by manipulating the same data again.

It is safe to say that a pull-based exposition format is easy to put together, because it works even if the server that should scrape the exposed endpoint is unavailable, or even if nothing scrapes it at all. A push-based service will always create some network noise, even if nobody is interested in getting the metrics.

Back in the day we had SNMP, but other than being an Internet standard, its adoption is not comparable with Prometheus's; if we consider how old it is and how fast Prometheus grew, the situation gets even worse.

.1.0.0.0.1.1.0 octet_str "foo"
.1.0.0.0.1.1.1 octet_str "bar"
.1.0.0.0.1.102 octet_str "bad"
.1.0.0.0.1.2.0 integer 1
.1.0.0.0.1.2.1 integer 2
.1.0.0.0.1.3.0 octet_str "0.123"
.1.0.0.0.1.3.1 octet_str "0.456"
.1.0.0.0.1.3.2 octet_str "9.999"
.1.0.0.1.1 octet_str "baz"
.1.0.0.1.2 uinteger 54321
.1.0.0.1.3 uinteger 234

It also started as a format for exposing network devices, so it doesn't express other kinds of metrics really well.

The Prometheus exposition format is extremely valuable. I recently instrumented a legacy application using the Prometheus SDK, and my code looks a lot cleaner and more readable.

In the beginning I was using logs as a transport layer for my metrics and time series, but I ended up with a lot of spam in the logs themselves, because I was also streaming a lot of "not logs but metrics" garbage.

The link to the Prometheus doc above is the best place to start; here I am just copy-pasting something from there:

# HELP http_requests_total The total number of HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="post",code="200"} 1027 1395066363000
http_requests_total{method="post",code="400"}    3 1395066363000

# Escaping in label values:
msdos_file_access_time_seconds{path="C:\\DIR\\FILE.TXT",error="Cannot find file:\n\"FILE.TXT\""} 1.458255915e9

# Minimalistic line:
metric_without_timestamp_and_labels 12.47

# A weird metric from before the epoch:
something_weird{problem="division by zero"} +Inf -3982045

# A histogram, which has a pretty complex representation in the text format:
# HELP http_request_duration_seconds A histogram of the request duration.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 24054
http_request_duration_seconds_bucket{le="0.1"} 33444
http_request_duration_seconds_bucket{le="0.2"} 100392
http_request_duration_seconds_bucket{le="0.5"} 129389
http_request_duration_seconds_bucket{le="1"} 133988
http_request_duration_seconds_bucket{le="+Inf"} 144320
http_request_duration_seconds_sum 53423
http_request_duration_seconds_count 144320

# Finally a summary, which has a complex representation, too:
# HELP rpc_duration_seconds A summary of the RPC duration in seconds.
# TYPE rpc_duration_seconds summary
rpc_duration_seconds{quantile="0.01"} 3102
rpc_duration_seconds{quantile="0.05"} 3272
rpc_duration_seconds{quantile="0.5"} 4773
rpc_duration_seconds{quantile="0.9"} 9001
rpc_duration_seconds{quantile="0.99"} 76656
rpc_duration_seconds_sum 1.7560473e+07
rpc_duration_seconds_count 2693
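The grammar is simple enough that rendering a sample is almost a one-liner. A little sketch of the `name{labels} value` shape (the function name and metric are invented for illustration; real client libraries also handle escaping and timestamps):

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderCounter formats one sample in the text exposition format:
// name{label="value",...} value. Label pairs are sorted so the
// output is stable.
func renderCounter(name string, labels map[string]string, value float64) string {
	pairs := make([]string, 0, len(labels))
	for k, v := range labels {
		pairs = append(pairs, fmt.Sprintf("%s=%q", k, v))
	}
	sort.Strings(pairs)
	return fmt.Sprintf("%s{%s} %g", name, strings.Join(pairs, ","), value)
}

func main() {
	fmt.Println(renderCounter("http_requests_total",
		map[string]string{"method": "post", "code": "200"}, 1027))
	// → http_requests_total{code="200",method="post"} 1027
}
```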

Think about it not as the Prometheus way to grab metrics, but as the language your application uses to tell the outside world how it feels.

It is just a plain-text endpoint over HTTP that everyone can parse and reuse.
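And parsing it really is approachable. A toy sketch that pulls the metric name and value out of one line using only the standard library (illustrative only: real parsers also handle escaping inside label values, timestamps, and the HELP/TYPE metadata):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLine extracts the metric name and value from one sample line
// of the text exposition format, skipping comments and blank lines.
// It deliberately ignores labels and timestamps, and would break on
// label values containing spaces — a toy, not a real parser.
func parseLine(line string) (name string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false
	}
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return "", 0, false
	}
	name = fields[0]
	if i := strings.Index(name, "{"); i >= 0 {
		name = name[:i] // strip the label set
	}
	v, err := strconv.ParseFloat(fields[1], 64)
	if err != nil {
		return "", 0, false
	}
	return name, v, true
}

func main() {
	name, v, _ := parseLine(`http_requests_total{method="post",code="200"} 1027 1395066363000`)
	fmt.Println(name, v)
	// → http_requests_total 1027
}
```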

For example, Kapacitor and Telegraf have specific parsers to extract metrics from that URL.

If you don't have time to write a parser yourself, you can use prom2json to get a JSON version of it.

In Go you can dig a bit deeper inside that code and reuse some of its functions, for example:

// FetchMetricFamilies retrieves metrics from the provided URL, decodes them
// into MetricFamily proto messages, and sends them to the provided channel. It
// returns after all MetricFamilies have been sent.
func FetchMetricFamilies(
	url string, ch chan<- *dto.MetricFamily,
	certificate string, key string,
	skipServerCertCheck bool,
) error {
	defer close(ch)
	var transport *http.Transport
	if certificate != "" && key != "" {
		cert, err := tls.LoadX509KeyPair(certificate, key)
		if err != nil {
			return err
		}
		tlsConfig := &tls.Config{
			Certificates:       []tls.Certificate{cert},
			InsecureSkipVerify: skipServerCertCheck,
		}
		tlsConfig.BuildNameToCertificate()
		transport = &http.Transport{TLSClientConfig: tlsConfig}
	} else {
		transport = &http.Transport{
			TLSClientConfig: &tls.Config{InsecureSkipVerify: skipServerCertCheck},
		}
	}
	client := &http.Client{Transport: transport}
	return decodeContent(client, url, ch)
}

FetchMetricFamilies can be used to get a channel with all the fetched metrics. Once you have the channel, you can do whatever you like with it:

mfChan := make(chan *dto.MetricFamily, 1024)
go func() {
	err := prom2json.FetchMetricFamilies(flag.Args()[0], mfChan, *cert, *key, *skipServerCertCheck)
	if err != nil {
		log.Fatal(err)
	}
}()
result := []*prom2json.Family{}
for mf := range mfChan {
	result = append(result, prom2json.NewFamily(mf))
}

As you can see, prom2json converts the result to JSON.

It is pretty flexible! And it is a common API to read application status. A common API that we all know means automation! Dope automation!

Future

The Prometheus exposition format has grown in adoption across the board, and a group of people led by Richard Hartmann is now pushing to make this format a new Internet Standard!

The project is called OpenMetrics, and it is a Sandbox project under the CNCF.

If you are looking to follow the project, here is the official repository on GitHub.

It probably looks like just a political step with no technical value at all, but I bet that when it becomes a standard, and not just "the Prometheus exposition format", we will start to see routers exposing stats over http://192.168.1.1/metrics, and it will be a lot of fun!