
When I first started using Prometheus, I found the documentation very useful when it came to describing the features of the library. I was, however, a little lost when it came to the bigger picture. What should be monitored? What is the best method to monitor specific cases? Back then, I would have really appreciated a common set of metrics for the typical systems most of us develop on a daily basis.

In this post I would like to present how to create such a monitoring solution.

What to monitor?

The very first thing to do when building a monitoring system is to decide what should actually be monitored.

Over time and a few projects, I have learned that the items worth monitoring are:

Inputs to the system

Outputs from the system

Important/time-consuming operations within the system

Resources

The most common inputs to a system are:

HTTP endpoints

Message broker consumers (e.g. Kafka Consumer)

Reading from databases/files/services

The most common outputs from the system are:

Saving to databases/files

Calling external services

Producing to a message broker

Examples of important/time-consuming operations within a system are:

One cycle of batch processing (e.g. once a day you check all users' transactions and generate a report)

Creating and signing a bitcoin transaction

So what I usually do is find all the inputs and outputs of my system, as well as all the long-running operations within it.

Subject of monitoring

With this brief introduction, let's take a look at how to create a monitoring system for a social media app. The app has the following components:

HTTP endpoints POST /users and GET /users/{id} . Used for registering a new user and getting their summary.

PostgreSQL users table. Used for saving and reading user data.

Auth service. Called when any endpoint is triggered to validate permissions.

Kafka topic posts . Used for displaying posts created by friends a user is connected with.

Time-consuming process of generating a video for friends who have known each other for a couple of years. Triggered every night.

Incoming HTTP requests

There are many libraries for integrating HTTP server-side frameworks with Prometheus. At SoftwareMill we mainly use Akka HTTP, so we go with Prometheus Akka HTTP.

Such libraries usually expose a Prometheus histogram and allow you to add labels with the names of the endpoints.

pathPrefix("users") {
  recordResponseTime("post_user_endpoint_label") {
    (post & pathEndOrSingleSlash) { /* ... */ }
  } ~
  recordResponseTime("get_user_endpoint_label") {
    (get & path(JavaUUID)) { /* ... */ }
  }
}

A Prometheus histogram exposes two metrics: the count of observations and the sum of observed durations.

Having such data, we can plot requests per second and the average request duration.

Requests per second (all endpoints combined — all labels are aggregated with sum):

sum(rate(http_request_duration_count[1m]))

Average requests duration (all endpoints combined — all labels are aggregated with avg):

avg(rate(http_request_duration_sum[1m])/rate(http_request_duration_count[1m]))

Requests per second (a separate series for each endpoint label):

rate(http_request_duration_count[1m])

Average request duration (a separate series for each endpoint label):

rate(http_request_duration_sum[1m])/rate(http_request_duration_count[1m])

Database

There are Prometheus metrics exporters for databases. Often, however, things can get stuck on the client side, so I find measuring queries from the client very handy.

There are many different databases and libraries used to interact with them. Most of them provide some kind of generic callback that gives you information about queries and their duration. You can start a timer there and observe the duration. I also recommend adding a label to this metric with the query statement itself. This will allow you to find out which kinds of queries tend to be the most time-consuming. If you use this approach, remember to strip the query parameters so that you end up with one label value per query type, instead of millions of useless ones.

In our case we use PostgreSQL and interact with it using ScalikeJDBC. This is how you can hook it up with that library:

val sqlHistogram = Histogram
  .build()
  .name("sql_duration")
  .help("Sql statement duration in seconds.")
  .labelNames("statement")
  .register()

GlobalSettings.queryCompletionListener = (sql: String, params: Seq[Any], millis: Long) => {
  val sqlWithoutParamList = sql
    .replaceAll(", \\?", "")
    .replaceAll("JOIN \\(VALUES .* vals", "JOIN VALUES (?) vals")
  sqlHistogram
    .labels(sqlWithoutParamList)
    .observe(millis / 1000.0)
}

Exactly the same method as for HTTP requests can be used to measure query count and average duration: just use sql_duration instead of http_request_duration.
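For example, with the sql_duration metric and its statement label, the per-statement variants of the earlier queries could look like this:

```promql
# queries per second, one series per SQL statement
rate(sql_duration_count[1m])

# average query duration per statement
rate(sql_duration_sum[1m]) / rate(sql_duration_count[1m])
```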

Kafka

When it comes to measuring a Kafka consumer, it's simply a matter of starting a timer before handling a message and closing it afterwards. Prometheus will measure the time automatically and expose the metrics. It's a good idea to add a label for the topic. Currently we have just one topic, but more might be introduced in the future.

val kafkaHistogram = Histogram
  .build("kafka_message_duration", "Kafka message handling duration")
  .labelNames("topic")
  .register()

def handleMessage(message: Message) = {
  val topic = message.record.topic
  val timer = kafkaHistogram.labels(topic).startTimer()
  handle(message)
  timer.close()
}

Querying Kafka messages received per second and average handling duration is the same as shown previously, just with different metric names.
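As a sketch, the per-topic equivalents of the earlier queries would be:

```promql
# messages handled per second, per topic
sum(rate(kafka_message_duration_count[1m])) by (topic)

# average message handling duration, per topic
rate(kafka_message_duration_sum[1m]) / rate(kafka_message_duration_count[1m])
```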

Calls to an external service

There are libraries for integrating an HTTP client, but this can also be achieved manually using a timer. We use STTP, which has a Prometheus plugin that exposes a histogram. Having histogram data, we can apply the same monitoring techniques for measuring the average request time and the number of requests.
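If the client you use has no such integration, a minimal manual sketch could look like the one below. Note that authService and its validate call are made up for illustration; only the Prometheus client calls are real:

```scala
import io.prometheus.client.Histogram

val externalCallHistogram = Histogram
  .build("external_call_duration", "External service call duration in seconds")
  .labelNames("service")
  .register()

def callAuthService(token: String): Boolean = {
  // start a timer labeled with the target service name
  val timer = externalCallHistogram.labels("auth").startTimer()
  try authService.validate(token) // hypothetical client call
  finally timer.close()           // closing the timer records the observation
}
```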

Generating video

For the nightly video generation we can use a similar strategy. For this task we need to manually start and stop a timer, similarly to the Kafka scenario:

val videoHistogram = Histogram
  .build("friends_video", "Friends video generation duration")
  .register()

def createVideos = {
  val friends = db.fetchUsersBeingFriendsOverFiveYears()
  friends.foreach { friendsPair =>
    val timer = videoHistogram.startTimer()
    videoGenerator.generate(friendsPair)
    timer.close()
  }
}

Since video generation takes some time, it's pointless to measure "videos per second". Instead, I would suggest plotting just the number of videos generated over time. This way we can see how many videos were generated in one batch and how long the batch took. It's as simple as querying the friends_video_count metric.
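For instance, assuming the batch runs once per night, the number of videos generated by the most recent run could be plotted with:

```promql
# videos generated over the last day
increase(friends_video_count[24h])
```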

Resources

Resources are the easiest part. By using Node Exporter we get all the resource-related metrics. There are also predefined Grafana dashboards, which I highly recommend for finding out which metrics to use and what queries to run.
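As an example, a typical CPU usage query over Node Exporter metrics (the node_cpu_seconds_total name assumes Node Exporter 0.16 or newer; older versions expose node_cpu) looks like:

```promql
# percentage of CPU time spent in non-idle modes, averaged over all cores
100 - avg(rate(node_cpu_seconds_total{mode="idle"}[1m])) * 100
```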

Visualizing

For visualizing I recommend using Grafana. It's especially useful when you have complicated metrics with lots of labels. You can then create drop-down menus and choose which label values you want to select. The best way to learn it is to find a good dashboard and see how it's done under the hood. Go to https://grafana.com/dashboards, choose Prometheus as the data source, browse the available dashboards, and you will learn how to use Grafana to its full potential in just a few minutes.

Summary

In this post I showed how to build a Prometheus monitoring solution for the components used in an ordinary system. There are, however, lots of different ways to approach such a task. If you know any of them, we would love to hear about them.