At Synthesio, part of our monitoring solution relies on Prometheus. We ran into a few issues while scaling it to meet our needs, and here is a write-up on one of them and how we helped solve it.

At Synthesio, we use Prometheus for metrics collecting, and MetricTank as its long-term storage. While this is a common enough setup, we generally operate things at a scale that makes scaling issues surface quickly.

At some point we noticed that the default remote_write configuration (from Prometheus to MetricTank) was losing datapoints along the way because it couldn't keep up with the pace.

At this moment, the load was light compared to what we expected to have later, so we looked at the configuration to fine-tune it. We increased the capacity of the queue and the number of shards (the number of parallel writes).

```yaml
capacity: 100000          # default 10000
max_shards: 2000          # default 1000
max_samples_per_send: 200 # default 200
```
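For context, these knobs live under a remote_write entry's queue_config section in prometheus.yml. A minimal sketch of where they fit (the endpoint URL is hypothetical, not our actual setup):

```yaml
# prometheus.yml (fragment) -- hypothetical endpoint URL
remote_write:
  - url: "http://metrictank.example.internal/metrics"
    queue_config:
      capacity: 100000          # default 10000
      max_shards: 2000          # default 1000
      max_samples_per_send: 200 # default 200
```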

The datapoint loss stopped immediately, as well as a few issues we had about out-of-memory processes. This was left alone for a few months so we could work on other things and progressively increase the number of datapoints to process.

All this was good, until we added Prometheus metrics to all our services during a three-day-long monitoring frenzy. As a result, each instance of our two-instance Prometheus setup had to handle 1.2M additional timeseries overnight, spread among 600+ targets.

We soon noticed an unusual CPU load on the machines, averaging around 40% with huge spikes from time to time. A quick investigation showed that most of it was system CPU usage, which was unexpected: Prometheus operates mostly in userland and is not supposed to stress the kernel that much.

To get more information, we whipped out perf.

```
$ perf top
Samples: 5M of event 'cycles:ppp', 4000 Hz, Event count (approx.): 667022966996
Overhead  Shared Object  Symbol
  19.85%  [kernel]       [k] __inet_check_established
  15.13%  [kernel]       [k] _raw_spin_lock
  10.89%  [kernel]       [k] __inet_hash_connect
   9.01%  [kernel]       [k] _raw_spin_lock_bh
   7.49%  [kernel]       [k] inet_ehashfn
   5.31%  [kernel]       [k] tcp_twsk_unique
   4.42%  [kernel]       [k] native_queued_spin_lock_slowpath
   2.20%  [kernel]       [k] __indirect_thunk_start
   1.96%  [kernel]       [k] ___bpf_prog_run
   1.81%  [kernel]       [k] __local_bh_enable_ip
   0.54%  [kernel]       [k] _cond_resched
   0.54%  [kernel]       [k] rcu_all_qs
   0.46%  [kernel]       [k] _raw_spin_unlock_bh
   0.23%  [kernel]       [k] __entry_trampoline_start
   0.21%  [kernel]       [k] syscall_return_via_sysret
   0.18%  [kernel]       [k] __bpf_prog_run32
   0.17%  [kernel]       [k] seccomp_run_filters
   0.16%  [kernel]       [k] vsnprintf
   0.16%  prometheus     [.] 0x00000000000722ed
```

It is immediately obvious that the top five to seven symbols eating up CPU are related to the issue at hand, and the very first line was enough to set us on the right track. The __inet_check_established function loops over the kernel's hash table of connections. This table is called the "TCP established hash table" even though it also contains connections in other states. It is used to locate an existing connection, for example when receiving a new segment.

And here was our first lead. TCP_TIME_WAIT is the state during which the kernel prevents a port from being re-used by a new connection, to ensure that all packets still transiting on that port are evacuated after a connection is closed. The default timeout depends on the network stack configuration. Having lots of entries in this state generally means you're opening a lot of short-lived connections, which makes some sense for a process that connects to a lot of different targets (we have 600+ Prometheus targets, remember?), but not for the connections to our LTS.

We gathered a few statistics about the number of open connections using ss, and noticed a large number of connections toward the LTS.

```shell
ss -ta state TIME-WAIT \
  | awk '{print $(NF)" "$(NF-1)}' \
  | sed 's/:[^ ]*//g' \
  | tail -n +2 \
  | sort \
  | uniq -c
```

We did a second round of tuning to get the situation under control and buy us the time to understand the root of the issue. At this point, the instances were stable, but at the price of a huge memory consumption that wouldn't be sustainable in the long term. Here is the remote_write configuration:

```yaml
capacity: 100000             # default 10000
max_shards: 2000             # default 1000
max_samples_per_send: 100000 # default 200
```

We audited the connections from Prometheus to MetricTank to understand what was going on, by capturing them with tcpdump and examining the capture in Wireshark.

What was immediately evident was that each connection only ever carried a single exchange: one request to send a batch of datapoints, and the response from the server. Which is bad, because it means that every time Prometheus sends datapoints to the LTS, it does so over a new TCP connection, which is costly.

The good news is that HTTP/1.1's keep-alive feature is a simple solution to this, and it has existed for a long time. The bad news is that if it isn't being used, either we misconfigured something or it isn't supported by Prometheus or MetricTank.

The TCP dump showed that Prometheus is effectively the one closing each connection after using it once (it is the first to send a FIN), which is surprising. We checked by hand whether MetricTank supports keep-alive, and the answer was yes, so the issue is effectively on the Prometheus side.

We did a quick audit of the remote storage code in Prometheus, and found a common mistake with Go and HTTP requests that prevented the TCP connection from being re-used immediately. This meant that every time Prometheus wanted to send datapoints to the LTS, it effectively had to open a new connection, hence the observed behavior.

The patch itself is simple, and a straightforward dump-and-compare operation confirmed it was working as intended by allowing connection re-use.

The patch is included in Prometheus 2.9.2 and reduces CPU consumption by lowering the number of opened connections. As you can guess by looking at the following graph, we deployed it on our own instances on the 25th around 12:00.

Et voilà!