Scraping of metrics can result in deadlock of the process
Problem to solve
We hit a problem with Puma on web-cny-02 and on a single production node, where the Puma process:
- was consuming 100% of CPU,
- could not be HUPed, restarted, or stopped,
- was not accepting `SIGTERM`; it simply ignored it.
We saw that:
- `/var/run/gitlab/puma`, which holds the Prometheus metrics, lost its low cardinality: each metric had the PID assigned, and this resulted in the directory growing to 130 MB,
- when we looked at the Puma master process, we saw multiple threads executing on `:8083` and queueing metrics scraping,
- when we removed all `*.db` files from the above folder, the 100% CPU usage went away and the process was at 1-2% as expected,
- the process was not responding to any signals.
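For diagnosing a state like the one above, the metrics directory can be sized with a short Ruby snippet (a sketch; `/var/run/gitlab/puma` is the path from this incident, but any metrics directory works):

```ruby
require 'pathname'

# Sum up the per-process metric .db files in a metrics directory.
# A healthy directory has a bounded file count (roughly one file per metric
# type per worker id); unbounded growth points at a pid-label cardinality leak.
def metrics_dir_stats(dir)
  files = Pathname.glob(File.join(dir, "*.db"))
  { count: files.size, bytes: files.sum(&:size) }
end
```

For example, `metrics_dir_stats("/var/run/gitlab/puma")` would have reported the 130 MB observed here.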
It seems that the following happened:
- !20294 (merged) likely landed on Canary,
- it caused us to lose the low cardinality of metrics (the `pid` label being `puma_0..15`),
- this caused a ton of metrics to be emitted, which took longer than the 10s timeout defined by Prometheus,
- each request was still processed, which caused contention on `/metrics` and a fatal failure,
- a signal was likely received at some point, but due to the contention it was ignored, or an exception was raised?
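The cardinality mechanism can be illustrated with a small simulation (this is not the real prometheus-client-mmap API; the names are illustrative). Each distinct `pid` label value maps to its own on-disk `.db` file, so a provider that returns the raw OS PID grows the directory on every worker restart, while a stable `puma_<index>` id stays bounded:

```ruby
require 'set'

# Simulate which .db files exist after `boots` restarts of `workers` workers,
# given a pid_provider that turns (os_pid, worker_index) into a label value.
def db_files(boots, workers, pid_provider)
  files = Set.new
  boots.times do |boot|
    workers.times do |worker|
      os_pid = 1000 * (boot + 1) + worker        # OS PIDs differ on every boot
      files << "http_requests_#{pid_provider.call(os_pid, worker)}.db"
    end
  end
  files
end

raw_pid   = ->(os_pid, _worker) { os_pid }           # high cardinality
worker_id = ->(_os_pid, worker) { "puma_#{worker}" } # bounded: puma_0..15
```

With 16 workers over 10 boots, `db_files(10, 16, raw_pid).size` is 160 and keeps growing with every restart, while `db_files(10, 16, worker_id).size` stays at 16.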
Steps
- We need to figure out exactly what is happening,
- we should actively monitor the Prometheus scrape time: if it gets near 10s, it seems the same can happen on Sidekiq/Puma/Unicorn,
- given that Prometheus metrics emission happens in native code, it blocks the whole process,
- it seems the only way to resurrect the process would be a restart, or invalidating all metrics for the whole Puma/Unicorn/Sidekiq to reset the metrics directory; or maybe we should figure out a way to invalidate the metrics directory periodically?
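As a starting point for the monitoring step, scrape time can be measured locally with a monotonic clock (a sketch; the 8s warning threshold is an assumption, and `:8083` is the metrics port seen in this incident):

```ruby
require 'net/http'

# Measure how long a block takes using the monotonic clock
# (unaffected by NTP adjustments of wall-clock time).
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Warn when a scrape of the local metrics endpoint nears the 10s
# Prometheus scrape timeout.
def check_scrape(uri = URI("http://localhost:8083/metrics"), threshold: 8.0)
  duration = timed { Net::HTTP.get_response(uri) }
  warn "metrics scrape took #{duration.round(2)}s (timeout is 10s)" if duration > threshold
  duration
end
```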
Proposal
This in particular describes a case of deadlock happening for Puma. However, the same is applicable to any other case where we scrape metrics. If the scrape timeout is set to 10s and Prometheus starts enqueuing connections, they might simply deadlock the process, not allowing it to process anything.
The biggest downside of the current approach is that it blocks the Ruby runtime for the long time needed to execute the marshaling. Currently, there is no way to avoid that. The comment below proposes introducing a script that would be executed to perform the marshaling instead.
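Moving the marshaling out of the Ruby runtime could look roughly like this (a sketch of the idea only; `bin/marshal-metrics` is a hypothetical helper, and error handling is minimal):

```ruby
# Render /metrics by delegating the expensive marshaling to an external
# command instead of doing it in-process. The Ruby runtime then only blocks
# on I/O, so the worker keeps handling signals and other requests, and a
# slow scrape can be killed without taking the whole process down with it.
def render_metrics(command)
  out = IO.popen(command, &:read)   # e.g. ["bin/marshal-metrics", dir] (hypothetical)
  raise "metrics marshaling failed" unless $?.success?
  out
end
```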
Links / references
Edited by John Jarvis