Scraping of metrics can result in deadlock of the process
Problem to solve
We hit a problem with Puma on web-cny-02 and on a single production node, where the Puma process:
- was consuming 100% of CPU,
- could not be HUPed, restarted, or stopped,
- was not accepting `SIGTERM`; it simply ignored it.
We saw that:
- `/var/run/gitlab/puma`, which holds the Prometheus metrics, lost its low cardinality: each metric had the PID assigned, and this resulted in the directory growing to 130 MB,
- when we looked at the Puma master process, we saw multiple threads executing on `:8083` and queueing metrics scraping,
- when we removed all `*.db` files from the above folder, the 100% CPU usage went away and the process was at 1-2% as expected,
- the process was not responding to any signals.
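For diagnosing a state like the one above, the metrics directory can be sized with a short Ruby snippet (a sketch; `/var/run/gitlab/puma` is the path from this incident, but any metrics directory works):

```ruby
require 'pathname'

# Sum up the per-process metric .db files in a metrics directory.
# A healthy directory has a bounded file count (roughly one file per metric
# type per worker id); unbounded growth points at a pid-label cardinality leak.
def metrics_dir_stats(dir)
  files = Pathname.glob(File.join(dir, "*.db"))
  { count: files.size, bytes: files.sum(&:size) }
end
```

For example, `metrics_dir_stats("/var/run/gitlab/puma")` would have reported the 130 MB observed here.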
It seems that the following happened:
- !20294 (merged) likely landed on Canary,
- it caused us to lose the low cardinality of metrics (the `pid` label being `puma_0..15`),
- this caused a ton of metrics to be emitted, which took longer than the 10s timeout defined by Prometheus,
- each request was still processed, which caused contention on `/metrics` and a fatal failure,
- a signal was likely received at some point, but due to the contention it was ignored, or an exception was raised?
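The cardinality mechanism can be illustrated with a small simulation (this is not the real prometheus-client-mmap API; the names are illustrative). Each distinct `pid` label value maps to its own on-disk `.db` file, so a provider that returns the raw OS PID grows the directory on every worker restart, while a stable `puma_<index>` id stays bounded:

```ruby
require 'set'

# Simulate which .db files exist after `boots` restarts of `workers` workers,
# given a pid_provider that turns (os_pid, worker_index) into a label value.
def db_files(boots, workers, pid_provider)
  files = Set.new
  boots.times do |boot|
    workers.times do |worker|
      os_pid = 1000 * (boot + 1) + worker        # OS PIDs differ on every boot
      files << "http_requests_#{pid_provider.call(os_pid, worker)}.db"
    end
  end
  files
end

raw_pid   = ->(os_pid, _worker) { os_pid }           # high cardinality
worker_id = ->(_os_pid, worker) { "puma_#{worker}" } # bounded: puma_0..15
```

With 16 workers over 10 boots, `db_files(10, 16, raw_pid).size` is 160 and keeps growing with every restart, while `db_files(10, 16, worker_id).size` stays at 16.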
Steps
- We need to figure out exactly what is happening,
- we should actively monitor the Prometheus scrape time: if it gets near 10s, it seems the same can happen on Sidekiq/Puma/Unicorn,
- given that Prometheus metrics emission happens in native code, it blocks the whole process,
- it seems the only way to resurrect the process would be a restart, or invalidating all metrics for the whole Puma/Unicorn/Sidekiq to reset the metrics directory; or maybe we should figure out a way to invalidate the metrics directory periodically?
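As a starting point for the monitoring step, scrape time can be measured locally with a monotonic clock (a sketch; the 8s warning threshold is an assumption, and `:8083` is the metrics port seen in this incident):

```ruby
require 'net/http'

# Measure how long a block takes using the monotonic clock
# (unaffected by NTP adjustments of wall-clock time).
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# Warn when a scrape of the local metrics endpoint nears the 10s
# Prometheus scrape timeout.
def check_scrape(uri = URI("http://localhost:8083/metrics"), threshold: 8.0)
  duration = timed { Net::HTTP.get_response(uri) }
  warn "metrics scrape took #{duration.round(2)}s (timeout is 10s)" if duration > threshold
  duration
end
```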
Proposal
This in particular describes a case of deadlock happening for Puma. However, the same is applicable to any other case where we scrape metrics. If the scrape timeout is set to 10s and Prometheus starts enqueuing connections, they might simply deadlock the process, not allowing it to process anything.
The biggest downside of the current approach is that it blocks the Ruby runtime for the long time needed to execute the marshaling. Currently, there is no way to avoid that. The comment below proposes introducing a script that would be executed to perform the marshaling instead.
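Moving the marshaling out of the Ruby runtime could look roughly like this (a sketch of the idea only; `bin/marshal-metrics` is a hypothetical helper, and error handling is minimal):

```ruby
# Render /metrics by delegating the expensive marshaling to an external
# command instead of doing it in-process. The Ruby runtime then only blocks
# on I/O, so the worker keeps handling signals and other requests, and a
# slow scrape can be killed without taking the whole process down with it.
def render_metrics(command)
  out = IO.popen(command, &:read)   # e.g. ["bin/marshal-metrics", dir] (hypothetical)
  raise "metrics marshaling failed" unless $?.success?
  out
end
```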
Links / references
Edited by John Jarvis