Run RubySampler and ThreadSampler in Puma primary
What does this MR do and why?
Our main source of information for Ruby memory use is the ruby_process_* metrics collected via RubySampler. This includes ruby_process_resident_memory_bytes, which represents process RSS.
However, this sampler currently only runs in Puma workers. This means that when taking aggregates in Thanos, such as summing up process RSS, the primary process is not accounted for:
Here, pids 0-5 are all Puma workers.
This provides a misleading picture of actual RSS allocated to Rails processes, especially when cross-referencing this data with memory killer events, since the Puma worker killer reaps workers based on total cluster RSS, not just worker RSS.
This MR makes sure we also run two samplers in the Puma primary process:
- RubySampler
- ThreadSampler
This is accomplished by stopping these samplers and re-creating them whenever a worker forks, so that they do not inherit metrics data from the primary accidentally.
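A minimal sketch of the stop-and-recreate idea, not the actual GitLab implementation. ToySampler and restart_samplers are hypothetical names invented for this example; in a real Puma setup, logic like this would presumably run from a worker-boot hook such as Puma's on_worker_boot.

```ruby
# Hypothetical stand-in for a sampler such as RubySampler (not GitLab's class).
# It records the current pid on a background thread at a fixed interval.
class ToySampler
  attr_reader :samples

  def initialize(interval: 0.01)
    @interval = interval
    @samples = []
    @thread = nil
    @running = false
  end

  # Starts the background sampling thread and returns self for chaining.
  def start
    @running = true
    @thread = Thread.new do
      while @running
        @samples << Process.pid
        sleep @interval
      end
    end
    self
  end

  # Signals the thread to finish and waits for it to exit.
  def stop
    @running = false
    @thread&.join
    @thread = nil
  end

  def running?
    !@thread.nil? && @thread.alive?
  end
end

# Hypothetical hook body: stop every sampler inherited from the primary and
# replace it with a fresh instance, so no state carries over into the worker.
def restart_samplers(samplers)
  samplers.map do |old|
    old.stop
    old.class.new.start
  end
end
```

Creating new instances rather than restarting the old ones is what keeps the worker from reporting data that was collected in the primary.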
Screenshots or screen recordings
We can look at Prometheus to see the new metrics coming from the puma_master now:
How to set up and validate locally
- Make sure ApplicationSettings#prometheus_metrics_enabled is true
- Start rails-web
- Interrogate which threads are running (e.g. via Thread.list) or look at metrics emitted via /-/metrics (you can grep them for puma_*)
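For the Thread.list check, a small helper along these lines can make the output easier to scan. This is only a sketch: live_thread_names is a hypothetical helper, and the thread name "toy_sampler" is made up for the demo, since the names the real sampler threads use are not stated here.

```ruby
# Returns the names of all live threads in this process, with a
# placeholder for threads that were never given a name.
def live_thread_names
  Thread.list.map { |thread| thread.name || "(unnamed)" }
end

# Hypothetical demo thread; the name is invented for this example.
demo = Thread.new { sleep }
demo.name = "toy_sampler"

puts live_thread_names.sort

# Clean up the demo thread.
demo.kill
demo.join
```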
I also verified that after kill -TERM-ing a worker, a new worker forks without problems and metrics still reset correctly.
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.
Related to #363833 (closed)