fix: Defer Gitlab::Instrumentation::ConnectionPool until after reinitialize_on_pid_change (!142042)

Igor requested to merge igor-prometheus-force-reinitialize-on-pid-change into master Jan 17, 2024

What does this MR do and why?

While looking at puma worker metrics, I noticed that some of them appear to be missing, in particular gauges from RubySampler, e.g. ruby_file_descriptors (thanos). We only have values for the pid puma_master, and none for puma_0, puma_1, etc.

The corresponding prometheus mmap file prefix is gauge_all. It appears we are not properly resetting MmapedValue after forking the puma worker. Thus, workers continue writing to the mmapped file belonging to the puma_master, and are likely competing with and continuously overwriting each other.

Starting with Ruby 3.1, connection_pool performs logic directly on fork:

https://github.com/mperham/connection_pool/blob/f83b6304c0e5936b1b286b26a73f3febda051c9b/lib/connection_pool.rb#L69-L74

This is triggered by puma here:

https://github.com/puma/puma/blob/v6.4.0/lib/puma/cluster.rb#L99
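For context, the general Ruby 3.1+ mechanism that lets a library run code directly on fork is Process._fork. Below is a minimal sketch of that pattern. ForkHookSketch, after_fork and report_from_child are hypothetical names for illustration, not connection_pool or puma code, and the linked lines may differ in detail.

```ruby
# Hypothetical names throughout; only Process._fork itself is Ruby API (3.1+).
module ForkHookSketch
  CALLBACKS = []

  def self.after_fork(&block)
    CALLBACKS << block
  end

  # Kernel#fork and Process.fork go through Process._fork internally.
  def _fork
    pid = super
    ForkHookSketch::CALLBACKS.each(&:call) if pid == 0 # 0 means we are the child
    pid
  end
end

Process.singleton_class.prepend(ForkHookSketch) if Process.respond_to?(:_fork)

# Runs the block in a forked child and returns what it reported via a pipe.
def report_from_child
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(yield.to_s)
    writer.close
    exit!(0) # skip at_exit handlers inherited from the parent
  end
  writer.close
  Process.wait(pid)
  reader.read
ensure
  reader&.close
end

hook_pid = nil
ForkHookSketch.after_fork { hook_pid = Process.pid }

# In the child, the hook has already run by the time the fork block starts.
report_from_child { hook_pid == Process.pid }
```

Anything registered this way runs in the child before fork returns to its caller, so it runs before any worker start-up code that follows the fork.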

If we instrument before the fork, then instrumentation logic will get called on fork, which causes metrics to be re-initialized with the wrong pid.

Metrics continue to be attributed to the puma_master, because we do not yet have the process name set at that point. This process name is needed for proper attribution by PidProvider.

By deferring the instrumentation, we ensure no prometheus metrics are touched until the process name is set, and we can then explicitly reinitialize via reinitialize_on_pid_change.

This allows the gauge metrics to be correctly attributed to puma_0, puma_1, etc.
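The ordering problem can be pictured with a toy model. This is a sketch under assumptions, not GitLab or prometheus-client-mmap code: ToyMetrics, its fake pids (100 and 101) and simulate_worker are all hypothetical, and the "rebind at most once per pid change" behaviour is an assumed simplification of how the real reinitialization works.

```ruby
# Toy model only: every name here is hypothetical.
class ToyMetrics
  attr_writer :process_name

  def initialize
    @process_name = "puma_master" # name the child still carries right after fork
    @os_pid = 100                 # pretend pid of the master
    @bound_pid = nil
    @bound_name = nil
  end

  # Pretend the process was forked: only the OS pid changes.
  def simulate_fork(child_pid:)
    @os_pid = child_pid
  end

  # Assumed behaviour: rebind to a per-process file at most once per pid change.
  def reinitialize_on_pid_change
    return if @bound_pid == @os_pid

    @bound_pid = @os_pid
    @bound_name = @process_name
  end

  # Any metric access rebinds if needed, then reports which process
  # the value would be attributed to.
  def touch
    reinitialize_on_pid_change
    @bound_name
  end
end

def simulate_worker(instrumented_before_fork:)
  metrics = ToyMetrics.new
  metrics.touch                             # master records its own metrics
  metrics.simulate_fork(child_pid: 101)
  metrics.touch if instrumented_before_fork # fork hook fires the instrumentation
  metrics.process_name = "puma_0"           # worker start-up sets the name
  metrics.reinitialize_on_pid_change        # explicit reinitialization
  metrics.touch
end

simulate_worker(instrumented_before_fork: true)  # => "puma_master"
simulate_worker(instrumented_before_fork: false) # => "puma_0"
```

In the first call the pid change is consumed while the name is still puma_master, so the later explicit reinitialization has nothing left to do. In the second call nothing touches the metrics until the name is set.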

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

How to set up and validate locally

1. Start rails in gdk. Wait a minute for PumaSampler to kick in.
2. Check tmp/prometheus_multiproc_dir/puma: there is no gauge_all_puma_0-0.db.
3. Check curl -s 127.0.0.1:3000/-/metrics | grep '^ruby_file_descriptors': there is only puma_master.
4. Restart rails in gdk. Wait a minute for PumaSampler to kick in.
5. Check tmp/prometheus_multiproc_dir/puma: we now have gauge_all_puma_0-0.db.
6. Check curl -s 127.0.0.1:3000/-/metrics | grep '^ruby_file_descriptors': we now have pids puma_0, puma_1, puma_master.
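The curl checks above can also be scripted. A small sketch, assuming the endpoint returns the Prometheus text exposition format and that the process is carried in a label named pid (as the output described above suggests). pid_labels is a hypothetical helper, and the sample values are made up.

```ruby
# Hypothetical helper: distinct values of one label across a metric's samples.
def pid_labels(metrics_text, metric: "ruby_file_descriptors", label: "pid")
  label_pattern = /[{,]#{Regexp.escape(label)}="([^"]*)"/

  metrics_text.each_line.filter_map do |line|
    next unless line.start_with?("#{metric}{")

    line[label_pattern, 1]
  end.uniq.sort
end

# Made-up sample in the shape the validation steps describe.
sample = <<~METRICS
  # TYPE ruby_file_descriptors gauge
  ruby_file_descriptors{pid="puma_master"} 30
  ruby_file_descriptors{pid="puma_0"} 41
  ruby_file_descriptors{pid="puma_1"} 40
METRICS

pid_labels(sample) # => ["puma_0", "puma_1", "puma_master"]
```

To run it against a live gdk, the curl output can be piped in and read with pid_labels($stdin.read). Before the fix the result should contain only puma_master.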

refs #367527 (closed)

Edited Jan 18, 2024 by Igor
