Multiprocess Prometheus metrics can be corrupted during forking
During the Ruby 3.1 rollout (https://gitlab.com/gitlab-org/gitlab/-/issues/430703#note_1637106542), we observed that Prometheus metrics weren't being reported. We looked at the .db file and found that an extraneous NUL byte had replaced the `n` in `db_config_name`:
```
00001f90 00 00 00 00 00 00 00 00 b2 00 00 00 5b 22 67 69 |............["gi|
00001fa0 74 6c 61 62 5f 64 61 74 61 62 61 73 65 5f 63 6f |tlab_database_co|
00001fb0 6e 6e 65 63 74 69 6f 6e 5f 70 6f 6f 6c 5f 69 64 |nnection_pool_id|
00001fc0 6c 65 22 2c 22 67 69 74 6c 61 62 5f 64 61 74 61 |le","gitlab_data|
00001fd0 62 61 73 65 5f 63 6f 6e 6e 65 63 74 69 6f 6e 5f |base_connection_|
00001fe0 70 6f 6f 6c 5f 69 64 6c 65 22 2c 5b 22 63 6c 61 |pool_idle",["cla|
00001ff0 73 73 22 2c 22 64 62 5f 63 6f 6e 66 69 67 5f 00 |ss","db_config_.| <--- here
00002000 61 6d 65 22 2c 22 68 6f 73 74 22 2c 22 70 6f 72 |ame","host","por|
00002010 74 22 5d 2c 5b 22 43 69 3a 3a 41 70 70 6c 69 63 |t"],["Ci::Applic|
00002020 61 74 69 6f 6e 52 65 63 6f 72 64 22 2c 22 63 69 |ationRecord","ci|
00002030 5f 72 65 70 6c 69 63 61 22 2c 22 31 30 2e 31 34 |_replica","10.14|
00002040 32 2e 30 2e 36 32 22 2c 6e 75 6c 6c 5d 5d 20 20 |2.0.62",null]] |
00002050 00 00 00 00 00 00 f0 3f b8 00 00 00 5b 22 67 69 |.......?....["gi|
```
We see that the NUL byte landed right at the 8192-byte boundary. We know that expand_file in https://gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmap/-/blob/b3be01cb56f48bd31c2d545edfb23758426f286c/ext/fast_mmaped_file_rs/src/mmap.rs#L496-522 grows the file by writing a NUL byte.
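For context, that growth strategy is the classic seek-and-write-one-byte trick. A Ruby paraphrase of what the Rust code does (an approximation for illustration, not the actual implementation):

```ruby
# Approximation of expand_file: extend the file to new_len by seeking to
# the last byte of the new size and writing a single NUL. The OS fills
# everything in between with zero bytes.
def expand_file(file, new_len)
  file.seek(new_len - 1, IO::SEEK_SET)
  file.write("\x00".b)
end
```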
This is likely related to a similar issue fixed in gitlab-org/ruby/gems/prometheus-client-mmap!130 (merged), where a forked Puma worker and its parent both attempt to write to the same file. Since each process has its own internal notion of how big the file is, the file gets corrupted: one process writes an entry while the other expands the file, and the expansion's NUL byte overwrites data that was already there.
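To make the failure mode concrete, here is a contrived sketch of the race (the filename, entry, and offset are invented): one process writes an entry while another, holding a stale end-of-file offset, writes its expansion NUL right into the middle of it:

```ruby
require 'tempfile'

file = Tempfile.new('metrics')
file.write('["db_config_name","host"]') # one process writes a new entry
file.flush

pid = fork do
  # The other process believes the file ends earlier than it does and
  # "expands" it by writing a NUL byte there, right inside the new entry.
  File.open(file.path, 'r+b') do |f|
    f.seek(12, IO::SEEK_SET) # stale offset captured before the write above
    f.write("\x00".b)
  end
end
Process.wait(pid)

p File.read(file.path) # => "[\"db_config_\u0000ame\",\"host\"]"
```

The result has exactly the shape of the corruption in the hexdump above: `db_config_`, then a NUL, then `ame`.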
Upon review, I believe this might be happening:
- Puma forks.
- The `ForkTracker#after_fork` callbacks run and shut down all connections.
- The Redis connection pool closes all connections.
- Our Redis store calls `use_primary_and_secondary_stores?`, which checks a feature flag in the DB.
- This initializes the load balancing `HostList`, which creates a gauge in the metrics.
- This forked worker has a `$PROGRAM_NAME` of `puma`, which makes it look like the master. Unfortunately, this causes these metrics to be written to `gauge_all_puma_master-0.db` instead of `gauge_max_puma_0-0.db`.
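The timing is the key detail: the `ForkTracker` callbacks fire inside `fork` itself, before Puma renames the child to `puma: cluster worker N: ...`, so the child still carries the process name it inherited from the master. Running that name through the matching logic in `lib/prometheus/pid_provider.rb` (simplified below; it is the same code the diff further down patches) shows why the worker is misclassified:

```ruby
# During the after-fork callbacks the child's process name is still the
# inherited one, e.g. "/Users/stanhu/.asdf/installs/ruby/3.1.4/bin/puma".
process_name = $PROGRAM_NAME

worker_id =
  if (matches = process_name.match(/puma.*cluster worker ([0-9]+):/))
    "puma_#{matches[1]}" # never matches in the after-fork window
  elsif process_name.include?('puma')
    'puma_master'        # a freshly forked worker falls through to here
  else
    'unknown'
  end

puts worker_id # => "puma_master", even inside the worker
```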
I determined this by adding some print statements to see when prometheus-client-mmap opens a new file and writes values:
```
2023-11-07_19:21:00.91874 rails-web : === open_exclusive_file: /Users/stanhu/gdk-ee/gitlab/tmp/prometheus_multiproc_dir/puma/gauge_all_puma_master-1.db PROGRAM_NAME=/Users/stanhu/.asdf/installs/ruby/3.1.4/bin/puma PID=94594, Thread name:
2023-11-07_19:21:00.91891 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:119:in `unsafe_initialize_file'
2023-11-07_19:21:00.91892 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:107:in `block in initialize_file'
2023-11-07_19:21:00.91892 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:106:in `synchronize'
2023-11-07_19:21:00.91893 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:106:in `initialize_file'
2023-11-07_19:21:00.91893 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:45:in `block in set'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:44:in `synchronize'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:44:in `set'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/gauge.rb:28:in `set'
2023-11-07_19:21:00.91895 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:83:in `set_metrics!'
2023-11-07_19:21:00.91895 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:48:in `block in next'
2023-11-07_19:21:00.91895 rails-web : <internal:kernel>:90:in `tap'
2023-11-07_19:21:00.91899 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:47:in `next'
2023-11-07_19:21:00.91900 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/load_balancer.rb:157:in `host'
2023-11-07_19:21:00.91900 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/load_balancer.rb:58:in `read'
2023-11-07_19:21:00.91901 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/connection_proxy.rb:67:in `schema_cache'
2023-11-07_19:21:00.91901 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/activerecord-7.0.8/lib/active_record/model_schema.rb:407:in `table_exists?'
2023-11-07_19:21:00.91902 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:270:in `feature_table_exists?'
2023-11-07_19:21:00.91903 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:203:in `use_primary_and_secondary_stores?'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:257:in `close'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:63:in `block (2 levels) in after_fork'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:147:in `shutdown_connections'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:96:in `block in shutdown'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:92:in `synchronize'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:92:in `shutdown'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:160:in `reload'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:60:in `block in after_fork'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:53:in `each'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:53:in `after_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:74:in `_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/activesupport-7.0.8/lib/active_support/fork_tracker.rb:7:in `_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:99:in `fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:99:in `spawn_worker'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:81:in `block in spawn_workers'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:74:in `times'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:74:in `spawn_workers'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:416:in `run'
2023-11-07_19:21:00.91912 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/launcher.rb:194:in `run'
```
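The print statement itself was along these lines (a hypothetical reconstruction of a local debugging patch to `mmaped_value.rb`; the variable name is a guess):

```ruby
# Added near the top of MmapedValue#unsafe_initialize_file to log every
# new .db file that gets opened, along with who opened it and from where.
warn "=== open_exclusive_file: #{filename} PROGRAM_NAME=#{$PROGRAM_NAME} " \
     "PID=#{Process.pid}, Thread name: #{Thread.current.name}"
warn caller.join("\n") # the Ruby backtrace that led to the open
```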
The easiest fix may be to append the PID to the worker ID in the PID provider:
```diff
diff --git a/lib/prometheus/pid_provider.rb b/lib/prometheus/pid_provider.rb
index 4d38e9513fb0..a614948cff83 100644
--- a/lib/prometheus/pid_provider.rb
+++ b/lib/prometheus/pid_provider.rb
@@ -28,7 +28,7 @@ def puma_worker_id
       if matches = process_name.match(/puma.*cluster worker ([0-9]+):/)
         "puma_#{matches[1]}"
       elsif process_name.include?('puma')
-        "puma_master"
+        "puma_master_pid_#{Process.pid}"
       else
         unknown_process_id
       end
```
This has the downside of initializing an extra, mostly empty .db file per metric type per process. I should note that `Gitlab::Metrics::RailsSlis.initialize_request_slis!` is called by `lib/gitlab/metrics/requests_rack_middleware.rb`, and this alone adds 7,577 metrics:
```shell
bin/parse -t /Users/stanhu/gdk-ee/gitlab/tmp/prometheus_multiproc_dir/puma/counter_puma_master_pid_89360-0.db | grep sli | wc -l
7577
```
Some ideas to fix this:
- For the Puma master, add the PID value to the PID provider.
- Add a `before_worker_fork` callback that disables metrics until `after_worker_fork` finishes.
- Add an `AfterFork` tracker that sets `$0` to something other than `puma`.
- Register the parent PID in the PID provider. Don't rely on `$PROGRAM_NAME` for this (see the sketch after this list).
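A minimal sketch of that last idea, assuming the provider is loaded in the master before workers fork (via `preload_app!`); the `MASTER_PID` constant and the fallback naming are hypothetical:

```ruby
# Identify the master by a PID captured at boot rather than by parsing
# $PROGRAM_NAME, so a freshly forked worker can never be mistaken for the
# master even before Puma renames it.
module Prometheus
  module PidProvider
    MASTER_PID = Process.pid # recorded when the master process boots

    def self.worker_id
      if (matches = $PROGRAM_NAME.match(/puma.*cluster worker ([0-9]+):/))
        "puma_#{matches[1]}"
      elsif Process.pid == MASTER_PID
        'puma_master'
      else
        # A forked child whose $0 hasn't been rewritten yet: fall back to a
        # PID-based ID instead of colliding with the master's files.
        "puma_pid_#{Process.pid}"
      end
    end
  end
end
```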