Multiprocess Prometheus metrics can be corrupted during forking
During the Ruby 3.1 rollout (https://gitlab.com/gitlab-org/gitlab/-/issues/430703#note_1637106542), we observed that Prometheus metrics weren't being reported. We looked at the .db file and found that an extraneous NUL byte had replaced the `n` in `db_config_name`:
```
00001f90 00 00 00 00 00 00 00 00 b2 00 00 00 5b 22 67 69 |............["gi|
00001fa0 74 6c 61 62 5f 64 61 74 61 62 61 73 65 5f 63 6f |tlab_database_co|
00001fb0 6e 6e 65 63 74 69 6f 6e 5f 70 6f 6f 6c 5f 69 64 |nnection_pool_id|
00001fc0 6c 65 22 2c 22 67 69 74 6c 61 62 5f 64 61 74 61 |le","gitlab_data|
00001fd0 62 61 73 65 5f 63 6f 6e 6e 65 63 74 69 6f 6e 5f |base_connection_|
00001fe0 70 6f 6f 6c 5f 69 64 6c 65 22 2c 5b 22 63 6c 61 |pool_idle",["cla|
00001ff0 73 73 22 2c 22 64 62 5f 63 6f 6e 66 69 67 5f 00 |ss","db_config_.| <--- here
00002000 61 6d 65 22 2c 22 68 6f 73 74 22 2c 22 70 6f 72 |ame","host","por|
00002010 74 22 5d 2c 5b 22 43 69 3a 3a 41 70 70 6c 69 63 |t"],["Ci::Applic|
00002020 61 74 69 6f 6e 52 65 63 6f 72 64 22 2c 22 63 69 |ationRecord","ci|
00002030 5f 72 65 70 6c 69 63 61 22 2c 22 31 30 2e 31 34 |_replica","10.14|
00002040 32 2e 30 2e 36 32 22 2c 6e 75 6c 6c 5d 5d 20 20 |2.0.62",null]] |
00002050 00 00 00 00 00 00 f0 3f b8 00 00 00 5b 22 67 69 |.......?....["gi|
```
We see that the NUL byte landed right at the 8192-byte boundary. We know that expand_file in https://gitlab.com/gitlab-org/ruby/gems/prometheus-client-mmap/-/blob/b3be01cb56f48bd31c2d545edfb23758426f286c/ext/fast_mmaped_file_rs/src/mmap.rs#L496-522 grows the file by writing a NUL byte.
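For context, that growth strategy is the classic seek-and-write-one-byte trick. A Ruby paraphrase of what the Rust code does (an approximation for illustration, not the actual implementation):

```ruby
# Approximation of expand_file: extend the file to new_len by seeking to
# the last byte of the new size and writing a single NUL. The OS fills
# everything in between with zero bytes.
def expand_file(file, new_len)
  file.seek(new_len - 1, IO::SEEK_SET)
  file.write("\x00".b)
end
```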
This is likely related to a similar issue fixed in gitlab-org/ruby/gems/prometheus-client-mmap!130 (merged), where a forked Puma worker and its parent both attempt to write to the same file. Since each process has its own internal notion of how big the file is, the file gets corrupted: one process writes an entry while the other expands the file, and the expansion's NUL byte overwrites data that was already there.
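To make the failure mode concrete, here is a contrived sketch of the race (the filename, entry, and offset are invented): one process writes an entry while another, holding a stale end-of-file offset, writes its expansion NUL right into the middle of it:

```ruby
require 'tempfile'

file = Tempfile.new('metrics')
file.write('["db_config_name","host"]') # one process writes a new entry
file.flush

pid = fork do
  # The other process believes the file ends earlier than it does and
  # "expands" it by writing a NUL byte there, right inside the new entry.
  File.open(file.path, 'r+b') do |f|
    f.seek(12, IO::SEEK_SET) # stale offset captured before the write above
    f.write("\x00".b)
  end
end
Process.wait(pid)

p File.read(file.path) # => "[\"db_config_\u0000ame\",\"host\"]"
```

The result has exactly the shape of the corruption in the hexdump above: `db_config_`, then a NUL, then `ame`.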
Upon review, I believe this might be happening:
- Puma forks.
- The `ForkTracker#after_fork` callbacks run and shut down all connections.
- The Redis connection pool closes all connections.
- Our Redis store calls `use_primary_and_secondary_stores?`, which checks a feature flag in the DB.
- This initializes the load balancing `HostList`, which creates a gauge in the metrics.
- This forked worker has a `$PROGRAM_NAME` of `puma`, which makes it look like the master. Unfortunately, this causes these metrics to be written to `gauge_all_puma_master-0.db` instead of `gauge_max_puma_0-0.db`.
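The timing is the key detail: the `ForkTracker` callbacks fire inside `fork` itself, before Puma renames the child to `puma: cluster worker N: ...`, so the child still carries the process name it inherited from the master. Running that name through the matching logic in `lib/prometheus/pid_provider.rb` (simplified below; it is the same code the diff further down patches) shows why the worker is misclassified:

```ruby
# During the after-fork callbacks the child's process name is still the
# inherited one, e.g. "/Users/stanhu/.asdf/installs/ruby/3.1.4/bin/puma".
process_name = $PROGRAM_NAME

worker_id =
  if (matches = process_name.match(/puma.*cluster worker ([0-9]+):/))
    "puma_#{matches[1]}" # never matches in the after-fork window
  elsif process_name.include?('puma')
    'puma_master'        # a freshly forked worker falls through to here
  else
    'unknown'
  end

puts worker_id # => "puma_master", even inside the worker
```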
I determined this by adding some print statements to see when prometheus-client-mmap opens a new file and writes values:
```
2023-11-07_19:21:00.91874 rails-web : === open_exclusive_file: /Users/stanhu/gdk-ee/gitlab/tmp/prometheus_multiproc_dir/puma/gauge_all_puma_master-1.db PROGRAM_NAME=/Users/stanhu/.asdf/installs/ruby/3.1.4/bin/puma PID=94594, Thread name:
2023-11-07_19:21:00.91891 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:119:in `unsafe_initialize_file'
2023-11-07_19:21:00.91892 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:107:in `block in initialize_file'
2023-11-07_19:21:00.91892 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:106:in `synchronize'
2023-11-07_19:21:00.91893 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:106:in `initialize_file'
2023-11-07_19:21:00.91893 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:45:in `block in set'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:44:in `synchronize'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/mmaped_value.rb:44:in `set'
2023-11-07_19:21:00.91894 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/prometheus-client-mmap-0.28.1-arm64-darwin/lib/prometheus/client/gauge.rb:28:in `set'
2023-11-07_19:21:00.91895 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:83:in `set_metrics!'
2023-11-07_19:21:00.91895 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:48:in `block in next'
2023-11-07_19:21:00.91895 rails-web : <internal:kernel>:90:in `tap'
2023-11-07_19:21:00.91899 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/host_list.rb:47:in `next'
2023-11-07_19:21:00.91900 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/load_balancer.rb:157:in `host'
2023-11-07_19:21:00.91900 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/load_balancer.rb:58:in `read'
2023-11-07_19:21:00.91901 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/database/load_balancing/connection_proxy.rb:67:in `schema_cache'
2023-11-07_19:21:00.91901 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/activerecord-7.0.8/lib/active_record/model_schema.rb:407:in `table_exists?'
2023-11-07_19:21:00.91902 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:270:in `feature_table_exists?'
2023-11-07_19:21:00.91903 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:203:in `use_primary_and_secondary_stores?'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/gdk-ee/gitlab/lib/gitlab/redis/multi_store.rb:257:in `close'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:63:in `block (2 levels) in after_fork'
2023-11-07_19:21:00.91904 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:147:in `shutdown_connections'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:96:in `block in shutdown'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:92:in `synchronize'
2023-11-07_19:21:00.91905 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool/timed_stack.rb:92:in `shutdown'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:160:in `reload'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:60:in `block in after_fork'
2023-11-07_19:21:00.91909 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:53:in `each'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:53:in `after_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/connection_pool-2.4.1/lib/connection_pool.rb:74:in `_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/activesupport-7.0.8/lib/active_support/fork_tracker.rb:7:in `_fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:99:in `fork'
2023-11-07_19:21:00.91910 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:99:in `spawn_worker'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:81:in `block in spawn_workers'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:74:in `times'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:74:in `spawn_workers'
2023-11-07_19:21:00.91911 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/cluster.rb:416:in `run'
2023-11-07_19:21:00.91912 rails-web : /Users/stanhu/.asdf/installs/ruby/3.1.4/lib/ruby/gems/3.1.0/gems/puma-6.4.0/lib/puma/launcher.rb:194:in `run'
```
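The print statement itself was along these lines (a hypothetical reconstruction of a local debugging patch to `mmaped_value.rb`; the variable name is a guess):

```ruby
# Added near the top of MmapedValue#unsafe_initialize_file to log every
# new .db file that gets opened, along with who opened it and from where.
warn "=== open_exclusive_file: #{filename} PROGRAM_NAME=#{$PROGRAM_NAME} " \
     "PID=#{Process.pid}, Thread name: #{Thread.current.name}"
warn caller.join("\n") # the Ruby backtrace that led to the open
```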
The easiest fix may be to append the PID to the worker ID in the PID provider:
```diff
diff --git a/lib/prometheus/pid_provider.rb b/lib/prometheus/pid_provider.rb
index 4d38e9513fb0..a614948cff83 100644
--- a/lib/prometheus/pid_provider.rb
+++ b/lib/prometheus/pid_provider.rb
@@ -28,7 +28,7 @@ def puma_worker_id
       if matches = process_name.match(/puma.*cluster worker ([0-9]+):/)
         "puma_#{matches[1]}"
       elsif process_name.include?('puma')
-        "puma_master"
+        "puma_master_pid_#{Process.pid}"
       else
         unknown_process_id
       end
```
This has the downside of initializing an extra, mostly empty .db file per metric type per process. I should note that `Gitlab::Metrics::RailsSlis.initialize_request_slis!` is called by `lib/gitlab/metrics/requests_rack_middleware.rb`, and this alone adds 7,577 metrics:
```shell
bin/parse -t /Users/stanhu/gdk-ee/gitlab/tmp/prometheus_multiproc_dir/puma/counter_puma_master_pid_89360-0.db | grep sli | wc -l
7577
```
Some ideas to fix this:
- For the Puma master, add the PID value to the PID provider.
- Add a `before_worker_fork` callback that disables metrics until `after_worker_fork` finishes.
- Add an `AfterFork` tracker that sets `$0` to something other than `puma`.
- Register the parent PID in the PID provider. Don't rely on `$PROGRAM_NAME` for this (see the sketch after this list).
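A minimal sketch of that last idea, assuming the provider is loaded in the master before workers fork (via `preload_app!`); the `MASTER_PID` constant and the fallback naming are hypothetical:

```ruby
# Identify the master by a PID captured at boot rather than by parsing
# $PROGRAM_NAME, so a freshly forked worker can never be mistaken for the
# master even before Puma renames it.
module Prometheus
  module PidProvider
    MASTER_PID = Process.pid # recorded when the master process boots

    def self.worker_id
      if (matches = $PROGRAM_NAME.match(/puma.*cluster worker ([0-9]+):/))
        "puma_#{matches[1]}"
      elsif Process.pid == MASTER_PID
        'puma_master'
      else
        # A forked child whose $0 hasn't been rewritten yet: fall back to a
        # PID-based ID instead of colliding with the master's files.
        "puma_pid_#{Process.pid}"
      end
    end
  end
end
```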