Per-process Prometheus metrics for Sidekiq are missing for all but one process
### Summary
When multiple Sidekiq processes are present, the last one to start will delete the per-process metrics for all other processes.
<details>
<summary>Expand for list of impacted metrics</summary>
<pre>
gitlab_database_connection_pool_busy
gitlab_database_connection_pool_connections
gitlab_database_connection_pool_dead
gitlab_database_connection_pool_idle
gitlab_database_connection_pool_size
gitlab_database_connection_pool_waiting
gitlab_ruby_threads_max_expected_threads
gitlab_ruby_threads_running_threads
ruby_file_descriptors
ruby_gc_stat_compact_count
ruby_gc_stat_count
ruby_gc_stat_heap_allocatable_pages
ruby_gc_stat_heap_allocated_pages
ruby_gc_stat_heap_available_slots
ruby_gc_stat_heap_eden_pages
ruby_gc_stat_heap_final_slots
ruby_gc_stat_heap_free_slots
ruby_gc_stat_heap_live_slots
ruby_gc_stat_heap_marked_slots
ruby_gc_stat_heap_sorted_length
ruby_gc_stat_heap_tomb_pages
ruby_gc_stat_major_gc_count
ruby_gc_stat_malloc_increase_bytes
ruby_gc_stat_malloc_increase_bytes_limit
ruby_gc_stat_minor_gc_count
ruby_gc_stat_old_objects
ruby_gc_stat_old_objects_limit
ruby_gc_stat_oldmalloc_increase_bytes
ruby_gc_stat_oldmalloc_increase_bytes_limit
ruby_gc_stat_remembered_wb_unprotected_objects
ruby_gc_stat_remembered_wb_unprotected_objects_limit
ruby_gc_stat_total_allocated_objects
ruby_gc_stat_total_allocated_pages
ruby_gc_stat_total_freed_objects
ruby_gc_stat_total_freed_pages
ruby_process_cpu_seconds_total
ruby_process_max_fds
ruby_process_proportional_memory_bytes
ruby_process_resident_memory_bytes
ruby_process_start_time_seconds
ruby_process_unique_memory_bytes
sidekiq_concurrency
sidekiq_current_rss
sidekiq_memory_killer_hard_limit_rss
sidekiq_memory_killer_phase
sidekiq_memory_killer_soft_limit_rss
sidekiq_running_jobs
</pre>
</details>
This is very similar to #37387.
### Steps to reproduce
- Create multiple Sidekiq processes in `gitlab.rb`
```rb
sidekiq['queue_selector'] = true
sidekiq['queue_groups'] = [
  '*',
  '*',
  '*'
]
```
- Start Sidekiq
- Within a few seconds, all but one of the `gauge_all_sidekiq_N-N.db` and `gauge_max_sidekiq_N-N.db` files will be removed from `/run/gitlab/sidekiq`
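
To make the last step easier to check, here is a minimal sketch that groups the gauge files in the metrics directory by worker index. The helper name `gauge_files_by_worker` is hypothetical, and the file-name pattern is assumed from the names quoted above.

```rb
# Hypothetical helper: group gauge files in a metrics directory by worker index.
# The pattern is assumed from the file names quoted in this issue
# (gauge_all_sidekiq_N-N.db and gauge_max_sidekiq_N-N.db).
def gauge_files_by_worker(dir)
  names = Dir.glob(File.join(dir, 'gauge_*_sidekiq_*-*.db')).map { |path| File.basename(path) }

  names.sort
       .group_by { |name| name[/sidekiq_(\d+)-\d+\.db\z/, 1] }
       .reject { |worker, _files| worker.nil? } # ignore names without a numeric worker index
end

if $PROGRAM_NAME == __FILE__
  gauge_files_by_worker('/run/gitlab/sidekiq').each do |worker, files|
    puts "sidekiq_#{worker}: #{files.join(', ')}"
  end
end
```

With three queue groups configured, three workers should be listed. If the bug is present, only one worker is expected to remain shortly after startup.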
### What is the current *bug* behavior?
The last Sidekiq process to start will delete the `gauge_all_sidekiq_N-N.db` and `gauge_max_sidekiq_N-N.db` files owned by the other Sidekiq processes.
```sh
$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_2"} 930432
```
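
The behavior described above can be modelled in isolation. The sketch below is a simplified simulation, not GitLab code: it assumes each worker runs a directory-wide cleanup of `*.db` files when it boots and then creates its own gauge files. All function names are hypothetical, and the workers boot sequentially here rather than concurrently.

```rb
require 'fileutils'
require 'tmpdir'

# Simplified stand-in for the cleanup step (an assumption about what the real
# service does): remove every *.db file, regardless of which worker owns it.
def cleanup_metrics_dir(dir)
  FileUtils.rm_f(Dir.glob(File.join(dir, '*.db')))
end

# Simulate one worker booting: clean the shared directory, then create
# this worker's own gauge files.
def boot_worker(dir, index)
  cleanup_metrics_dir(dir)
  %w[gauge_all gauge_max].each do |prefix|
    FileUtils.touch(File.join(dir, "#{prefix}_sidekiq_#{index}-0.db"))
  end
end

# Boot the workers one after another in a temporary directory and return
# the names of the files that survive.
def simulate_startup(worker_count)
  Dir.mktmpdir do |dir|
    worker_count.times { |index| boot_worker(dir, index) }
    Dir.children(dir).sort
  end
end

puts simulate_startup(3)
```

Under these assumptions only the files of the last worker to boot remain, which matches the single `sidekiq_2` series in the output above.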
### What is the expected *correct* behavior?
Sidekiq processes do not delete each other's metrics.
```sh
$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_0"} 985476
sidekiq_current_rss{pid="sidekiq_1"} 977452
sidekiq_current_rss{pid="sidekiq_2"} 979264
```
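
The same check can be scripted. The following is a minimal sketch that extracts the `pid` label values for one metric from the metrics endpoint's text output and reports which workers are missing. The helper names are hypothetical, and the `sidekiq_N` label format is assumed from the output shown above.

```rb
# Hypothetical helper: collect the `pid` label values reported for one metric
# in Prometheus text-format output.
def pids_for_metric(metrics_text, metric_name)
  pattern = /\A#{Regexp.escape(metric_name)}\{[^}]*pid="([^"]*)"[^}]*\}/
  metrics_text.each_line.filter_map { |line| line[pattern, 1] }.sort.uniq
end

# Hypothetical helper: list the workers (sidekiq_0 .. sidekiq_N-1) that have
# no series for the given metric.
def missing_workers(metrics_text, metric_name, expected_count)
  expected = Array.new(expected_count) { |index| "sidekiq_#{index}" }
  expected - pids_for_metric(metrics_text, metric_name)
end
```

Fed the output from the bug-behavior example with an expected count of 3, `missing_workers` would report `sidekiq_0` and `sidekiq_1`. With the expected output it returns an empty array.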
### Relevant logs and/or screenshots
<!-- Paste any relevant logs - please use code blocks (```) to format console output, logs, and code
as it's tough to read otherwise. -->
### Output of checks
#### Results of GitLab environment info
<!-- Input any relevant GitLab environment information if needed. -->
<details>
<summary>Expand for output related to GitLab environment info</summary>
<pre>
System information
System: Debian 10
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.7.2p137
Gem Version: 3.1.4
Bundler Version:2.1.4
Rake Version: 13.0.3
Redis Version: 6.0.14
Git Version: 2.32.0
Sidekiq Version:5.2.9
Go Version: unknown
GitLab information
Version: 14.0.5-ee
Revision: b044f06e4dd
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 12.6
Elasticsearch: no
Geo: no
Using LDAP: no
Using Omniauth: yes
Omniauth Providers:
GitLab Shell
Version: 13.19.0
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git: /opt/gitlab/embedded/bin/git
</pre>
</details>
#### Results of GitLab application Check
<!-- Input any relevant GitLab application check information if needed. -->
<details>
<summary>Expand for output related to the GitLab application check</summary>
<pre>
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 13.19.0 ? ... OK (13.19.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/3
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab App ...
Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
2/1 ... yes
Redis version >= 5.0.0? ... yes
Ruby version >= 2.7.2 ? ... yes (2.7.2)
Git version >= 2.31.0 ? ... yes (2.32.0)
Git user has default SSH configuration? ... yes
Active users: ... 1
Is authorized keys file accessible? ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Elasticsearch version 7.x (6.4 - 6.x deprecated to be removed in 13.8)? ... skipped (elasticsearch is disabled)
Checking GitLab App ... Finished
Checking GitLab subtasks ... Finished
</pre>
</details>
### Possible fixes
It seems likely that this was introduced by !53139, which moved `Prometheus::CleanupMultiprocDirService` into the `Gitlab::Cluster::LifecycleEvents.on_master_start` section. I instrumented [line 45 of the metrics initializer](https://gitlab.com/gitlab-org/gitlab/-/blob/v14.0.5-ee/config/initializers/7_prometheus_metrics.rb#L45) with:
```rb
if Gitlab::Runtime.sidekiq?
  File.open('/tmp/cleanup_metrics.log', 'a') { |f| f.write("#{Time.now} pid #{Process.pid} - Execute CleanupMultiprocDir\n") }
end
```
The resulting log shows that `CleanupMultiprocDirService` is executed once for each of the three Sidekiq worker processes on this node:
```
2021-07-16 03:28:04 +0000 pid 26155 - Execute CleanupMultiprocDir
2021-07-16 03:28:04 +0000 pid 26157 - Execute CleanupMultiprocDir
2021-07-16 03:28:05 +0000 pid 26159 - Execute CleanupMultiprocDir
```
Checking these processes with `lsof`, we see that the first two have had their files deleted:
```sh
$ lsof -p $(pgrep -fd, sidekiq) | grep "\.db.*deleted"
bundle 26155 git 7wW REG 0,21 4096 132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle 26155 git 8uW REG 0,21 4096 132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle 26155 git 9wW REG 0,21 16384 132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle 26155 git 10uW REG 0,21 16384 132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle 26157 git 7wW REG 0,21 4096 133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle 26157 git 8uW REG 0,21 4096 133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle 26157 git 9wW REG 0,21 16384 133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
bundle 26157 git 10uW REG 0,21 16384 133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
```
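
The `(deleted)` entries are consistent with standard POSIX unlink semantics: a process that already holds a file open can keep writing to it after the directory entry is removed, but nothing that scans the directory will find the file again. A minimal sketch of that effect, with a hypothetical function name and a temporary directory standing in for `/run/gitlab/sidekiq`:

```rb
require 'tmpdir'

# Open a file, unlink it while the handle is still open, then keep writing.
# Returns [still_visible_in_directory, bytes_written_after_unlink].
# Assumes a POSIX filesystem (as on the Debian host in this report).
def write_after_unlink
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'gauge_all_sidekiq_0-0.db')
    File.open(path, 'w') do |file|
      File.unlink(path)              # stands in for another worker's cleanup
      written = file.write('sample') # still succeeds: the descriptor stays valid
      file.flush
      [File.exist?(path), written]
    end
  end
end

visible, written = write_after_unlink
puts "visible in directory: #{visible}, bytes written after unlink: #{written}"
```

This would explain why the affected workers keep running without errors while their series disappear from the `/metrics` output.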
/cc @alipniagov @mkaeppler