Per-process Prometheus metrics for Sidekiq are missing for all but one process
Summary
When multiple Sidekiq processes are present, the last one to start will delete the per-process metrics for all other processes.
Impacted metrics:
- gitlab_database_connection_pool_busy
- gitlab_database_connection_pool_connections
- gitlab_database_connection_pool_dead
- gitlab_database_connection_pool_idle
- gitlab_database_connection_pool_size
- gitlab_database_connection_pool_waiting
- gitlab_ruby_threads_max_expected_threads
- gitlab_ruby_threads_running_threads
- ruby_file_descriptors
- ruby_gc_stat_compact_count
- ruby_gc_stat_count
- ruby_gc_stat_heap_allocatable_pages
- ruby_gc_stat_heap_allocated_pages
- ruby_gc_stat_heap_available_slots
- ruby_gc_stat_heap_eden_pages
- ruby_gc_stat_heap_final_slots
- ruby_gc_stat_heap_free_slots
- ruby_gc_stat_heap_live_slots
- ruby_gc_stat_heap_marked_slots
- ruby_gc_stat_heap_sorted_length
- ruby_gc_stat_heap_tomb_pages
- ruby_gc_stat_major_gc_count
- ruby_gc_stat_malloc_increase_bytes
- ruby_gc_stat_malloc_increase_bytes_limit
- ruby_gc_stat_minor_gc_count
- ruby_gc_stat_old_objects
- ruby_gc_stat_old_objects_limit
- ruby_gc_stat_oldmalloc_increase_bytes
- ruby_gc_stat_oldmalloc_increase_bytes_limit
- ruby_gc_stat_remembered_wb_unprotected_objects
- ruby_gc_stat_remembered_wb_unprotected_objects_limit
- ruby_gc_stat_total_allocated_objects
- ruby_gc_stat_total_allocated_pages
- ruby_gc_stat_total_freed_objects
- ruby_gc_stat_total_freed_pages
- ruby_process_cpu_seconds_total
- ruby_process_max_fds
- ruby_process_proportional_memory_bytes
- ruby_process_resident_memory_bytes
- ruby_process_start_time_seconds
- ruby_process_unique_memory_bytes
- sidekiq_concurrency
- sidekiq_current_rss
- sidekiq_memory_killer_hard_limit_rss
- sidekiq_memory_killer_phase
- sidekiq_memory_killer_soft_limit_rss
- sidekiq_running_jobs
This is very similar to #37387 (closed).
Steps to reproduce
- Create multiple Sidekiq processes in gitlab.rb:
sidekiq['queue_selector'] = true
sidekiq['queue_groups'] = [
'*',
'*',
'*'
]
- Start Sidekiq
- Within a few seconds, all but one of the gauge_all_sidekiq_N-N.db and gauge_max_sidekiq_N-N.db files will be removed from /run/gitlab/sidekiq
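To confirm the last step without eyeballing the directory, the file names can be grouped by worker. The following is a minimal Ruby sketch, not part of GitLab: the helper names are hypothetical, and it assumes the first number in gauge_all_sidekiq_N-N.db is the worker index, which is how the file names in this report appear to be laid out.

```ruby
# Hypothetical helper: returns the worker indexes that still have at least
# one gauge_all/gauge_max file in the given list of file names.
# Assumption: the first number after "sidekiq_" is the worker index.
def workers_with_gauge_files(filenames)
  pattern = /\Agauge_(?:all|max)_sidekiq_(\d+)-\d+\.db\z/
  filenames.filter_map { |name| name[pattern, 1]&.to_i }.uniq.sort
end

# Hypothetical helper: worker indexes that have no gauge file at all.
def workers_missing_gauge_files(filenames, worker_count)
  (0...worker_count).to_a - workers_with_gauge_files(filenames)
end

# Convenience wrapper for a live node; the default path is the metrics
# directory named in this report.
def workers_missing_in_dir(worker_count, dir = '/run/gitlab/sidekiq')
  workers_missing_gauge_files(Dir.children(dir), worker_count)
end
```

With the three queue groups configured above, calling workers_missing_in_dir(3) on an affected node should list the workers whose files were removed.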
What is the current bug behavior?
The last Sidekiq process to start will delete the gauge_all_sidekiq_N-N.db and gauge_max_sidekiq_N-N.db files owned by the other Sidekiq processes.
$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_2"} 930432
What is the expected correct behavior?
Sidekiq processes do not delete each other's metrics.
$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_0"} 985476
sidekiq_current_rss{pid="sidekiq_1"} 977452
sidekiq_current_rss{pid="sidekiq_2"} 979264
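The comparison between the buggy and expected output can be automated. This is a rough Ruby sketch with hypothetical helper names; it assumes the pid label always has the form sidekiq_N, as in the samples in this report, and it only parses text that has already been fetched from the /metrics endpoint.

```ruby
# Hypothetical helper: collects the pid label values reported for one metric
# in Prometheus text-format output.
def pids_for_metric(metrics_text, metric_name)
  pattern = /\A#{Regexp.escape(metric_name)}\{[^}]*pid="([^"]+)"[^}]*\}/
  metrics_text.each_line.filter_map { |line| line[pattern, 1] }.uniq.sort
end

# Hypothetical helper: pid labels expected for `worker_count` Sidekiq
# processes but absent from the output.
# Assumption: labels are named sidekiq_0 .. sidekiq_(worker_count - 1).
def missing_pids(metrics_text, metric_name, worker_count)
  expected = (0...worker_count).map { |index| "sidekiq_#{index}" }
  expected - pids_for_metric(metrics_text, metric_name)
end

# Sample taken from the buggy output in this report.
buggy_output = <<~METRICS
  sidekiq_current_rss{pid="sidekiq_2"} 930432
METRICS

puts missing_pids(buggy_output, 'sidekiq_current_rss', 3).inspect
```

On a live node the text could come from the same curl command shown above; an empty result means every process is reporting the metric.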
Relevant logs and/or screenshots
Output of checks
Results of GitLab environment info
System information
System: Debian 10
Proxy: no
Current User: git
Using RVM: no
Ruby Version: 2.7.2p137
Gem Version: 3.1.4
Bundler Version: 2.1.4
Rake Version: 13.0.3
Redis Version: 6.0.14
Git Version: 2.32.0
Sidekiq Version: 5.2.9
Go Version: unknown
GitLab information
Version: 14.0.5-ee
Revision: b044f06e4dd
Directory: /opt/gitlab/embedded/service/gitlab-rails
DB Adapter: PostgreSQL
DB Version: 12.6
Elasticsearch: no
Geo: no
Using LDAP: no
Using Omniauth: yes
Omniauth Providers:
GitLab Shell
Version: 13.19.0
Repository storage paths:
- default: /var/opt/gitlab/git-data/repositories
GitLab Shell path: /opt/gitlab/embedded/service/gitlab-shell
Git: /opt/gitlab/embedded/bin/git
Results of GitLab application Check
Checking GitLab subtasks ...
Checking GitLab Shell ...
GitLab Shell: ... GitLab Shell version >= 13.19.0 ? ... OK (13.19.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful
Checking GitLab Shell ... Finished
Checking Gitaly ...
Gitaly: ... default ... OK
Checking Gitaly ... Finished
Checking Sidekiq ...
Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/3
Checking Sidekiq ... Finished
Checking Incoming Email ...
Incoming Email: ... Reply by email is disabled in config/gitlab.yml
Checking Incoming Email ... Finished
Checking LDAP ...
LDAP: ... LDAP is disabled in config/gitlab.yml
Checking LDAP ... Finished
Checking GitLab App ...
Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ...
2/1 ... yes
Redis version >= 5.0.0? ... yes
Ruby version >= 2.7.2 ? ... yes (2.7.2)
Git version >= 2.31.0 ? ... yes (2.32.0)
Git user has default SSH configuration? ... yes
Active users: ... 1
Is authorized keys file accessible? ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Elasticsearch version 7.x (6.4 - 6.x deprecated to be removed in 13.8)? ... skipped (elasticsearch is disabled)
Checking GitLab App ... Finished
Checking GitLab subtasks ... Finished
Possible fixes
It seems likely that this was introduced by !53139 (merged), which moved Prometheus::CleanupMultiprocDirService into the Gitlab::Cluster::LifecycleEvents.on_master_start section. Instrumenting line 45 of the metrics initializer with:
if Gitlab::Runtime.sidekiq?
File.open('/tmp/cleanup_metrics.log', 'a') { |f| f.write("#{Time.now} pid #{Process.pid} - Execute CleanupMultiprocDir\n") }
end
We see that the CleanupMultiprocDirService is being executed once for each of the three Sidekiq worker processes on this node:
2021-07-16 03:28:04 +0000 pid 26155 - Execute CleanupMultiprocDir
2021-07-16 03:28:04 +0000 pid 26157 - Execute CleanupMultiprocDir
2021-07-16 03:28:05 +0000 pid 26159 - Execute CleanupMultiprocDir
Checking these processes with lsof, we see that the first two have had their files deleted:
$ lsof -p $(pgrep -fd, sidekiq) | grep "\.db.*deleted"
bundle 26155 git 7wW REG 0,21 4096 132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle 26155 git 8uW REG 0,21 4096 132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle 26155 git 9wW REG 0,21 16384 132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle 26155 git 10uW REG 0,21 16384 132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle 26157 git 7wW REG 0,21 4096 133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle 26157 git 8uW REG 0,21 4096 133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle 26157 git 9wW REG 0,21 16384 133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
bundle 26157 git 10uW REG 0,21 16384 133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
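The "(deleted)" entries are consistent with ordinary POSIX unlink semantics: a process that still holds an open handle can keep writing to a file after another process removes its directory entry, but anything that scans the directory no longer sees it. The following Ruby sketch reproduces that effect in a temporary directory. It is an illustration only, not GitLab code: the method name is hypothetical, the file name is borrowed from the lsof output above, and File.delete stands in for whatever the cleanup service actually does.

```ruby
require 'tmpdir'

# Hypothetical demo: one handle keeps writing while the directory entry is
# removed underneath it (assumes a POSIX filesystem, e.g. Linux).
def simulate_cleanup_race
  Dir.mktmpdir do |dir|
    path = File.join(dir, 'gauge_all_sidekiq_0-0.db')

    File.open(path, 'w') do |file|
      file.write('first sample')
      file.flush
      visible_before = Dir.children(dir)

      # Stand-in for a later process cleaning the shared directory.
      File.delete(path)

      # The write still succeeds because the open handle keeps the inode alive.
      bytes_written = file.write('second sample')
      file.flush
      visible_after = Dir.children(dir)

      { before: visible_before, after: visible_after, written: bytes_written }
    end
  end
end

result = simulate_cleanup_race
puts result.inspect
```

In the sketch the file is listed before the delete and absent afterwards even though the second write succeeds, which would explain why only the last process to start still has metrics visible to the exporter.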