Per-process Prometheus metrics for Sidekiq are missing for all but one process

Summary

When multiple Sidekiq processes are present, the last one to start will delete the per-process metrics for all other processes.

Impacted metrics:
gitlab_database_connection_pool_busy
gitlab_database_connection_pool_connections
gitlab_database_connection_pool_dead
gitlab_database_connection_pool_idle
gitlab_database_connection_pool_size
gitlab_database_connection_pool_waiting
gitlab_ruby_threads_max_expected_threads
gitlab_ruby_threads_running_threads
ruby_file_descriptors
ruby_gc_stat_compact_count
ruby_gc_stat_count
ruby_gc_stat_heap_allocatable_pages
ruby_gc_stat_heap_allocated_pages
ruby_gc_stat_heap_available_slots
ruby_gc_stat_heap_eden_pages
ruby_gc_stat_heap_final_slots
ruby_gc_stat_heap_free_slots
ruby_gc_stat_heap_live_slots
ruby_gc_stat_heap_marked_slots
ruby_gc_stat_heap_sorted_length
ruby_gc_stat_heap_tomb_pages
ruby_gc_stat_major_gc_count
ruby_gc_stat_malloc_increase_bytes
ruby_gc_stat_malloc_increase_bytes_limit
ruby_gc_stat_minor_gc_count
ruby_gc_stat_old_objects
ruby_gc_stat_old_objects_limit
ruby_gc_stat_oldmalloc_increase_bytes
ruby_gc_stat_oldmalloc_increase_bytes_limit
ruby_gc_stat_remembered_wb_unprotected_objects
ruby_gc_stat_remembered_wb_unprotected_objects_limit
ruby_gc_stat_total_allocated_objects
ruby_gc_stat_total_allocated_pages
ruby_gc_stat_total_freed_objects
ruby_gc_stat_total_freed_pages
ruby_process_cpu_seconds_total
ruby_process_max_fds
ruby_process_proportional_memory_bytes
ruby_process_resident_memory_bytes
ruby_process_start_time_seconds
ruby_process_unique_memory_bytes
sidekiq_concurrency
sidekiq_current_rss
sidekiq_memory_killer_hard_limit_rss
sidekiq_memory_killer_phase
sidekiq_memory_killer_soft_limit_rss
sidekiq_running_jobs

This is very similar to #37387 (closed).

Steps to reproduce

  • Create multiple Sidekiq processes in gitlab.rb:
sidekiq['queue_selector'] = true
sidekiq['queue_groups'] = [
  '*',
  '*',
  '*'
]
  • Start Sidekiq
  • Within a few seconds, all but one of the gauge_all_sidekiq_N-N.db and gauge_max_sidekiq_N-N.db files will be removed from /run/gitlab/sidekiq
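
To confirm the last step without eyeballing the directory, the surviving files can be grouped by worker. The following is a minimal sketch, assuming the first number in `sidekiq_N-N` identifies the worker (as the file names in this report suggest). The helper name `workers_with_gauge_files` is hypothetical, and the demo runs against a temporary directory rather than /run/gitlab/sidekiq.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: return the worker indices that still have at least one
# gauge_all/gauge_max .db file in the given directory.
# Assumption: files are named gauge_{all,max}_sidekiq_<worker>-<n>.db.
def workers_with_gauge_files(dir)
  Dir.glob(File.join(dir, 'gauge_{all,max}_sidekiq_*.db'))
     .map { |path| File.basename(path)[/sidekiq_(\d+)-\d+\.db\z/, 1] }
     .compact
     .map(&:to_i)
     .uniq
     .sort
end

# Demo against a throwaway directory that mimics the buggy end state,
# where only the files of worker 2 are left.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'gauge_all_sidekiq_2-0.db'))
  FileUtils.touch(File.join(dir, 'gauge_max_sidekiq_2-0.db'))
  puts workers_with_gauge_files(dir).inspect
end
```

On an affected node, pointing the helper at /run/gitlab/sidekiq shortly after startup should list a single worker index instead of one per configured queue group.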

What is the current bug behavior?

The last Sidekiq process to start will delete the gauge_all_sidekiq_N-N.db and gauge_max_sidekiq_N-N.db files owned by the other Sidekiq processes.

$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_2"} 930432

What is the expected correct behavior?

Sidekiq processes do not delete each other's metrics.

$ curl --silent localhost:8082/metrics | grep "^sidekiq_current_rss"
sidekiq_current_rss{pid="sidekiq_0"} 985476
sidekiq_current_rss{pid="sidekiq_1"} 977452
sidekiq_current_rss{pid="sidekiq_2"} 979264
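
The comparison between the buggy and expected output can also be scripted. This is a rough sketch that extracts the `pid` label values for one metric from Prometheus text-format output. The helper name `pids_reporting` is hypothetical, and the sample strings are copied from the outputs in this report rather than fetched from a live endpoint.

```ruby
# Hypothetical helper: list the distinct `pid` label values reported for a
# given metric name in Prometheus text-format output.
def pids_reporting(metrics_text, metric_name)
  pattern = /\A#{Regexp.escape(metric_name)}\{[^}]*pid="([^"]+)"[^}]*\}\s/
  metrics_text.each_line.filter_map { |line| line[pattern, 1] }.uniq.sort
end

# Output observed with the bug present (only the last worker reports).
buggy_output = <<~METRICS
  sidekiq_current_rss{pid="sidekiq_2"} 930432
METRICS

# Output expected once every worker keeps its own metrics files.
expected_output = <<~METRICS
  sidekiq_current_rss{pid="sidekiq_0"} 985476
  sidekiq_current_rss{pid="sidekiq_1"} 977452
  sidekiq_current_rss{pid="sidekiq_2"} 979264
METRICS

puts pids_reporting(buggy_output, 'sidekiq_current_rss').inspect
# => ["sidekiq_2"]
puts pids_reporting(expected_output, 'sidekiq_current_rss').inspect
# => ["sidekiq_0", "sidekiq_1", "sidekiq_2"]
```

Comparing the number of returned pids with the number of entries in sidekiq['queue_groups'] gives a quick pass/fail check for this bug.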

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info

System information
System:         Debian 10
Proxy:          no
Current User:   git
Using RVM:      no
Ruby Version:   2.7.2p137
Gem Version:    3.1.4
Bundler Version:2.1.4
Rake Version:   13.0.3
Redis Version:  6.0.14
Git Version:    2.32.0
Sidekiq Version:5.2.9
Go Version:     unknown

GitLab information
Version:        14.0.5-ee
Revision:       b044f06e4dd
Directory:      /opt/gitlab/embedded/service/gitlab-rails
DB Adapter:     PostgreSQL
DB Version:     12.6
Elasticsearch:  no
Geo:            no
Using LDAP:     no
Using Omniauth: yes
Omniauth Providers:

GitLab Shell
Version:        13.19.0
Repository storage paths:
- default:      /var/opt/gitlab/git-data/repositories
GitLab Shell path:              /opt/gitlab/embedded/service/gitlab-shell
Git:            /opt/gitlab/embedded/bin/git

Results of GitLab application Check

Checking GitLab subtasks ...

Checking GitLab Shell ...

GitLab Shell: ... GitLab Shell version >= 13.19.0 ? ... OK (13.19.0)
Running /opt/gitlab/embedded/service/gitlab-shell/bin/check
Internal API available: OK
Redis available via internal API: OK
gitlab-shell self-check successful

Checking GitLab Shell ... Finished

Checking Gitaly ...

Gitaly: ... default ... OK

Checking Gitaly ... Finished

Checking Sidekiq ...

Sidekiq: ... Running? ... yes
Number of Sidekiq processes (cluster/worker) ... 1/3

Checking Sidekiq ... Finished

Checking Incoming Email ...

Incoming Email: ... Reply by email is disabled in config/gitlab.yml

Checking Incoming Email ... Finished

Checking LDAP ...

LDAP: ... LDAP is disabled in config/gitlab.yml

Checking LDAP ... Finished

Checking GitLab App ...

Git configured correctly? ... yes
Database config exists? ... yes
All migrations up? ... yes
Database contains orphaned GroupMembers? ... no
GitLab config exists? ... yes
GitLab config up to date? ... yes
Log directory writable? ... yes
Tmp directory writable? ... yes
Uploads directory exists? ... yes
Uploads directory has correct permissions? ... yes
Uploads directory tmp has correct permissions? ... skipped (no tmp uploads folder yet)
Init script exists? ... skipped (omnibus-gitlab has no init script)
Init script up-to-date? ... skipped (omnibus-gitlab has no init script)
Projects have namespace: ... 2/1 ... yes
Redis version >= 5.0.0? ... yes
Ruby version >= 2.7.2 ? ... yes (2.7.2)
Git version >= 2.31.0 ? ... yes (2.32.0)
Git user has default SSH configuration? ... yes
Active users: ... 1
Is authorized keys file accessible? ... yes
GitLab configured to store new projects in hashed storage? ... yes
All projects are in hashed storage? ... yes
Elasticsearch version 7.x (6.4 - 6.x deprecated to be removed in 13.8)? ... skipped (elasticsearch is disabled)

Checking GitLab App ... Finished

Checking GitLab subtasks ... Finished

Possible fixes

It seems likely that this was introduced by !53139 (merged), which moved Prometheus::CleanupMultiprocDirService into the Gitlab::Cluster::LifecycleEvents.on_master_start section. Instrumenting line 45 of the metrics initializer with:

if Gitlab::Runtime.sidekiq?
  File.open('/tmp/cleanup_metrics.log', 'a') { |f| f.write("#{Time.now} pid #{Process.pid} - Execute CleanupMultiprocDir\n") }
end

We see that the CleanupMultiprocDirService is executed once for each of the three Sidekiq worker processes on this node:

2021-07-16 03:28:04 +0000 pid 26155 - Execute CleanupMultiprocDir
2021-07-16 03:28:04 +0000 pid 26157 - Execute CleanupMultiprocDir
2021-07-16 03:28:05 +0000 pid 26159 - Execute CleanupMultiprocDir

Checking these processes with lsof, we see that the first two have had their files deleted:

$ lsof -p $(pgrep -fd, sidekiq) | grep "\.db.*deleted" 
bundle  26155  git    7wW     REG               0,21     4096  132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle  26155  git    8uW     REG               0,21     4096  132607 /run/gitlab/sidekiq/gauge_max_sidekiq_0-0.db (deleted)
bundle  26155  git    9wW     REG               0,21    16384  132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle  26155  git   10uW     REG               0,21    16384  132608 /run/gitlab/sidekiq/gauge_all_sidekiq_0-0.db (deleted)
bundle  26157  git    7wW     REG               0,21     4096  133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle  26157  git    8uW     REG               0,21     4096  133470 /run/gitlab/sidekiq/gauge_max_sidekiq_1-0.db (deleted)
bundle  26157  git    9wW     REG               0,21    16384  133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
bundle  26157  git   10uW     REG               0,21    16384  133471 /run/gitlab/sidekiq/gauge_all_sidekiq_1-0.db (deleted)
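
The effect of running the cleanup in every worker can be reproduced with a small simulation. This is a simplified model, not the actual CleanupMultiprocDirService code: it assumes the cleanup removes every .db file in the metrics directory and that each worker creates its gauge files right after its own cleanup. All helper names are hypothetical, and the workers are started sequentially in a temporary directory instead of as real processes.

```ruby
require 'tmpdir'
require 'fileutils'

# Simplified stand-in for the cleanup step (assumption): delete every .db file.
def cleanup_metrics_dir(dir)
  FileUtils.rm_f(Dir.glob(File.join(dir, '*.db')))
end

# Simulated worker start: optionally clean the directory first, then create
# this worker's own gauge files.
def start_worker(dir, index, cleanup_on_start:)
  cleanup_metrics_dir(dir) if cleanup_on_start
  %w[gauge_all gauge_max].each do |prefix|
    FileUtils.touch(File.join(dir, "#{prefix}_sidekiq_#{index}-0.db"))
  end
end

# Start the workers one after another and return the file names left over.
def surviving_files(worker_count:, cleanup_per_worker:)
  Dir.mktmpdir do |dir|
    # When cleanup is not per worker, run it once before any worker starts.
    cleanup_metrics_dir(dir) unless cleanup_per_worker
    worker_count.times do |index|
      start_worker(dir, index, cleanup_on_start: cleanup_per_worker)
    end
    Dir.children(dir).sort
  end
end

# Cleanup in every worker: only the last worker's files survive.
puts surviving_files(worker_count: 3, cleanup_per_worker: true).inspect
# => ["gauge_all_sidekiq_2-0.db", "gauge_max_sidekiq_2-0.db"]

# Cleanup once up front: all six files survive.
puts surviving_files(worker_count: 3, cleanup_per_worker: false).length
# => 6
```

In this model the fix amounts to making sure the cleanup runs exactly once per node, before any worker has created its files, rather than once per worker process.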

/cc @alipniagov @mkaeppler

Edited by Will Chandler (ex-GitLab)