Sidekiq RubySampler metrics missing due to prometheus metrics files unlinked
Summary
The Sidekiq Prometheus metrics files tmp/prometheus_multiproc_dir/sidekiq/gauge_all_sidekiq-*.db
are created by:
# from file `config/initializers/7_prometheus_metrics.rb`
if !Rails.env.test? && Gitlab::Metrics.prometheus_metrics_enabled?
  Gitlab::Cluster::LifecycleEvents.on_worker_start do
    defined?(::Prometheus::Client.reinitialize_on_pid_change) && Prometheus::Client.reinitialize_on_pid_change
    Gitlab::Metrics::Samplers::RubySampler.initialize_instance(Settings.monitoring.ruby_sampler_interval).start
  end
  Gitlab::Cluster::LifecycleEvents.on_master_start do
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)
  end
end
But the files are unlinked later by:
# from file `config/initializers/7_prometheus_metrics.rb`
Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute
    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end
As a result, Sidekiq RubySampler metrics are all missing.
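Why does CleanupMultiprocDirService run later than reinitialize_on_pid_change?
Because for Sidekiq, Gitlab::Cluster::LifecycleEvents.on_worker_start yields to the block immediately, while the initializer is still loading, instead of hooking the block into Sidekiq's startup event. The cleanup, by contrast, is registered with config.on(:startup) and therefore runs afterwards.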
def on_worker_start(&block)
  if in_clustered_environment?
    # Defer block execution
    (@worker_start_hooks ||= []) << block
  else
    yield
  end
end

def in_clustered_environment?
  # Sidekiq doesn't fork
  return false if Sidekiq.server?
  # Unicorn always forks
  return true if defined?(::Unicorn)
  # Puma sometimes forks
  return true if in_clustered_puma?
  # Default assumption is that we don't fork
  false
end
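
The ordering problem can be reproduced outside GitLab with a small, self-contained sketch. Everything in it is a hypothetical stand-in: FakeLifecycle, the temporary directory, and the startup_hooks array only mimic the behavior described above and are not GitLab's or Sidekiq's actual implementation.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for Gitlab::Cluster::LifecycleEvents.
module FakeLifecycle
  def self.on_worker_start(clustered:, &block)
    if clustered
      # Defer block execution
      (@worker_start_hooks ||= []) << block
    else
      # Non-clustered (the Sidekiq case): run right away, during initializer load
      yield
    end
  end
end

def simulate_sidekiq_boot
  Dir.mktmpdir do |dir|
    metrics_file = File.join(dir, 'gauge_all_sidekiq-0.db')
    startup_hooks = [] # stands in for Sidekiq's config.on(:startup) registry

    # Initializer phase: the block runs immediately and creates the file.
    FakeLifecycle.on_worker_start(clustered: false) do
      FileUtils.touch(metrics_file)
    end
    created = File.exist?(metrics_file)

    # Initializer phase: the cleanup is only registered here, not executed.
    startup_hooks << -> { FileUtils.rm_f(Dir.glob(File.join(dir, '*.db'))) }

    # Startup phase: the cleanup runs last and unlinks the file.
    startup_hooks.each(&:call)

    { created_during_init: created, present_after_startup: File.exist?(metrics_file) }
  end
end

result = simulate_sidekiq_boot
puts result
```

In this model the file exists after the initializer phase and is gone after the startup phase, which matches the observed behavior.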
Steps to reproduce
These are the steps to reproduce this issue in GDK:
1) In config/gitlab.yml, enable sidekiq_exporter:
     sidekiq_exporter:
       enabled: true
       address: localhost
       port: 3807
2) Delete all files under tmp/prometheus_multiproc_dir/sidekiq.
3) Run gdk run.
4) Watch the folder tmp/prometheus_multiproc_dir/sidekiq. Its contents change as follows:
4.1) Two files, gauge_all_sidekiq-1.db and gauge_all_sidekiq-0.db, are created. This happens around the time rails-background-jobs starts Sidekiq, and is done by config/initializers/7_prometheus_metrics.rb, where:
if !Rails.env.test? && Gitlab::Metrics.prometheus_metrics_enabled?
  Gitlab::Cluster::LifecycleEvents.on_worker_start do
    defined?(::Prometheus::Client.reinitialize_on_pid_change) && Prometheus::Client.reinitialize_on_pid_change
    Gitlab::Metrics::Samplers::RubySampler.initialize_instance(Settings.monitoring.ruby_sampler_interval).start
  end
  Gitlab::Cluster::LifecycleEvents.on_master_start do
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)
  end
end
4.2) After a while, all files are deleted. This happens around the time rails-web initializes, and is done by config/initializers/7_prometheus_metrics.rb, which adds some cleanup logic:
Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute
    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end
After that, files gauge_all_sidekiq-1.db and gauge_all_sidekiq-0.db are never re-created.
Running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects returns no data.
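
The same check can be scripted in Ruby, which is handy when repeating the reproduction. This is only a sketch: metric_present? and sidekiq_metric_present? are hypothetical helper names, and the URL is the exporter address configured in the steps above.

```ruby
require 'net/http'
require 'uri'

# Returns true if a Prometheus text-format body contains at least one sample
# line for metric_name. Comment lines (# HELP / # TYPE) are ignored.
def metric_present?(body, metric_name)
  pattern = /\A#{Regexp.escape(metric_name)}(\{|\s)/
  body.each_line.any? do |line|
    next false if line.start_with?('#')
    line.match?(pattern)
  end
end

# Fetches the exporter endpoint and checks for the metric.
# Requires a running Sidekiq with sidekiq_exporter enabled.
def sidekiq_metric_present?(metric_name, url: 'http://localhost:3807/metrics')
  metric_present?(Net::HTTP.get(URI(url)), metric_name)
end
```

With the bug present, sidekiq_metric_present?('ruby_gc_stat_total_freed_objects') would be expected to return false. Unlike the grep above, this helper does not count HELP or TYPE comment lines as data.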
What is the current bug behavior?
In GDK, with sidekiq_exporter enabled, running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects returns no data.
What is the expected correct behavior?
In GDK, with sidekiq_exporter enabled, running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects should return RubySampler data.
Possible fixes
There are two options to fix this issue:
Option 1) This is more of a patch: reinitialize the Prometheus metrics files after the cleanup, in config/initializers/7_prometheus_metrics.rb, right after the existing cleanup logic:
Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute
    # fix option 1: reinitialize after the cleanup action
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)
    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end
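
The effect of this ordering can be illustrated with a file-system-only sketch. The helpers cleanup_multiproc_dir, reinitialize_metrics_files, and startup_with_fix are hypothetical stand-ins for the cleanup service and the reinitialize call; they are not the real GitLab or Prometheus client APIs.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for Prometheus::CleanupMultiprocDirService.
def cleanup_multiproc_dir(dir)
  FileUtils.rm_f(Dir.glob(File.join(dir, '*.db')))
end

# Hypothetical stand-in for recreating the metrics files.
def reinitialize_metrics_files(dir)
  FileUtils.touch(File.join(dir, 'gauge_all_sidekiq-0.db'))
end

# Models the startup hook with Option 1 applied.
def startup_with_fix(dir)
  cleanup_multiproc_dir(dir)      # removes files left over from the initializer phase
  reinitialize_metrics_files(dir) # Option 1: recreate the files right after cleanup
  Dir.children(dir).sort
end

files_after_startup = Dir.mktmpdir do |dir|
  # File created earlier, during the initializer phase.
  FileUtils.touch(File.join(dir, 'gauge_all_sidekiq-1.db'))
  startup_with_fix(dir)
end
puts files_after_startup.inspect
```

In this model the directory is not left empty after startup, so the sampler would have a file to write to.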
Option 2) This seems more correct, but is more complex: make Gitlab::Cluster::LifecycleEvents.on_worker_start really happen at Sidekiq's startup lifecycle event.
We need to change two locations:
Location 1, in file lib/gitlab/cluster/lifecycle_events.rb: do NOT yield to the worker-start block immediately.
def on_worker_start(&block)
  # if in_clustered_environment?
  # Defer block execution
  (@worker_start_hooks ||= []) << block
  # else
  #   yield
  # end
end
Note: more changes may be needed in Gitlab::Cluster::LifecycleEvents. We first need to understand why the current design treats Sidekiq this way:
def in_clustered_environment?
  # Sidekiq doesn't fork
  return false if Sidekiq.server?
  ...
end
Location 2, in file config/initializers/sidekiq.rb
config.on :startup do
  # Clear any connections that might have been obtained before starting
  # Sidekiq (e.g. in an initializer).
  ActiveRecord::Base.clear_all_connections!
  # Start monitor to track running jobs. By default, cancel job is not enabled
  # To cancel job, it requires `SIDEKIQ_MONITOR_WORKER=1` to enable notification channel
  Gitlab::SidekiqDaemon::Monitor.instance.start
  # option 2, Signal application hooks of worker start
  Gitlab::Cluster::LifecycleEvents.do_worker_start
  Gitlab::SidekiqDaemon::MemoryKiller.instance.start if enable_sidekiq_memory_killer && use_sidekiq_daemon_memory_killer
end
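
A simplified model of Option 2 is sketched below. DeferredLifecycle is a hypothetical module, not the real Gitlab::Cluster::LifecycleEvents, and the events array only records ordering. The sketch assumes the cleanup startup hook runs before do_worker_start is called, which would need to be verified against the order in which the initializers register their Sidekiq startup hooks.

```ruby
# Hypothetical, simplified hook registry that always defers worker-start blocks.
module DeferredLifecycle
  class << self
    # Store the block instead of yielding to it immediately.
    def on_worker_start(&block)
      worker_start_hooks << block
    end

    # Run all deferred blocks; meant to be called from the startup event.
    def do_worker_start
      worker_start_hooks.each(&:call)
    end

    private

    def worker_start_hooks
      @worker_start_hooks ||= []
    end
  end
end

events = []

# Initializer phase: the block is only registered.
DeferredLifecycle.on_worker_start { events << :reinitialize_metrics }
events << :initializers_loaded

# Startup phase: cleanup first (assumed order), then the deferred hooks.
events << :cleanup_multiproc_dir
DeferredLifecycle.do_worker_start

puts events.inspect
```

With deferral, the reinitialize step is recorded after the cleanup step, so the metrics files would be created after the directory has been cleaned rather than before.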