Sidekiq RubySampler metrics missing because Prometheus metrics files are unlinked

Summary

The Sidekiq Prometheus metrics files tmp/prometheus_multiproc_dir/sidekiq/gauge_all_sidekiq-*.db are created by:

# from file `config/initializers/7_prometheus_metrics.rb`

if !Rails.env.test? && Gitlab::Metrics.prometheus_metrics_enabled?
  Gitlab::Cluster::LifecycleEvents.on_worker_start do
    defined?(::Prometheus::Client.reinitialize_on_pid_change) && Prometheus::Client.reinitialize_on_pid_change

    Gitlab::Metrics::Samplers::RubySampler.initialize_instance(Settings.monitoring.ruby_sampler_interval).start
  end

  Gitlab::Cluster::LifecycleEvents.on_master_start do
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)

But the files are unlinked later by:

# from file `config/initializers/7_prometheus_metrics.rb`

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute
    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end

As a result, Sidekiq RubySampler metrics are all missing.

Why does CleanupMultiprocDirService run later than reinitialize_on_pid_change?

Because Gitlab::Cluster::LifecycleEvents.on_worker_start yields the block immediately when running under Sidekiq, instead of hooking it to the Sidekiq startup event. The block therefore runs while the initializer is loading, before the cleanup in the startup hook.

        def on_worker_start(&block)
          if in_clustered_environment?
            # Defer block execution
            (@worker_start_hooks ||= []) << block
          else
            yield
          end
        end

        def in_clustered_environment?
          # Sidekiq doesn't fork
          return false if Sidekiq.server?

          # Unicorn always forks
          return true if defined?(::Unicorn)

          # Puma sometimes forks
          return true if in_clustered_puma?

          # Default assumption is that we don't fork
          false
        end
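
The ordering described above can be pictured with a small standalone sketch. `FakeLifecycleEvents` and `simulate_sidekiq_boot` are hypothetical stand-ins written for this illustration, not GitLab code; only the branch in `on_worker_start` is modelled, and the Sidekiq startup cleanup is reduced to a single event appended after the initializer has run.

```ruby
# Hypothetical stand-in for Gitlab::Cluster::LifecycleEvents.
# Only the immediate-yield vs. deferred branch is modelled.
module FakeLifecycleEvents
  class << self
    attr_writer :sidekiq_server

    def on_worker_start(&block)
      if in_clustered_environment?
        # Forking servers: keep the block for later
        (@worker_start_hooks ||= []) << block
      else
        # Sidekiq: the block runs right away, during initializer load
        yield
      end
    end

    def in_clustered_environment?
      # Mirrors "Sidekiq doesn't fork"; other servers are not modelled
      !@sidekiq_server
    end
  end
end

# Returns the order of events for one simulated Sidekiq boot.
def simulate_sidekiq_boot
  events = []
  FakeLifecycleEvents.sidekiq_server = true

  # Initializer load: the block is executed here
  FakeLifecycleEvents.on_worker_start { events << :reinitialize_and_start_sampler }

  # The Sidekiq :startup hook fires only after initializers have loaded
  events << :cleanup_multiproc_dir
  events
end

p simulate_sidekiq_boot
# => [:reinitialize_and_start_sampler, :cleanup_multiproc_dir]
```

In this model the sampler is set up first and the cleanup runs afterwards, which matches the order observed in the issue.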

Steps to reproduce

These are the steps to reproduce this issue in GDK:

1. In config/gitlab.yml, enable sidekiq_exporter:

       sidekiq_exporter:
         enabled: true
         address: localhost
         port: 3807

2. Delete all files under tmp/prometheus_multiproc_dir/sidekiq.
3. Run gdk run.
4. Watch the folder tmp/prometheus_multiproc_dir/sidekiq. Its contents change as follows:

4.1) Two files, gauge_all_sidekiq-1.db and gauge_all_sidekiq-0.db, are created. This happens around the time rails-background-jobs starts Sidekiq, in config/initializers/7_prometheus_metrics.rb:

if !Rails.env.test? && Gitlab::Metrics.prometheus_metrics_enabled?
  Gitlab::Cluster::LifecycleEvents.on_worker_start do
    defined?(::Prometheus::Client.reinitialize_on_pid_change) && Prometheus::Client.reinitialize_on_pid_change

    Gitlab::Metrics::Samplers::RubySampler.initialize_instance(Settings.monitoring.ruby_sampler_interval).start
  end

  Gitlab::Cluster::LifecycleEvents.on_master_start do
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)

4.2) After a while, all files are deleted. This happens around the time rails-web initializes, in config/initializers/7_prometheus_metrics.rb, which adds some cleanup logic:

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute
    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end

After that, the files gauge_all_sidekiq-1.db and gauge_all_sidekiq-0.db are never re-created. Running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects returns no data.
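
For a scripted version of the curl | grep check, something like the following could be used. `metric_samples` is a hypothetical helper written for this illustration. It is slightly stricter than grep, since it ignores `# HELP` and `# TYPE` lines and matches only sample lines. The payload in the example is made up to show the shape of the output, not real exporter data.

```ruby
# Returns the sample lines for `metric_name` from Prometheus text exposition
# output, skipping comment lines such as `# HELP` and `# TYPE`.
def metric_samples(body, metric_name)
  body.each_line.map(&:strip).select do |line|
    next false if line.empty? || line.start_with?("#")

    # A sample line looks like `name{labels} value` or `name value`
    line.start_with?("#{metric_name}{", "#{metric_name} ")
  end
end

# Illustrative payload only; label names and values are invented.
example_body = <<~METRICS
  # HELP ruby_gc_stat_total_freed_objects Example help text
  ruby_gc_stat_total_freed_objects{pid="example"} 12
  some_other_metric 1
METRICS

puts metric_samples(example_body, "ruby_gc_stat_total_freed_objects").size
# => 1
```

Against a running GDK, the body could be fetched with `Net::HTTP.get(URI("http://localhost:3807/metrics"))` after `require "net/http"`. An empty result for ruby_gc_stat_total_freed_objects corresponds to the bug described here.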

What is the current bug behavior?

In GDK, with sidekiq_exporter enabled, running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects returns no data.

What is the expected correct behavior?

In GDK, with sidekiq_exporter enabled, running curl http://localhost:3807/metrics | grep ruby_gc_stat_total_freed_objects should return RubySampler data.

Possible fixes

There are 2 options to fix this issue:

Option 1) This is more of a patch: reinitialize the Prometheus metrics files after the cleanup.

In config/initializers/7_prometheus_metrics.rb, right after the cleanup logic:

Sidekiq.configure_server do |config|
  config.on(:startup) do
    # webserver metrics are cleaned up in config.ru: `warmup` block
    Prometheus::CleanupMultiprocDirService.new.execute

    # fix option 1: reinitialize after the cleanup
    ::Prometheus::Client.reinitialize_on_pid_change(force: true)

    Gitlab::Metrics::SidekiqMetricsExporter.instance.start
  end
end
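
The effect of Option 1 can be pictured at the file level with the sketch below. It assumes POSIX unlink semantics, where a process can keep writing to a file that has been removed from its directory. `FakeMetricsWriter`, `cleanup`, and `visible_files_after_cleanup` are hypothetical stand-ins for this illustration, not the Prometheus client or the real cleanup service.

```ruby
require "tmpdir"

# Hypothetical stand-in for a process writing one multiprocess metrics file.
class FakeMetricsWriter
  def initialize(dir)
    @dir = dir
    reinitialize
  end

  # Opens (or re-opens) the backing file, as a forced reinitialize might.
  def reinitialize
    @io&.close
    @io = File.open(File.join(@dir, "gauge_all_sidekiq-0.db"), "w")
  end

  def record(value)
    @io.write(value.to_s)
    @io.flush
  end

  def close
    @io.close
  end
end

# Stand-in for the cleanup step: unlink every .db file in the directory.
def cleanup(dir)
  Dir.glob(File.join(dir, "*.db")).each { |path| File.unlink(path) }
end

# Returns the files visible in the directory after cleanup and one write.
def visible_files_after_cleanup(reinitialize:)
  Dir.mktmpdir do |dir|
    writer = FakeMetricsWriter.new(dir)
    begin
      cleanup(dir)
      writer.reinitialize if reinitialize
      writer.record(42)
      Dir.children(dir).sort
    ensure
      writer.close
    end
  end
end

p visible_files_after_cleanup(reinitialize: false)
# => []
p visible_files_after_cleanup(reinitialize: true)
# => ["gauge_all_sidekiq-0.db"]
```

Without the reinitialize step the write still succeeds, but it goes to an unlinked file that nothing reading the directory can find. That would be consistent with the exporter returning no RubySampler data.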

Option 2) This seems more correct but is more complex: make Gitlab::Cluster::LifecycleEvents.on_worker_start actually run at the Sidekiq startup lifecycle event.

We need to change 2 locations:

Location 1: in lib/gitlab/cluster/lifecycle_events.rb, do NOT yield the worker-start block immediately:

        def on_worker_start(&block)
#           if in_clustered_environment?
            # Defer block execution
            (@worker_start_hooks ||= []) << block
#           else
#             yield
#           end
        end

Note: more changes may be needed in `Gitlab::Cluster::LifecycleEvents`. We first need to understand why the current design treats Sidekiq this way:

        def in_clustered_environment?
          # Sidekiq doesn't fork
          return false if Sidekiq.server?
         ...
        end

Location 2: in config/initializers/sidekiq.rb, signal the worker-start hooks from the startup event:

  config.on :startup do
    # Clear any connections that might have been obtained before starting
    # Sidekiq (e.g. in an initializer).
    ActiveRecord::Base.clear_all_connections!

    # Start monitor to track running jobs. By default, cancel job is not enabled
    # To cancel job, it requires `SIDEKIQ_MONITOR_WORKER=1` to enable notification channel
    Gitlab::SidekiqDaemon::Monitor.instance.start

    # option 2, Signal application hooks of worker start
    Gitlab::Cluster::LifecycleEvents.do_worker_start
   
    Gitlab::SidekiqDaemon::MemoryKiller.instance.start if enable_sidekiq_memory_killer && use_sidekiq_daemon_memory_killer
  end
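
A standalone sketch of how Option 2 might change the ordering follows. `DeferredLifecycleEvents`, `FakeSidekiqConfig`, and `simulate_option_two` are hypothetical stand-ins for this illustration. The sketch assumes that startup hooks run in registration order and that 7_prometheus_metrics.rb registers its hook before sidekiq.rb does. Both assumptions would need to be checked against the real boot sequence.

```ruby
# Hypothetical stand-in for the changed LifecycleEvents: always defer.
module DeferredLifecycleEvents
  class << self
    def on_worker_start(&block)
      (@worker_start_hooks ||= []) << block
    end

    def do_worker_start
      (@worker_start_hooks || []).each(&:call)
    end

    def reset!
      @worker_start_hooks = []
    end
  end
end

# Minimal model of the Sidekiq startup event. Running hooks in registration
# order is an assumption of this sketch.
class FakeSidekiqConfig
  def initialize
    @startup_hooks = []
  end

  def on(event, &block)
    @startup_hooks << block if event == :startup
  end

  def fire_startup
    @startup_hooks.each(&:call)
  end
end

# Returns the order of events for one simulated boot under Option 2.
def simulate_option_two
  events = []
  DeferredLifecycleEvents.reset!
  config = FakeSidekiqConfig.new

  # 7_prometheus_metrics.rb: the block is only stored at this point
  DeferredLifecycleEvents.on_worker_start { events << :reinitialize_and_start_sampler }
  config.on(:startup) { events << :cleanup_multiproc_dir }

  # sidekiq.rb: assumed to load later, so its hook is registered later
  config.on(:startup) { DeferredLifecycleEvents.do_worker_start }

  config.fire_startup
  events
end

p simulate_option_two
# => [:cleanup_multiproc_dir, :reinitialize_and_start_sampler]
```

Under these assumptions the cleanup runs before the sampler is set up, so the metrics files created by the sampler are no longer unlinked.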

Edited Oct 27, 2021 by 🤖 GitLab Bot 🤖