
Replace sidekiq resource usage histograms with sum

Gregorius Marco requested to merge mg-remove-more-sidekiq-histograms into master

What does this MR do and why?

As part of the audit of unused Sidekiq metrics in gitlab-com/gl-infra/scalability#2297 (closed), this MR replaces the histogram metrics for CPU, DB, Gitaly, Redis, and Elasticsearch (ES) duration with a plain sum counter per resource. Each sum already existed implicitly as the *_sum series of its histogram. Histogram metrics are high in cardinality by nature, because every bucket is a separate series, and the bucketed values are not very accurate anyway.

This MR also adds sidekiq_jobs_completion_count, a counter that is incremented after each job completes. It is meant to replace the five *_count series, one from each histogram (CPU, DB, Gitaly, Redis, ES), which saves around 400k series; see gitlab-com/gl-infra/scalability#2297 (comment 1580045270).
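To make the cardinality argument concrete, here is a rough sketch of the series emitted per label set before and after this change. The bucket count is a hypothetical parameter (the real histograms do not all use the same buckets), and the helper names are illustrative only, not code from this MR.

```ruby
# Number of resource histograms covered here: CPU, DB, Gitaly, Redis, ES.
RESOURCE_HISTOGRAMS = 5

# Series per label set while histograms are emitted: every histogram
# contributes one series per bucket (including le="+Inf"), plus *_sum and *_count.
def series_with_histograms(buckets_per_histogram)
  RESOURCE_HISTOGRAMS * (buckets_per_histogram + 2)
end

# Series per label set after the change: one *_sum per resource,
# plus the single shared sidekiq_jobs_completion_count.
def series_with_sums
  RESOURCE_HISTOGRAMS + 1
end

# Assumption for illustration: 5 buckets per histogram.
puts "with histograms: #{series_with_histograms(5)}, with sums: #{series_with_sums}"
```

This only counts series for a single label combination; the total saving depends on how many worker, queue, and other label combinations exist.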

Dividing a sum by the count gives the average duration of that resource's usage per job (CPU, DB, Gitaly, Redis, or ES duration).
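As a minimal sketch of that division, assuming the two values have already been read for the same label set (the helper name average_duration is hypothetical):

```ruby
# Average seconds of resource usage per completed job for one label set.
# Returns nil when no job has completed yet, to avoid dividing by zero.
def average_duration(sum_seconds, completion_count)
  return nil if completion_count.zero?

  sum_seconds / completion_count.to_f
end

# Values taken from the sample output in this description:
# sidekiq_jobs_cpu_seconds_sum and sidekiq_jobs_completion_count for Chaos::SleepWorker.
puts average_duration(0.0076420839999999934, 1)
```

In a dashboard the same ratio would normally be taken over rates of the two counters rather than their raw values, so that it reflects a time window.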

More context: gitlab-com/gl-infra/scalability#2297 (comment 1569512620)

Note:

  • Histogram metrics implicitly emit 3 series: *_bucket, *_sum and *_count. The *_count series of the 5 metrics are used in GitLab.com dashboards to display averages (here, for example).
  • The feature flag (FF) emit_sidekiq_histogram_metrics (introduced in !128706 (merged)) has already been disabled for GitLab.com. The dashboards should be prepared to consume either sidekiq_jobs_cpu_seconds_count or sidekiq_jobs_completion_count so that no data goes missing; this is handled in this runbooks MR.
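The "consume either series" idea from the note above can be sketched as follows. This is not the runbooks implementation: the samples hash and the job_count helper are hypothetical, and preferring the new counter over the old one is an assumption.

```ruby
# samples: hypothetical hash of series name => value for one label set.
# Prefer the new counter; fall back to the histogram's *_count series
# when the new counter is not present yet.
def job_count(samples)
  samples.fetch("sidekiq_jobs_completion_count") do
    samples["sidekiq_jobs_cpu_seconds_count"]
  end
end

old_style = { "sidekiq_jobs_cpu_seconds_count" => 5 }
new_style = { "sidekiq_jobs_completion_count" => 1 }

puts job_count(old_style)
puts job_count(new_style)
```

The helper returns nil when neither series is present, which a caller can treat as "no data".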

How to set up and validate locally

  1. Run a worker in the Rails console: Chaos::SleepWorker.perform_async(1)
  2. Check that the buckets exist:
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep bucket | grep sidekiq | grep -E 'cpu|db|gitaly|redis|elasticsearch'
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.25",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
  3. Turn off the feature flag: Feature.disable(:emit_sidekiq_histogram_metrics)
  4. Restart Sidekiq: gdk restart rails-background-jobs
  5. Run the worker again: Chaos::SleepWorker.perform_async(1)
  6. Check that no buckets are returned:
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep bucket | grep sidekiq | grep -E 'cpu|db|gitaly|redis|elasticsearch'
  7. Verify that sidekiq_jobs_completion_count is emitted:
❯ curl -s 'gdk.test:3807/metrics' | grep sidekiq_jobs_completion_count | grep Chaos::SleepWorker
sidekiq_jobs_completion_count{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 1
  8. Verify that the sum counters are emitted:
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep sidekiq | grep sum | grep -E 'cpu|db|gitaly|redis|elasticsearch'
sidekiq_elasticsearch_requests_duration_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0
sidekiq_jobs_cpu_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.0076420839999999934
sidekiq_jobs_db_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.0065749998092651364
sidekiq_jobs_gitaly_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0
sidekiq_redis_requests_duration_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.001622
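
The grep checks above can also be scripted. The sketch below is an optional helper, not part of this MR: it takes the text of the /metrics endpoint and reports how many resource bucket and sum series exist for a worker, and whether sidekiq_jobs_completion_count is present. The names summarize_metrics and metric_name are hypothetical, the sample uses a shortened label set, and the resource pattern is matched against the metric name only, which is slightly stricter than the grep.

```ruby
RESOURCE_PATTERN = /cpu|db|gitaly|redis|elasticsearch/

# Everything before the opening brace of the label set.
def metric_name(line)
  line[/\A[^{]+/]
end

def summarize_metrics(text, worker: "Chaos::SleepWorker")
  # Keep only Sidekiq series that belong to the given worker.
  lines = text.lines.map(&:strip).select do |line|
    line.start_with?("sidekiq") && line.include?(%(worker="#{worker}"))
  end
  resource_lines = lines.select { |line| metric_name(line).match?(RESOURCE_PATTERN) }

  {
    bucket_series: resource_lines.count { |line| metric_name(line).end_with?("_bucket") },
    sum_series: resource_lines.count { |line| metric_name(line).end_with?("_sum") },
    completion_count_present: lines.any? { |line| metric_name(line) == "sidekiq_jobs_completion_count" }
  }
end

# Shortened sample based on the output shown above (labels trimmed for brevity).
sample = <<~METRICS
  sidekiq_jobs_completion_count{queue="default",worker="Chaos::SleepWorker"} 1
  sidekiq_jobs_cpu_seconds_sum{queue="default",worker="Chaos::SleepWorker"} 0.0076420839999999934
  sidekiq_jobs_db_seconds_sum{queue="default",worker="Chaos::SleepWorker"} 0.0065749998092651364
METRICS

p summarize_metrics(sample)
```

With the feature flag disabled, bucket_series should be 0 while sum_series is non-zero and completion_count_present is true. To run it against a local GDK, the text could be fetched with Net::HTTP.get(URI("http://gdk.test:3807/metrics")) after require "net/http".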

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Gregorius Marco
