Replace sidekiq resource usage histograms with sum
What does this MR do and why?
In an effort to audit unused Sidekiq metrics in gitlab-com/gl-infra/scalability#2297 (closed), we're replacing the histogram metrics for CPU, DB, Gitaly, Redis, and ES duration with a simple sum counter, which already existed implicitly as part of the histogram metrics. Histogram metrics are inherently high in cardinality, and these don't provide much accuracy anyway.
This MR also adds sidekiq_jobs_completion_count as a counter that is incremented after each job completion. This counter is meant to replace the 5 *_count series, one from each histogram (CPU, DB, Gitaly, Redis, ES), so we're saving around 400k series; see gitlab-com/gl-infra/scalability#2297 (comment 1580045270).
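As a rough illustration of where the savings come from, the sketch below counts the series emitted per label combination before and after the change. The bucket count of 5 is a hypothetical value chosen for illustration; the real histograms may define a different number of buckets, and the method names are made up for this example.

```ruby
# Hypothetical bucket count, for illustration only.
BUCKETS_PER_HISTOGRAM = 5
RESOURCES = %w[cpu db gitaly redis elasticsearch].freeze

# With histograms, each resource emits its buckets plus one _sum and one _count.
def series_with_histograms(buckets: BUCKETS_PER_HISTOGRAM)
  RESOURCES.size * (buckets + 2)
end

# With plain sums, each resource emits one _sum, and all resources share
# a single sidekiq_jobs_completion_count.
def series_with_sums
  RESOURCES.size + 1
end

puts series_with_histograms # 35
puts series_with_sums       # 6
```

These numbers are per label combination (worker, queue, urgency, and so on), so the total difference scales with the number of such combinations.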
The sum / count ratio gives the average duration of resource usage per job (CPU, DB, Gitaly, Redis, ES duration).
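A minimal sketch of that calculation, assuming the sum and the completion count have already been read for the same label combination. The helper name average_duration is hypothetical, and a dashboard would more likely divide rates over a time window than raw counter values.

```ruby
# Average seconds per job = total seconds spent / number of completed jobs.
# Returns nil when no job has completed, to avoid dividing by zero.
def average_duration(sum_seconds, completion_count)
  return nil if completion_count.zero?

  sum_seconds / completion_count.to_f
end

# Values taken from the local validation output in this MR description.
cpu_sum = 0.0076420839999999934 # sidekiq_jobs_cpu_seconds_sum
completions = 1                 # sidekiq_jobs_completion_count

puts average_duration(cpu_sum, completions)
```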
More context: gitlab-com/gl-infra/scalability#2297 (comment 1569512620)
Note:
- Histogram metrics implicitly emit 3 series: *_bucket, *_sum and *_count. The count for the 5 metrics is being used in GitLab.com dashboards to display averages, here for example.
- The FF emit_sidekiq_histogram_metrics (introduced in !128706 (merged)) has already been disabled for GitLab.com. The dashboards should be prepared to consume either sidekiq_jobs_cpu_seconds_count or sidekiq_jobs_completion_count to prevent missing data; this is handled in this runbooks MR.
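The "consume either count" idea can be sketched as a simple fallback. This is only an illustration in Ruby, not the actual dashboard query from the runbooks MR: the samples Hash and the job_count helper are hypothetical, and preferring the new counter over the old one is an assumption.

```ruby
# Picks the job count to divide by for one label combination.
# `samples` is a hypothetical Hash of metric name => value.
# Prefers the new counter and falls back to the histogram-derived count,
# so data is present whether or not the feature flag is enabled.
def job_count(samples)
  samples.fetch('sidekiq_jobs_completion_count') do
    samples['sidekiq_jobs_cpu_seconds_count']
  end
end

puts job_count('sidekiq_jobs_completion_count' => 3)  # 3
puts job_count('sidekiq_jobs_cpu_seconds_count' => 5) # 5
```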
How to set up and validate locally
- Run a worker in the Rails console:
Chaos::SleepWorker.perform_async(1)
- Check the buckets exist:
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep bucket | grep sidekiq | grep -E 'cpu|db|gitaly|redis|elasticsearch'
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_elasticsearch_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_cpu_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_db_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_jobs_gitaly_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="2.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="+Inf",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.1",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.25",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
sidekiq_redis_requests_duration_seconds_bucket{boundary="",external_dependencies="no",feature_category="not_owned",job_status="done",le="0.5",queue="default",urgency="low",worker="Chaos::SleepWorker"} 5
- Turn off FF
Feature.disable(:emit_sidekiq_histogram_metrics)
- Restart sidekiq
gdk restart rails-background-jobs
- Run the worker again
Chaos::SleepWorker.perform_async(1)
- No buckets returned
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep bucket | grep sidekiq | grep -E 'cpu|db|gitaly|redis|elasticsearch'
- Verify sidekiq_jobs_completion_count is emitted:
❯ curl -s 'gdk.test:3807/metrics' | grep sidekiq_jobs_completion_count | grep Chaos::SleepWorker
sidekiq_jobs_completion_count{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 1
- Verify sum counters are emitted:
❯ curl -s 'gdk.test:3807/metrics' | grep Chaos::SleepWorker | grep sidekiq | grep sum | grep -E 'cpu|db|gitaly|redis|elasticsearch'
sidekiq_elasticsearch_requests_duration_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0
sidekiq_jobs_cpu_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.0076420839999999934
sidekiq_jobs_db_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.0065749998092651364
sidekiq_jobs_gitaly_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0
sidekiq_redis_requests_duration_seconds_sum{boundary="",external_dependencies="no",feature_category="not_owned",queue="default",urgency="low",worker="Chaos::SleepWorker"} 0.001622
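The manual grep checks above could also be scripted. The following is a minimal sketch, assuming the text of the /metrics endpoint has already been fetched into a string; the helper names are hypothetical, and the sample input uses abbreviated label sets for brevity.

```ruby
# Returns the unique metric names (without labels) exposed for a worker.
def metric_names_for(metrics_text, worker)
  metrics_text.each_line
              .select { |line| line.include?(%(worker="#{worker}")) }
              .map { |line| line[/\A[a-zA-Z_:][a-zA-Z0-9_:]*/] }
              .compact
              .uniq
end

# True when no resource-usage buckets are emitted, at least one resource
# sum is emitted, and sidekiq_jobs_completion_count is present.
def histograms_disabled?(metrics_text, worker)
  names = metric_names_for(metrics_text, worker)
  resource = /cpu|db|gitaly|redis|elasticsearch/

  no_buckets = names.none? { |n| n.end_with?('_bucket') && n.match?(resource) }
  has_sums = names.any? { |n| n.end_with?('_sum') && n.match?(resource) }

  no_buckets && has_sums && names.include?('sidekiq_jobs_completion_count')
end

# Abbreviated sample of the output shown above.
sample = <<~METRICS
  sidekiq_jobs_completion_count{queue="default",worker="Chaos::SleepWorker"} 1
  sidekiq_jobs_cpu_seconds_sum{queue="default",worker="Chaos::SleepWorker"} 0.0076
METRICS

puts histograms_disabled?(sample, 'Chaos::SleepWorker') # true
```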
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- I have evaluated the MR acceptance checklist for this MR.