Discuss removal of histogram metrics on Sidekiq for self-managed
Context
In &700 (closed), we replaced the per-shard SLI, which aggregated both queueing and execution, with separate queueing and execution SLIs. The new queueing and execution SLIs are counter-based metrics defined as Application SLIs.
The former queueing and execution SLIs were derived from the histograms sidekiq_jobs_queue_duration_seconds and sidekiq_jobs_completion_seconds to produce the Apdex. Because these histograms have high cardinality, we stopped emitting them for GitLab.com in gitlab-org/gitlab!128706 (merged), behind an ops feature flag. References to queueing and execution latency in dashboards have been replaced with Kibana visualizations.
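To illustrate why the switch from histograms to counter-based SLIs reduces cardinality, here is a rough sketch. It relies on the general Prometheus behaviour that a histogram exposes one series per bucket (including +Inf) plus _sum and _count for every label combination, whereas a counter-based SLI needs only a success counter and a total counter. The function names, bucket counts, and label-set counts are hypothetical and not taken from GitLab's actual configuration, and the Apdex here is simplified to "satisfied over total" with no tolerated threshold.

```python
# Illustrative sketch only: all names and numbers below are hypothetical.

def histogram_series(label_sets: int, finite_buckets: int) -> int:
    """Series emitted by one Prometheus histogram.

    Each label combination yields one series per finite bucket,
    one for the +Inf bucket, plus the _sum and _count series.
    """
    return label_sets * (finite_buckets + 1 + 2)


def counter_sli_series(label_sets: int) -> int:
    """Series emitted by a counter-based SLI: a success counter and a total counter."""
    return label_sets * 2


def apdex_from_buckets(cumulative_buckets: dict[float, float], threshold: float) -> float:
    """Simplified Apdex from cumulative bucket counts.

    Ratio of observations at or below the threshold to all observations
    (the +Inf bucket holds the total count).
    """
    total = cumulative_buckets[float("inf")]
    return cumulative_buckets[threshold] / total


def apdex_from_counters(success_total: float, total: float) -> float:
    """Simplified Apdex from two counters: successes divided by all measurements."""
    return success_total / total


if __name__ == "__main__":
    # Assume 100 label combinations and 10 finite buckets (made-up values).
    print(histogram_series(100, 10))
    print(counter_sli_series(100))
    buckets = {1.0: 90.0, 10.0: 99.0, float("inf"): 100.0}
    print(apdex_from_buckets(buckets, 10.0))
    print(apdex_from_counters(99.0, 100.0))
```

Both approaches yield the same ratio when the counter's success condition matches the bucket threshold; the difference is that the counter-based form stores far fewer series, at the cost of fixing the threshold when the metric is emitted.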
Moving forward, the remaining histograms (sidekiq_jobs_cpu_seconds, sidekiq_jobs_db_seconds, sidekiq_jobs_gitaly_seconds, sidekiq_redis_requests_duration_seconds, and sidekiq_elasticsearch_requests_duration_seconds) will also be removed from GitLab.com. See #2297 (closed) for the list of metrics being audited.
Discussion
This issue aims to discuss:
- Since SM instances might still use the histograms, should we also stop emitting them for everyone?
From my understanding, SM instances can import dashboards from https://gitlab.com/gitlab-org/grafana-dashboards. A quick check of the repo shows no references to the histograms, but nothing prevents anyone from using the histograms in their own dashboards. So we might need to follow the formal deprecation/removal process for this, which we could target for the 17.0 release.
❯ rg 'sidekiq_jobs_completion_seconds|sidekiq_jobs_queue_duration_seconds|sidekiq_jobs_cpu_seconds|sidekiq_jobs_db_seconds|sidekiq_jobs_gitaly_seconds|sidekiq_redis_requests_duration_seconds|sidekiq_elasticsearch_requests_duration_seconds'
- Should we also replace these histograms for GET Hybrid?