Make Sidekiq SLIs explorable in the error budget for stage groups dashboard
As @smcgivern brings up in #1365 (comment 728055809), sidekiq_execution isn't a real SLI: it's an aggregation we've tacked on in https://gitlab.com/gitlab-com/runbooks/blob/b08d1478fcc8b6ae05828710984da8976504da01/rules-jsonnet/sidekiq-feature-category-source-metrics.jsonnet#L9. These rules aggregate all executions into a single SLI, while the service metrics aggregate both the queueing and the execution into a single SLI per shard.
Queueing is not something that stage groups can easily influence, though it is somewhat affected by execution.
I think we should work to remove the disconnect between feature category recordings and service monitoring recordings, so Sidekiq is no longer a special case. &525 (closed) brought the puma component's feature category recordings in line with the service recordings; we could do the same for Sidekiq.

This will make Sidekiq explorable on the error budget detail dashboard for stage groups, including the breakdown by significant label, which includes the worker.
Proposal
Reuse the tools we built in &525 (closed) and move from a histogram for apdex measurements and counters for error rates to 8 counters defined as Application SLIs. This means the application would have the following counters:
- For execution error ratio
  - `gitlab_sli:sidekiq_job_execution:total`: incremented for all jobs
  - `gitlab_sli:sidekiq_job_execution:errors_total`: incremented on a job failure
- For execution apdex
  - `gitlab_sli:sidekiq_job_apdex:total`: incremented for all successful jobs
  - `gitlab_sli:sidekiq_job_apdex:success_total`: incremented if the job completed fast enough depending on the urgency defined on the worker class (10s for urgent jobs, 5 minutes for others)
- For queuing apdex
  - `gitlab_sli:sidekiq_job_queuing_apdex:total`: incremented for all jobs when they start
  - `gitlab_sli:sidekiq_job_queueing_apdex:success_total`: incremented if the queuing time was satisfactory depending on the urgency of the worker (10s for urgent jobs, 60s for low urgency, always increment for throttled jobs)
These counters should have sufficient labels to help stage groups investigate problems (external_dependencies, worker, feature_category, urgency, ...).
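To illustrate the urgency-based thresholds described above, here is a minimal sketch of the execution apdex bookkeeping. This is hypothetical code, not the actual GitLab implementation: the method name, label set, and in-memory hashes are stand-ins, while the thresholds (10s for urgent jobs, 5 minutes for others) come from the proposal itself.

```ruby
# Hypothetical sketch of the gitlab_sli:sidekiq_job_apdex:* bookkeeping.
# Thresholds from the proposal: 10s for urgent workers, 5 minutes otherwise.
EXECUTION_THRESHOLDS = { urgent: 10, default: 300 }.freeze

# In-memory stand-ins for the total and success counters; the real
# implementation would use Prometheus counters with labels such as
# worker, feature_category and urgency.
APDEX_TOTALS = Hash.new(0)
APDEX_SUCCESSES = Hash.new(0)

# Record one successful job execution against the apdex counters:
# total is always incremented, success only when the job was fast enough.
def record_execution_apdex(worker:, urgency:, duration_seconds:)
  labels = { worker: worker, urgency: urgency }
  APDEX_TOTALS[labels] += 1

  threshold = EXECUTION_THRESHOLDS.fetch(urgency, EXECUTION_THRESHOLDS[:default])
  APDEX_SUCCESSES[labels] += 1 if duration_seconds <= threshold
end

# A fast urgent job counts toward success; a slow one only toward total.
record_execution_apdex(worker: 'PostReceive', urgency: :urgent, duration_seconds: 4)
record_execution_apdex(worker: 'PostReceive', urgency: :urgent, duration_seconds: 25)
```

The error ratio counters would follow the same pattern, except the "bad" counter (`errors_total`) is incremented on failure rather than the "good" one on success.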
Then we can use these counters to generate 2 SLIs: one for execution (with apdex and error rate) and one for queueing (only apdex). When defining these SLIs on the Sidekiq service, we should label them with featureCategory: fromSourceMetrics
and remove the double recording that we currently have for stage group error budgets. (Discussion: should this be 2 SLIs per shard, or 2 SLIs in general while allowing a drill-down by shard & worker?)
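As a rough illustration of how the SLIs would derive from the counters (the aggregation labels here are assumptions; the real recording rules would live in the runbooks jsonnet and follow its naming conventions):

```promql
# Execution error ratio: failed jobs over all jobs
sum by (environment, feature_category, worker) (
  rate(gitlab_sli:sidekiq_job_execution:errors_total[5m])
)
/
sum by (environment, feature_category, worker) (
  rate(gitlab_sli:sidekiq_job_execution:total[5m])
)

# Execution apdex: jobs that finished within their urgency threshold over all jobs
sum by (environment, feature_category, worker) (
  rate(gitlab_sli:sidekiq_job_apdex:success_total[5m])
)
/
sum by (environment, feature_category, worker) (
  rate(gitlab_sli:sidekiq_job_apdex:total[5m])
)
```

The queueing apdex would be the analogous ratio over the `gitlab_sli:sidekiq_job_queuing_apdex` counters.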
When this is done, we can remove the high cardinality histograms and their recordings. We should also consider if we can remove the old counters.
Goal
When this is done, we'll have more detail available in the metrics to show stage groups which workers are affecting their error budget, while also reducing the cardinality of the metrics used.