# Sidekiq worker classes should be annotated with criticality

## Introduction
At present, all Sidekiq jobs use a single success-rate threshold for error alerting: 90%.
This threshold is a compromise: it needs to account for highly critical jobs, such as `post_receive`, while also handling non-critical jobs that may interact with unreliable external services (such as user-managed Kubernetes clusters).

Unfortunately, this compromise is bad in both cases: a 90% success rate for `post_receive` should be considered critical, while a 90% success rate for `gcp_cluster:cluster_wait_for_ingress_ip_address` would be fantastic.
In issues such as https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10901, EOCs receive alerts for non-critical jobs. At the same time, we may be missing critical situations for other jobs.
One option would be to add rules to our alerting configuration to segregate jobs into different classes. However, this approach would split the configuration between the team running the infrastructure and the team owning the jobs. In the past, this kind of split led to the poor state of the Sidekiq fleet, which ultimately led to the attribution work that makes the owners of Sidekiq jobs responsible for them.
## Proposal
Add another attribute to Sidekiq workers: `criticality`.

The values for this attribute could be `high`, `default`, and `low`.
This is similar to the concepts suggested in the [Alerting at Scale](https://landing.google.com/sre/workbook/chapters/alerting-on-slos/#alerting_at_scale) section of the SRE Workbook.
This could be added to the Prometheus labels for Sidekiq jobs.
Our alerting would use thresholds based on these values, for example:
| Criticality | Error Rate SLA |
|---|---|
| `criticality="high"` | 99.9% |
| `criticality="default"` | 90% |
| `criticality="low"` | 80% |
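As an illustration of how the thresholds in the table could feed into alerting, the sketch below shows a Prometheus rule that selects the SLA by `criticality` label. The metric names (`sidekiq_jobs_failed_total`, `sidekiq_jobs_completion_total`), label names, and severity are assumptions for illustration, not the actual GitLab configuration.

```yaml
# Illustrative sketch only: alert when high-criticality Sidekiq jobs
# exceed a 0.1% error rate (i.e. fall below the 99.9% SLA).
# Metric and label names here are hypothetical.
groups:
  - name: sidekiq-criticality
    rules:
      - alert: SidekiqHighCriticalityJobErrorRate
        expr: |
          (
            sum by (worker) (rate(sidekiq_jobs_failed_total{criticality="high"}[5m]))
            /
            sum by (worker) (rate(sidekiq_jobs_completion_total{criticality="high"}[5m]))
          ) > 0.001
        for: 10m
        labels:
          severity: critical
```

Equivalent rules for `criticality="default"` and `criticality="low"` would use `> 0.1` and `> 0.2` respectively.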
Over time, the hope would be that these thresholds could be raised.
This is how the attribute would be defined on the worker class:
```ruby
class PostReceive
  include ApplicationWorker

  feature_category :source_code_management
  urgency :high
  worker_resource_boundary :cpu
  criticality :high
end
```
This approach would be analogous to the `urgency` attribute, which defines latency thresholds: `urgency` governs latency expectations, while `criticality` would govern failure rates.
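To make the mechanics concrete, here is a minimal sketch of how such a class-level attribute could be implemented, analogous to the existing worker attributes. The module name `WorkerAttributes` and the validation logic are illustrative assumptions, not the actual GitLab implementation.

```ruby
# Hypothetical sketch of a `criticality` class attribute for workers.
# A real implementation would live alongside the existing worker
# attribute helpers (urgency, feature_category, etc.).
module WorkerAttributes
  VALID_CRITICALITIES = %i[high default low].freeze

  # Acts as both setter (when given a value) and getter (when called
  # with no arguments), defaulting to :default.
  def criticality(value = nil)
    if value
      unless VALID_CRITICALITIES.include?(value)
        raise ArgumentError, "unknown criticality: #{value}"
      end

      @criticality = value
    end

    @criticality || :default
  end
end

class PostReceive
  extend WorkerAttributes

  criticality :high
end
```

Validating the value at class-definition time means a typo such as `criticality :hgih` fails loudly in tests rather than silently falling back to the default alerting threshold.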