Raise sidekiq contractual error rate threshold to 0.1

Why should we change this contractual threshold?

  1. Aligning user perception with reality. Sidekiq traffic is unlike synchronous traffic such as web or api. Sidekiq work is asynchronous and can be retried, and some jobs are not expected to succeed every time. Synchronous traffic consists of single requests that are expected to succeed without retrying. It is important for us to know when Sidekiq jobs are not completing, but such failures may have no noticeable impact on users of GitLab.com. As a result, our SLA metric does not match users' perception of whether work is being completed without issue.

  2. Our SLA metrics do not align with our efforts to improve Sidekiq work via notifications and corrective actions/infradev issues. With our current threshold of 0.005, we are recording large violations of our SLA while we neither alert EOCs of a problem nor record incidents. Our alerting thresholds seem to be more in line with the perception of Sidekiq's efficacy than the SLA metric is. And even if they are not, our alerting thresholds are the measure by which we decide where to focus resources and when to bring our corrective action (or infradev) process to bear on issues that are affecting users of GitLab.com. Our contractual thresholds should be, universally, a more conservative measure of errors than our alerting thresholds.
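
As a rough illustration of the gap described above, the sketch below compares a made-up Sidekiq error ratio against the current (0.005) and proposed (0.1) contractual thresholds. Only the two threshold values come from this MR; the function names, the job counts, and the idea that a violation is a simple "ratio above threshold" check are assumptions for illustration, not how the SLA metric is actually computed.

```python
# Thresholds taken from this MR description.
CURRENT_THRESHOLD = 0.005
PROPOSED_THRESHOLD = 0.1


def error_ratio(failed_jobs: int, total_jobs: int) -> float:
    """Fraction of jobs that failed (hypothetical helper, not a real metric name)."""
    if total_jobs == 0:
        return 0.0  # no traffic means no errors to report
    return failed_jobs / total_jobs


def violates_threshold(ratio: float, threshold: float) -> bool:
    """Assumed rule: a violation is an error ratio strictly above the threshold."""
    return ratio > threshold


# Made-up numbers: 2,000 failed attempts out of 100,000 jobs, i.e. a 2% error ratio.
ratio = error_ratio(failed_jobs=2_000, total_jobs=100_000)

print(violates_threshold(ratio, CURRENT_THRESHOLD))   # counted against the SLA today
print(violates_threshold(ratio, PROPOSED_THRESHOLD))  # not counted under the proposal
```

Under these assumptions, a 2% ratio of failed attempts (many of which may succeed on retry) counts as an SLA violation at 0.005 but not at 0.1.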

Related Issue: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24603

Edited by Cameron McFarland