Investigate SLO violations in `MigrateExternalDiffsWorker`
Context
The `MigrateExternalDiffsWorker` Sidekiq worker frequently triggers the `SidekiqServiceWorkerExecutionErrorSLOViolation` alert.
This happens multiple times a day and risks causing alert fatigue, which could lead us to overlook real issues reported by the alerts.
Example alert posted in Slack
*SidekiqServiceWorkerExecutionErrorSLOViolation*
The MigrateExternalDiffsWorker Sidekiq worker, main stage, has an error rate violating SLO
The MigrateExternalDiffsWorker worker is not meeting its error-rate SLO.
Currently the error-rate is 94.78%.
:label: Labels :label:
alertname: SidekiqServiceWorkerExecutionErrorSLOViolation
aggregation: sidekiq_execution
alert_type: symptom
env: gprd
external_dependencies: no
queue: default
region: us-east1
shard: catchall
sli_type: error
stage: main
tier: sv
type: sidekiq
urgency: low
window: 6h
worker: MigrateExternalDiffsWorker
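Because this alert fires several times a day, it may help to tally occurrences by label when triaging. The following is a minimal sketch, assuming the alert text has been copied out of Slack as plain `key: value` lines; the function name `parse_alert_labels` is hypothetical and not part of any existing tooling.

```python
def parse_alert_labels(text: str) -> dict[str, str]:
    """Parse 'key: value' lines into a dict.

    Lines without a non-empty key and value (for example the
    ':label: Labels :label:' header) are skipped.
    """
    labels: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() and value.strip():
            labels[key.strip()] = value.strip()
    return labels


# A subset of the label lines from the example alert above.
alert_text = """\
:label: Labels :label:
alertname: SidekiqServiceWorkerExecutionErrorSLOViolation
env: gprd
shard: catchall
urgency: low
window: 6h
worker: MigrateExternalDiffsWorker
"""

labels = parse_alert_labels(alert_text)
print(labels["worker"], labels["shard"], labels["window"])
```

The resulting dictionary can be used to group copied alerts by `worker` or `shard` and see how often each combination fires.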
Goal
- Understand why this worker violates the SLO so frequently
- Next step: use Teleport to investigate, see #506833 (comment 2247604717)
- Remediate it on the spot if the fix is simple; otherwise create a follow-up issue
Confirmed findings
The SLO violation started on 2024-11-21 at 05:08 UTC.
Edited by François Rosé