Investigate SLO violations in `MigrateExternalDiffsWorker`

Context

The MigrateExternalDiffsWorker Sidekiq worker frequently triggers the SidekiqServiceWorkerExecutionErrorSLOViolation alert.

This happens multiple times a day and risks causing alert fatigue, which could lead us to overlook real issues reported by the alerts.

Example alert posted in Slack
*SidekiqServiceWorkerExecutionErrorSLOViolation*
The MigrateExternalDiffsWorker Sidekiq worker, main stage, has an error rate violating SLO
The MigrateExternalDiffsWorker worker is not meeting its error-rate SLO.
Currently the error-rate is 94.78%.

:label: Labels :label:
alertname: SidekiqServiceWorkerExecutionErrorSLOViolation
aggregation: sidekiq_execution
alert_type: symptom
env: gprd
external_dependencies: no
queue: default
region: us-east1
shard: catchall
sli_type: error
stage: main
tier: sv
type: sidekiq
urgency: low
window: 6h
worker: MigrateExternalDiffsWorker

Goal

  • Understand why this worker violates the SLO so frequently
  • If the fix is simple, remediate it right away; otherwise create a follow-up issue

Confirmed findings

The SLO violation started on 2024-11-21 at 05:08 UTC.

Edited by François Rosé