2022-10-29: Error burn rate exceeding SLO for sidekiq low-urgency CPU-bound shard
The DRI for this incident is the incident issue assignee; see roles and responsibilities.
For the roles assigned when the incident was declared, see the Timelines tab. For timeline feedback, see the dogfooding issue. To save time entering timeline events, use the quick action /timeline.
Current Status
This appears to be a false alarm. The sidekiq service is healthy and available, but a single project and user are running a large number of jobs that are failing for what appears to be a legitimate reason. So far I see no indication of a systemic problem that would spill over to affect other unrelated jobs or service availability.
In more detail:
Within the "low-urgency and cpu-bound" group of job classes, the rate of job failures really is spiking, but it is due to a single project's recurring (but failing) attempts to perform a GitHub-to-GitLab import. Specifically, the step that looks up GitHub users via API calls is repeatedly failing to find certain users, and each such failure counts as a sidekiq job failure:
- Stack trace: #7959 (comment 1153475153)
- Job details (internal only): #7959 (comment 1153477398)
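As a rough illustration of the check described above (whether failures are concentrated in one project rather than spread across the shard), here is a minimal Python sketch. The record layout, the `project` and `job_class` keys, and the sample values are all hypothetical; they are not taken from the actual job logs or any GitLab tooling.

```python
from collections import Counter


def dominant_failure_source(failed_jobs, key="project"):
    """Return (value, share) for the most common `key` among failed jobs.

    `failed_jobs` is an iterable of dicts. The field names are assumptions
    made for this sketch, not a real log schema.
    """
    counts = Counter(job[key] for job in failed_jobs)
    if not counts:
        return None, 0.0
    value, count = counts.most_common(1)[0]
    return value, count / sum(counts.values())


# Hypothetical sample: three of four failures come from one project.
sample_failures = [
    {"project": "example-group/importer", "job_class": "ExampleImportWorker"},
    {"project": "example-group/importer", "job_class": "ExampleImportWorker"},
    {"project": "example-group/importer", "job_class": "ExampleImportWorker"},
    {"project": "other-group/app", "job_class": "ExampleOtherWorker"},
]

project, share = dominant_failure_source(sample_failures)
print(f"{project} accounts for {share:.0%} of failures")
```

A share close to 100% for one project suggests the alert reflects that project's workload rather than a shard-wide problem; a flat distribution would point the other way.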
📚 References and helpful links
Recent Events (available internally only):
- Deployments ❙ Feature Flag Changes ❙ Gitlab.com Latest Updates
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information laid out in our handbook page. Any such confidential data will be in a linked issue that is only visible internally. By default, all information we can share will be public, in accordance with our transparency value.