2020-09-07 degradation in sidekiq apdex, caused by a saturation of shard_urgent_cpu_bound, caused by performance issues with UpdateMergeRequestsWorker
Summary
2020-09-07 sidekiq errors
Degradation in sidekiq apdex, caused by saturation of shard_urgent_cpu_bound, which was in turn caused by performance issues with UpdateMergeRequestsWorker
Timeline
All times UTC.
2020-xx-xx
- we occasionally experience issues with UpdateMergeRequestsWorker, see: gitlab-org/gitlab#218410 (closed)
2020-09-07
- 09:05 - a particularly bad case of poor performance of this worker begins: these jobs run for a very long time (as already reported in the issue linked above), and other jobs in the shard_urgent_cpu_bound shard are enqueued behind them
- 09:06 - apdex begins to degrade
- 09:23 - the faulty jobs finish running; the queues are still long, and workers start working through the backlog
- 09:25 - the EOC is paged for an SLO alert; by this point the queues are almost empty and the workers have almost caught up with the backlog
- 09:27 - mwasilewski declares the incident in Slack using the /incident declare command
- 09:30 - apdex recovers, the alert clears
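The durations implied by the timeline above can be worked out with a short sketch. This is illustrative arithmetic only, not an official measurement: the timestamps are copied from the timeline, while the event key names and the helper `minutes_between` are hypothetical labels introduced here.

```python
from datetime import datetime

# Timestamps (UTC) copied from the 2020-09-07 timeline.
# The key names are hypothetical labels, not terms used in the report.
TIMELINE = {
    "slow_jobs_start": "09:05",
    "apdex_degrades": "09:06",
    "slow_jobs_finish": "09:23",
    "eoc_paged": "09:25",
    "incident_declared": "09:27",
    "apdex_recovers": "09:30",
}


def minutes_between(start_key, end_key, timeline=TIMELINE):
    """Return the whole minutes elapsed between two timeline events."""
    fmt = "%H:%M"
    start = datetime.strptime(timeline[start_key], fmt)
    end = datetime.strptime(timeline[end_key], fmt)
    return int((end - start).total_seconds() // 60)


# Apdex degradation window: from first degradation to recovery.
degradation_minutes = minutes_between("apdex_degrades", "apdex_recovers")
# Delay between the start of degradation and the EOC being paged.
paging_delay_minutes = minutes_between("apdex_degrades", "eoc_paged")
# How long the slow UpdateMergeRequestsWorker jobs ran.
slow_job_minutes = minutes_between("slow_jobs_start", "slow_jobs_finish")

print(f"degradation: {degradation_minutes} min")
print(f"time to page: {paging_delay_minutes} min")
print(f"slow jobs ran: {slow_job_minutes} min")
```

Under these assumptions, apdex was degraded for roughly 24 minutes, the page arrived about 19 minutes after degradation began, and the slow jobs ran for about 18 minutes. The timeline has one-minute resolution, so these figures are approximate.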
Incident Review
Summary
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Edited by Michal Wasilewski