2020-12-03: component shared_runner_queues of ci-runners service has an apdex-score outside of SLO
Summary
Context will be added here as we investigate.
Timeline
All times UTC.
2020-12-03
- 14:52 - CI Runner Apdex begins to fall
- 15:53 - cmcfarland declares incident in Slack
- 16:39 - CI Runner Apdex recovers
Corrective Actions
Incident Review
Summary
- Service(s) affected: CI Runners
- Team attribution:
- Time to detection:
- Minutes downtime or degradation: 110 minutes
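The degradation window can be cross-checked against the Timeline entries. The sketch below is a minimal illustration, assuming the three timestamps in the Timeline are the relevant boundaries; the dictionary keys are hypothetical names chosen for this example. The timestamps give 107 minutes of degradation, which is in line with the roughly 110 minutes reported above. The gap between the Apdex drop and the incident declaration is shown only for reference and is not necessarily the time to detection, since an alert may have fired before the incident was declared.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"

# Timestamps copied from the Timeline section (all UTC).
# The key names are hypothetical labels for this example.
timeline = {
    "apdex_begins_to_fall": datetime.strptime("2020-12-03 14:52", FMT),
    "incident_declared": datetime.strptime("2020-12-03 15:53", FMT),
    "apdex_recovers": datetime.strptime("2020-12-03 16:39", FMT),
}


def minutes_between(start, end):
    """Return the whole number of minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


degradation_minutes = minutes_between(
    timeline["apdex_begins_to_fall"], timeline["apdex_recovers"]
)
declaration_lag_minutes = minutes_between(
    timeline["apdex_begins_to_fall"], timeline["incident_declared"]
)

print(f"Degradation window: {degradation_minutes} minutes")
print(f"Apdex drop to incident declaration: {declaration_lag_minutes} minutes")
```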
Metrics
Customer Impact
- Who was impacted by this incident? (e.g. external customers, internal customers)
  - Any jobs using shared runners may have taken longer than normal to start.
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
  - Jobs took longer than normal to start, which customers may have perceived as generally slow CI pipelines for their projects.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
("5 Whys")
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- Regardless of what caused the surge in work, a small backlog of builds led to a much longer period in which CI jobs were slow to be processed. This suggests that the CI fleet lacks the capacity to absorb surges of jobs.
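The effect described in this lesson, a short surge producing a much longer period of slow job pickup, can be illustrated with a simple queue model. The sketch below is not based on measurements from this incident: the arrival rates, the fleet capacity, and the surge length are all hypothetical numbers, chosen only to show that when normal load sits close to capacity, a backlog drains far more slowly than it builds.

```python
def simulate_backlog(arrivals_per_min, capacity_per_min):
    """Return the number of queued jobs at the end of each minute.

    Each minute, new jobs arrive and the fleet starts at most
    `capacity_per_min` jobs; anything left over stays queued.
    """
    backlog = 0
    history = []
    for arrivals in arrivals_per_min:
        backlog = max(0, backlog + arrivals - capacity_per_min)
        history.append(backlog)
    return history


# Hypothetical figures, NOT taken from the incident data.
baseline = 90   # jobs arriving per minute under normal load
surge = 150     # jobs arriving per minute during the surge
capacity = 100  # jobs the fleet can start per minute

# 10 minutes of normal load, a 10-minute surge, then 80 minutes of normal load.
arrivals = [baseline] * 10 + [surge] * 10 + [baseline] * 80
history = simulate_backlog(arrivals, capacity)

peak_backlog = max(history)
minutes_with_backlog = sum(1 for queued in history if queued > 0)

print(f"Peak backlog: {peak_backlog} jobs")
print(f"Minutes with jobs waiting: {minutes_with_backlog}")
```

With these assumed numbers the backlog grows by 50 jobs per minute during the surge but shrinks by only 10 jobs per minute afterwards, so the queue outlasts the surge several times over. Extra headroom between normal load and capacity shortens that recovery.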
Guidelines
Edited by Cameron McFarland