2020-12-09: The shared_runner_queues SLI of the ci-runners service has an apdex violating SLO
Summary
A PostgreSQL database slowdown caused CI job preparation and pick-up to be slow. This caused a shortfall in running jobs (and long waits), which then turned into a backlog once the database was no longer slow. While that backlog was worked off, jobs continued to appear stuck waiting to start.
Timeline
All times UTC.
2020-12-03
- 14:52 - CI Runner Apdex begins to fall
- 15:53 - cmcfarland declares incident in Slack.
- 16:39 - CI Runner Apdex recovers
2020-12-09
- 17:56 - EOC receives the "The shared_runner_queues SLI of the ci-runners service has an apdex violating SLO" alert
- 17:58 - cindy declares incident in Slack.
- 18:01 - Alert is resolved
Corrective Actions
Incident Review
Summary
- Service(s) affected: CI Jobs
- Team attribution:
- Time to detection:
- Minutes downtime or degradation: 120 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Anyone waiting for jobs to start or update status.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Jobs might appear to sit at the pause symbol longer than normal, and completed jobs might take a while to show a green checkmark.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - This is difficult to quantify precisely. The best estimate is that 8,737 requests went unanswered out of 5,648,135 requests served during the same time frame, an error rate of roughly 0.15% (see the sketch below).
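A minimal sketch of that estimate, assuming the unanswered and served request counts quoted above come from the same monitoring window:

```python
# Rough error-rate estimate for the incident window, using the figures quoted above.
unanswered_requests = 8_737
served_requests = 5_648_135

error_rate = unanswered_requests / served_requests
print(f"Estimated error rate: {error_rate:.2%}")  # prints ~0.15%
```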
What were the root causes?
Incident Response Analysis
- How was the incident detected?
  - The on-call engineer was notified of a poor CI Runner Apdex.
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - We did not reach that point during the incident. In both incidents, we investigated, but the problem resolved without intervention.
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Incident Review Stakeholders
Edited by Cameron McFarland