2020-05-14: Degraded performance on shared CI runners
Degraded performance on shared CI runners. Jobs might take a long time to start up.
All times UTC.
- 11:12 - Incident declared by t4cc0re in Slack via
- 13:30 - Decision to shift shared-runners-manager-3 and shared-runners-manager-5 to an alternate availability zone
- 14:20 - shared-runners-manager-3 starts using us-east-1c; shared-runners-manager-5 is still waiting for its running jobs in 1d to finish before it can shift
- Service(s) affected:
- Team attribution:
- Minutes downtime or degradation:
- Who was impacted by this incident? (e.g. external customers, internal customers)
- What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- YYYY-MM-DD XX:YY UTC: action X taken
- YYYY-MM-DD XX:YY UTC: action Y taken
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)