Incident Review: Windows Shared Runners are offline
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @mention the EOC, IMOC and other parties who were involved that we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works the best for those involved, in the agenda put a link to this review issue. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it.
  - Ensure that the meeting is recorded, when complete upload the recording to GitLab unfiltered.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - customers using Windows shared runners
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Windows CI jobs were not running
- How many customers were affected?
  - we don't have this metric
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - the overall number of Windows CI jobs is low (https://thanos.gitlab.net/graph?g0.expr=sum(gitlab_runner_jobs%7Benv%3D%22gprd%22%2C%20shard%3D%22windows-shared%22%2C%20state%3D%22running%22%7D)&g0.tab=0&g0.stacked=0&g0.range_input=2d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D&g0.end_input=2024-04-10%2009%3A50%3A19&g0.moment_input=2024-04-10%2009%3A50%3A19)
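
The Thanos link above is hard to read in its URL-encoded form. As a minimal sketch, the following Python snippet decodes the `g0.expr` parameter to show the PromQL expression behind the graph. The URL is shortened to the parameters needed here, and the helper name `extract_promql` is only illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the Thanos link from the impact estimate;
# only the expression parameter matters for this sketch.
THANOS_URL = (
    "https://thanos.gitlab.net/graph?g0.expr=sum(gitlab_runner_jobs%7Benv%3D%22gprd%22"
    "%2C%20shard%3D%22windows-shared%22%2C%20state%3D%22running%22%7D)&g0.tab=0"
)


def extract_promql(url: str) -> str:
    """Return the percent-decoded PromQL expression from a Thanos graph URL."""
    params = parse_qs(urlsplit(url).query)
    return params["g0.expr"][0]


print(extract_promql(THANOS_URL))
# sum(gitlab_runner_jobs{env="gprd", shard="windows-shared", state="running"})
```

The expression sums the running jobs reported for the `windows-shared` shard in the `gprd` environment, which is the series used for the impact estimate.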
What were the root causes?
- the runner service didn't start after server restart - #17808 (comment 1854943878)
Incident Response Analysis
- How was the incident detected?
  - it was reported by a customer
- How could detection time be improved?
  - add alerting for this service
- How was the root cause diagnosed?
  - by checking system logs on the windows server
- How could time to diagnosis be improved?
  - add status monitoring for runner manager service
- How did we reach the point where we knew how to mitigate the impact?
  - SREs with knowledge of CI runners were able to restart the service.
- How could time to mitigation be improved?
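
One possible shape for the status monitoring suggested above is a small check that asks the Windows Service Control Manager for the state of the runner service. The sketch below is an illustration, not the monitoring that was actually implemented: the service name `gitlab-runner` is an assumption (this review does not state it), and the helper names are hypothetical. It relies on the standard `sc.exe query` output, which contains a `STATE` line such as `STATE : 4  RUNNING`.

```python
import re
import subprocess
from typing import Optional

# Assumed service name; the actual name on the runner managers is not stated in this review.
SERVICE_NAME = "gitlab-runner"

# Matches the state line of `sc.exe query`, e.g. "        STATE              : 4  RUNNING".
_STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_service_state(sc_output: str) -> Optional[str]:
    """Extract the state name (RUNNING, STOPPED, ...) from `sc.exe query` output."""
    match = _STATE_RE.search(sc_output)
    return match.group(1) if match else None


def query_service_state(name: str = SERVICE_NAME) -> Optional[str]:
    """Run `sc.exe query <name>` (Windows only) and return the parsed state."""
    result = subprocess.run(
        ["sc.exe", "query", name], capture_output=True, text=True, check=False
    )
    return parse_service_state(result.stdout)


def is_healthy(state: Optional[str]) -> bool:
    """Treat anything other than RUNNING, including an unknown state, as unhealthy."""
    return state == "RUNNING"
```

A scheduled task or monitoring agent on the runner manager could call `is_healthy(query_service_state())` and raise an alert when it returns `False`, which would cover the case where the service does not start after a server restart.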
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - no
What went well?
- With help from other SREs, we were able to quickly identify both the problem and its root cause.
Guidelines
Edited by Jan Provaznik