Incident Review: CiRunnersServiceQueuingQueriesDurationApdexSLOViolation
Incident Review

The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other involved parties with whom we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and link this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact

- Who was impacted by this incident? (i.e. external customers, internal customers)
  - One external customer.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - High queuing time for all the jobs under the customer's group.
- How many customers were affected?
  - One.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - One customer, with over 600K delayed jobs.
- What were the root causes?
Incident Response Analysis

- How was the incident detected?
  - Automated tooling; monitoring page: CiRunnersServiceQueuingQueriesDurationApdexSLOViolation.
- How could detection time be improved?
  - N/A
- How was the root cause diagnosed?
  - First, direct DB queries were executed to determine the source of the pending jobs 👉 #17724 (comment 1816073131) (see the first sketch after this section).
  - It was then determined that only one group was affected.
  - @cms pinpointed the source security policy and an edit time correlating with the initial alert.
- How could time to diagnosis be improved?
  - Diagnosis time was March 15, 2024, 10:46 UTC, around 14 hours after the initial alert.
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - The user's runner was paused by SRE at March 15, 2024, 09:06 UTC.
  - We waited for an hour expecting stuck builds to be dropped; this didn't happen.
  - The team proceeded with writing a manual script to drop the pending builds; it was estimated to take around 7 hours to clean the queue (see the second sketch after this section).
  - Meanwhile, we asked the customer to disable the problematic feature.
  - The manual script, authored by @vshushlin and improved by @mbobin, was started by the EOC around 15:30 UTC.
  - Around 15:56 UTC the user confirmed disabling the offending policy.
- How could time to mitigation be improved?
  - A project/group-level "cancel all pipelines" option 👉 Cancel all pipelines for a project (gitlab-org/gitlab#16259)
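
The diagnosis above started from direct DB queries against the CI database. The exact queries live in #17724; the sketch below only illustrates the general shape of such a query, assuming the public gitlab-org/gitlab CI schema (a `ci_pending_builds` table with a `namespace_id` column) and a placeholder connection string.

```python
# Hypothetical diagnosis sketch -- not the exact query run during the
# incident. Table/column names follow the public gitlab-org/gitlab CI
# schema; the DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=gitlabhq_ci_production")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Count pending builds per namespace: one namespace towering over the
    # rest points at a single group as the source of the queue growth.
    cur.execute(
        """
        SELECT namespace_id, COUNT(*) AS pending_count
        FROM ci_pending_builds
        GROUP BY namespace_id
        ORDER BY pending_count DESC
        LIMIT 10
        """
    )
    for namespace_id, pending_count in cur.fetchall():
        print(f"namespace {namespace_id}: {pending_count} pending builds")
```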
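
The mitigation relied on a manual script to drop the roughly 600K pending builds. That script is not reproduced in this issue; the sketch below only shows the batched-update pattern such cleanups typically follow (small batches, `FOR UPDATE SKIP LOCKED`, pauses between batches). The status value, table names, and project id are placeholders, and a real cleanup would also have to keep queue state such as `ci_pending_builds` consistent.

```python
# Hypothetical batched-cleanup sketch -- not @vshushlin's actual script.
# Identifiers are placeholders; a real cleanup must also keep related
# queue tables (e.g. ci_pending_builds) in sync.
import time

import psycopg2

BATCH_SIZE = 1_000            # small batches keep row locks short-lived
PAUSE_SECONDS = 0.5           # breathing room between batches
OFFENDING_PROJECT_ID = 12345  # placeholder for the affected project

conn = psycopg2.connect("dbname=gitlabhq_ci_production")  # placeholder DSN

while True:
    with conn, conn.cursor() as cur:
        # Move one batch of pending builds to a terminal status.
        # SKIP LOCKED avoids blocking on rows the scheduler holds.
        cur.execute(
            """
            UPDATE ci_builds
            SET status = 'skipped'
            WHERE id IN (
                SELECT id
                FROM ci_builds
                WHERE status = 'pending' AND project_id = %s
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            )
            """,
            (OFFENDING_PROJECT_ID, BATCH_SIZE),
        )
        if cur.rowcount == 0:
            break  # queue drained
    time.sleep(PAUSE_SECONDS)
```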
Post Incident Analysis

- Did we have other events in the past with the same root cause?
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The incident was not triggered by a change; it was triggered by a naive workload (RootCause::Naive-Traffic).
What went well?
- Great inter-team collaboration.
- The root cause had been explained in detail in a previous context by @tmaczukin, and the related stage group was already working on the issue.
- User impact was very limited, mainly because the workload affected only one table; the long-term impact was also minimal, as the CI DB is separate from the main DB.