Incident Review: CiRunnersServiceQueuingQueriesDurationApdexSLOViolation
Incident Review

The DRI for the incident review is the issue assignee.

- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other involved parties with whom we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and link this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact

- Who was impacted by this incident? (i.e. external customers, internal customers)
  - One external customer.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - High queuing time for all the jobs under the customer's group.
- How many customers were affected?
  - One.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - One customer, with over 600K delayed jobs.
- What were the root causes?
Incident Response Analysis

- How was the incident detected?
  - Automated tooling; monitoring page: CiRunnersServiceQueuingQueriesDurationApdexSLOViolation.
- How could detection time be improved?
  - N/A
- How was the root cause diagnosed?
  - First, direct DB queries were executed to determine the source of the pending jobs 👉 #17724 (comment 1816073131) (see the first sketch after this section).
  - It was then determined that only one group was affected.
  - @cms pinpointed the source security policy and an edit time correlating with the initial alert.
- How could time to diagnosis be improved?
  - Diagnosis time was March 15, 2024, 10:46 UTC, around 14 hours after the initial alert.
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - The user's runner was paused by SRE at March 15, 2024, 09:06 UTC.
  - We waited for an hour expecting stuck builds to be dropped; this didn't happen.
  - The team proceeded with writing a manual script to drop the pending builds; it was estimated to take around 7 hours to clean the queue (see the second sketch after this section).
  - Meanwhile, we asked the customer to disable the problematic feature.
  - The manual script, authored by @vshushlin and improved by @mbobin, was started by the EOC around 15:30 UTC.
  - Around 15:56 UTC the user confirmed disabling the offending policy.
- How could time to mitigation be improved?
  - A project/group-level "cancel all pipelines" option 👉 Cancel all pipelines for a project (gitlab-org/gitlab#16259)
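
The diagnosis above started from direct DB queries against the CI database. The exact queries live in #17724; the sketch below only illustrates the general shape of such a query, assuming the public gitlab-org/gitlab CI schema (a `ci_pending_builds` table with a `namespace_id` column) and a placeholder connection string.

```python
# Hypothetical diagnosis sketch -- not the exact query run during the
# incident. Table/column names follow the public gitlab-org/gitlab CI
# schema; the DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=gitlabhq_ci_production")  # placeholder DSN

with conn, conn.cursor() as cur:
    # Count pending builds per namespace: one namespace towering over the
    # rest points at a single group as the source of the queue growth.
    cur.execute(
        """
        SELECT namespace_id, COUNT(*) AS pending_count
        FROM ci_pending_builds
        GROUP BY namespace_id
        ORDER BY pending_count DESC
        LIMIT 10
        """
    )
    for namespace_id, pending_count in cur.fetchall():
        print(f"namespace {namespace_id}: {pending_count} pending builds")
```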
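
The mitigation relied on a manual script to drop the roughly 600K pending builds. That script is not reproduced in this issue; the sketch below only shows the batched-update pattern such cleanups typically follow (small batches, `FOR UPDATE SKIP LOCKED`, pauses between batches). The status value, table names, and project id are placeholders, and a real cleanup would also have to keep queue state such as `ci_pending_builds` consistent.

```python
# Hypothetical batched-cleanup sketch -- not @vshushlin's actual script.
# Identifiers are placeholders; a real cleanup must also keep related
# queue tables (e.g. ci_pending_builds) in sync.
import time

import psycopg2

BATCH_SIZE = 1_000            # small batches keep row locks short-lived
PAUSE_SECONDS = 0.5           # breathing room between batches
OFFENDING_PROJECT_ID = 12345  # placeholder for the affected project

conn = psycopg2.connect("dbname=gitlabhq_ci_production")  # placeholder DSN

while True:
    with conn, conn.cursor() as cur:
        # Move one batch of pending builds to a terminal status.
        # SKIP LOCKED avoids blocking on rows the scheduler holds.
        cur.execute(
            """
            UPDATE ci_builds
            SET status = 'skipped'
            WHERE id IN (
                SELECT id
                FROM ci_builds
                WHERE status = 'pending' AND project_id = %s
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            )
            """,
            (OFFENDING_PROJECT_ID, BATCH_SIZE),
        )
        if cur.rowcount == 0:
            break  # queue drained
    time.sleep(PAUSE_SECONDS)
```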
Post Incident Analysis

- Did we have other events in the past with the same root cause?
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The incident was not triggered by a change; it was triggered by a naive workload (RootCause::Naive-Traffic).
What went well?
- Great inter-team collaboration.
- The root cause had been explained in detail in a previous context by @tmaczukin, and the related stage group was already working on the issue.
- User impact was very limited, mainly because the workload affected only one table; the long-term impact was also minimal, as the CI DB is separate from the main DB.