2021-03-03: The shared_runner_queues SLI of the ci-runners service (`main` stage) has an apdex violating SLO
Summary
From approximately 08:00 to 16:00 UTC there was a significant drop in apdex for the CI runners, which resulted in some CI jobs remaining in the pending state for longer than normal.
The root cause was CI abuse, which is further explained in confidential issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12741.
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-03
- 09:07 - EOC declares incident in Slack.
- 09:43 - EOC pings the designated engineer from the Verify:Runner secondary on-call.
- 09:48 - We suspect a feature flag is contributing to the apdex drop.
- 09:53 - We disable the feature flag; no apdex improvement.
- 10:03 - We find that the usual suspects (GCP quotas, database timings, etc.) aren't showing anything unusual in the graphs.
- 10:18 - We suspect CI abuse is causing the degradation.
- 10:28 - We find a couple of users who are abusing CI.
- 10:32 - Blocking the users from the admin UI fails.
- 10:52 - We succeed in blocking some users; we find more abusers.
- 11:02 - More abusers are discovered.
- 11:40 - We find an abuser whose activity correlates with the drop in apdex.
- 12:00 - To be able to block users quickly, we decide to block them in a Rails console with patched code to speed up the blocking.
- 12:15 - We compile a list of all the abusers we want to block and eventually block them.
- 12:30 - We continue to monitor the situation.
- 14:02 - Tomasz takes srm3 out of the pool to clean up all docker-machine leftovers and reconfigure it to create 1000 idle machines.
- 15:53 - srm3 is back in the pool.
- 16:29 - More abusers are found and blocked.
- 16:34 - The service has recovered, but we're still monitoring.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- https://gitlab.com/gitlab-org/gitlab/-/issues/323341 (confidential)
- 500 errors when blocking via the UI/API due to pipeline cancellations https://gitlab.com/gitlab-org/gitlab/-/issues/323039, https://gitlab.com/gitlab-org/gitlab/-/issues/301222
- Adjust auto-scaling to avoid hitting api limits https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5107
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: CI Runners
- Team attribution: Infra, Verify:Runner
- Time to detection: 60 minutes
- Minutes downtime or degradation: 480
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab.com customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - CI jobs were in the pending state for a considerably long time.
- How many customers were affected?
  - An exact number is hard to come by, but if we consider jobs stuck in the pending state for over 5 minutes as the criterion for degradation, then 13068 users were affected. Obtained by this query: `select count(distinct(user_id)) from ci_builds inner join users on users.id = ci_builds.user_id where users.state != 'blocked' and ci_builds.created_at > '2021-03-03 08:00:00' and ci_builds.created_at < '2021-03-03 16:00:00' and (ci_builds.started_at - ci_builds.created_at) > '5 minutes'::interval;`
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - N/A
- What were the root causes?
  - CI abuse from many users
Incident Response Analysis
- How was the incident detected?
  - An alert was triggered.
- How could detection time be improved?
  - Perhaps by raising the apdex alerting threshold to something above 28%: we had a degradation for over an hour that we didn't catch because the apdex never dropped below 28%.
- How was the root cause diagnosed?
  - After excluding the usual suspects (GCP quotas, database timings, ...), we looked at namespaces that scheduled a large number of jobs (see the sketch query below).
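The exact query used during diagnosis wasn't recorded in the review; the following is a minimal sketch of that kind of check, assuming the same production schema (`ci_builds`, `projects`) as the query in the Customer Impact section. The `LIMIT` is illustrative.

```sql
-- Sketch: namespaces that scheduled the most jobs during the incident window.
SELECT projects.namespace_id, count(*) AS jobs_created
FROM ci_builds
INNER JOIN projects ON projects.id = ci_builds.project_id
WHERE ci_builds.created_at BETWEEN '2021-03-03 08:00:00' AND '2021-03-03 16:00:00'
GROUP BY projects.namespace_id
ORDER BY jobs_created DESC
LIMIT 20;
```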
- How could time to diagnosis be improved?
  - Better alerting on this kind of abuse, e.g. new users creating an unusual number of jobs (see the sketch query below).
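As a starting point for such alerting, a periodic check along these lines could flag recently created accounts with unusually high job counts. This is a sketch only; the 7-day account age, 1-hour window, and 500-job threshold are illustrative assumptions, not tuned values.

```sql
-- Sketch: recently created users scheduling an unusual number of jobs in the last hour.
SELECT users.id, users.username, count(*) AS jobs_last_hour
FROM ci_builds
INNER JOIN users ON users.id = ci_builds.user_id
WHERE users.created_at > now() - interval '7 days'
  AND ci_builds.created_at > now() - interval '1 hour'
GROUP BY users.id, users.username
HAVING count(*) > 500
ORDER BY jobs_last_hour DESC;
```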
- How did we reach the point where we knew how to mitigate the impact?
  - Once the root cause was known, finding abusers and blocking them was relatively easy.
- How could time to mitigation be improved?
  - N/A
- What went well?
  - Collaboration with the Verify team was excellent, as usual.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).
