2021-03-03: The shared_runner_queues SLI of the ci-runners service (`main` stage) has an apdex violating SLO
Summary
From approximately 08:00 to 16:00 UTC there was a significant drop in apdex for the CI runners, which resulted in some CI jobs remaining in the pending state for longer than normal.
The root cause was CI abuse, which is further explained in confidential issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12741.
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-03
- 09:07 - EOC declares incident in Slack.
- 09:43 - EOC pings the designated engineer from the Verify:Runner secondary on-call.
- 09:48 - We suspect a feature flag is contributing to the apdex drop.
- 09:53 - We disable the feature flag; no apdex improvement.
- 10:03 - We find that the usual suspects (GCP quotas, database timings, etc.) aren't showing anything unusual in the graphs.
- 10:18 - We suspect CI abuse is causing the degradation.
- 10:28 - We find a couple of users who are abusing CI.
- 10:32 - Blocking the users from the admin UI fails.
- 10:52 - We succeed in blocking some users; we find more abusers.
- 11:02 - More abusers are discovered.
- 11:40 - We find an abuser whose activity correlates with the drop in apdex.
- 12:00 - To be able to block users quickly, we decide to block them in a Rails console with patched code to speed up the blocking.
- 12:15 - We compile a list of all the abusers we want to block and eventually block them.
- 12:30 - We continue to monitor the situation.
- 14:02 - Tomasz takes srm3 out of the pool to clean up all docker-machine leftovers and reconfigure it to create 1000 idle machines.
- 15:53 - srm3 is back in the pool.
- 16:29 - More abusers are found and blocked.
- 16:34 - The service has recovered, but we're still monitoring.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- https://gitlab.com/gitlab-org/gitlab/-/issues/323341 (confidential)
- 500 errors when blocking via the UI/API due to pipeline cancellations https://gitlab.com/gitlab-org/gitlab/-/issues/323039, https://gitlab.com/gitlab-org/gitlab/-/issues/301222
- Adjust auto-scaling to avoid hitting api limits https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5107
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
- Service(s) affected: CI Runners
- Team attribution: Infra, Verify:Runner
- Time to detection: 60 minutes
- Minutes downtime or degradation: 480
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - GitLab.com customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - CI jobs were in the pending state for a considerably long time.
- How many customers were affected?
  - An exact number is hard to come by, but if we consider jobs stuck in the pending state for over 5 minutes as the criterion for degradation, then 13068 users were affected. Obtained by this query: `select count(distinct(user_id)) from ci_builds inner join users on users.id = ci_builds.user_id where users.state != 'blocked' and ci_builds.created_at > '2021-03-03 08:00:00' and ci_builds.created_at < '2021-03-03 16:00:00' and (ci_builds.started_at - ci_builds.created_at) > '5 minutes'::interval;`
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - N/A
- What were the root causes?
  - CI abuse from many users
Incident Response Analysis
- How was the incident detected?
  - An alert was triggered.
- How could detection time be improved?
  - Perhaps by raising the apdex alerting threshold to something above 28%: we had a degradation for over an hour that we didn't catch because the apdex never dropped below 28%.
- How was the root cause diagnosed?
  - After excluding the usual suspects (GCP quotas, database timings, ...), we looked at namespaces that scheduled a large number of jobs (see the sketch query below).
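The exact query used during diagnosis wasn't recorded in the review; the following is a minimal sketch of that kind of check, assuming the same production schema (`ci_builds`, `projects`) as the query in the Customer Impact section. The `LIMIT` is illustrative.

```sql
-- Sketch: namespaces that scheduled the most jobs during the incident window.
SELECT projects.namespace_id, count(*) AS jobs_created
FROM ci_builds
INNER JOIN projects ON projects.id = ci_builds.project_id
WHERE ci_builds.created_at BETWEEN '2021-03-03 08:00:00' AND '2021-03-03 16:00:00'
GROUP BY projects.namespace_id
ORDER BY jobs_created DESC
LIMIT 20;
```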
- How could time to diagnosis be improved?
  - Better alerting on this kind of abuse, e.g. new users creating an unusual number of jobs (see the sketch query below).
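As a starting point for such alerting, a periodic check along these lines could flag recently created accounts with unusually high job counts. This is a sketch only; the 7-day account age, 1-hour window, and 500-job threshold are illustrative assumptions, not tuned values.

```sql
-- Sketch: recently created users scheduling an unusual number of jobs in the last hour.
SELECT users.id, users.username, count(*) AS jobs_last_hour
FROM ci_builds
INNER JOIN users ON users.id = ci_builds.user_id
WHERE users.created_at > now() - interval '7 days'
  AND ci_builds.created_at > now() - interval '1 hour'
GROUP BY users.id, users.username
HAVING count(*) > 500
ORDER BY jobs_last_hour DESC;
```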
- How did we reach the point where we knew how to mitigate the impact?
  - Once the root cause was known, finding abusers and blocking them was relatively easy.
- How could time to mitigation be improved?
  - N/A
- What went well?
  - Collaboration with the Verify team was excellent, as usual.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).
