Incident Review: SaaS runners growing pending jobs queue
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics: Incident Review Agenda (internal link).
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Users who configured their jobs to use the saas-linux-small-amd64 runners: https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - A delay in pipeline job execution.
  - A delay in merging MRs and in pipeline verification.
- How many customers were affected?
  - It is difficult to say exactly how many users were affected.
  - Instead, we quantified the impact at the project level: 29,441 projects were affected.
What were the root causes?
Root cause
- One runner-manager VM failed; when this happens, we usually end up hammering GCP's API.
Co-factors
- Over-optimized autoscaling configuration, which caused a hidden issue of intermittent quota breaches in GCP (see the quota-check sketch below).
- Exacerbated by a deployment.
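As a reference for the quota co-factor, regional and project-level Compute Engine quota usage can be inspected from the CLI. This is a minimal sketch; the region below is a placeholder, not necessarily the one this fleet scales in.

```shell
# Regional Compute Engine quotas (usage vs. limit); us-east1 is a placeholder region.
gcloud compute regions describe us-east1 --format="yaml(quotas)"

# Project-wide quotas (CPUs, in-use addresses, etc.) live on the project resource.
gcloud compute project-info describe --format="yaml(quotas)"
```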
Incident Response Analysis
- How was the incident detected?
  - The EOC (engineer on call) happened to have a stuck job: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/jobs/4931806786
  - The EOC manually checked the metrics, and it then became clear there was wider impact.
- How could detection time be improved?
  - An alert on a growing pending jobs queue (see the sketch below).
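As a rough illustration of such an alert, a Prometheus alerting rule could watch the pending queue size and its trend. The metric name `ci_pending_builds`, the `shard` label, and the thresholds below are hypothetical placeholders rather than the metrics this fleet actually exposes.

```yaml
# Hypothetical alerting rule: fire when the pending jobs queue is large
# and still growing. Metric name, labels, and thresholds are placeholders.
groups:
  - name: ci-runners-queue
    rules:
      - alert: PendingJobsQueueGrowing
        expr: |
          sum by (shard) (ci_pending_builds) > 1000
          and
          deriv(sum by (shard) (ci_pending_builds)[15m:1m]) > 0
        for: 15m
        labels:
          severity: s2
        annotations:
          summary: "Pending CI jobs queue on shard {{ $labels.shard }} is growing"
```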
- How was the root cause diagnosed?
  - By SSHing into the enabled fleet and observing the errors (see the sketch below).
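For context, a minimal sketch of what observing those errors can look like, assuming the runner-manager VMs run gitlab-runner as a systemd service (the service name and log layout on this fleet may differ):

```shell
# On a runner-manager VM: follow the runner logs and filter for GCP quota errors.
# The unit name "gitlab-runner" is an assumption about this fleet's setup.
sudo journalctl -u gitlab-runner -f | grep -i quota
```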
- How could time to diagnosis be improved?
  - We were able to figure out the root cause quickly, so there is not much to improve here.
- How did we reach the point where we knew how to mitigate the impact?
  - We followed different paths of trial and error.
  - First, we ruled out the recent deployment as the cause.
  - Second, we brought the whole fleet to a halt to stop the quota-breach errors.
  - Third, we adjusted the autoscaling configuration to slow down the scale-up of the fleet (see the sketch after this list).
  - Only then did we start to slowly recover, with no quota-breach errors.
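For illustration, slowing down the scale-up of a docker+machine fleet is typically done through the autoscaling settings in the runner's config.toml. The values below are placeholders, not the settings applied during the incident.

```toml
# Sketch of the docker+machine autoscaling section (sits under a [[runners]] entry).
# All values are illustrative placeholders.
[runners.machine]
  # Cap how many new VMs are created in parallel, so a burst of pending jobs
  # does not turn into a burst of GCP API calls.
  MaxGrowthRate = 10
  # Keep a small pool of idle machines instead of creating everything on demand.
  IdleCount = 20
  # How long an idle machine is kept before removal, in seconds.
  IdleTime = 1800
  MachineDriver = "google"
  MachineName = "runner-%s"
```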
- How could time to mitigation be improved?
  - Having two separate GCP projects for each fleet (see the sketch below).
  - Currently each fleet has only one GCP project: its two runner-manager VMs share a single GCP project.
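A minimal sketch of what one-GCP-project-per-manager could look like with the docker+machine Google driver; the project ID, zone, and machine type are hypothetical placeholders.

```toml
# Runner-manager A points at its own project; manager B would use a second
# project ID, so a quota breach in one project does not block the other.
[runners.machine]
  MachineDriver = "google"
  MachineName = "runner-a-%s"
  MachineOptions = [
    "google-project=runners-fleet-a",     # placeholder project ID
    "google-zone=us-east1-c",             # placeholder zone
    "google-machine-type=n2d-standard-2", # placeholder machine type
  ]
```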
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Possibly, but at least not in the recent past.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No, but corrective actions were created.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, a deployment: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3910
What went well?
- Subject-matter experts were present from the beginning to the end of the incident, thanks @tmaczukin!
- We have learned a bit more about the product and how it interacts with the infrastructure.
- This affected only 1 shard out of 5.