Incident Review: SaaS runners growing pending jobs queue
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics: Incident Review Agenda (internal link).
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Users who configured their jobs to use the saas-linux-small-amd64 runners: https://docs.gitlab.com/ee/ci/pipelines/cicd_minutes.html
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - A delay in pipeline job execution.
  - A delay in merging MRs and in pipeline verification.
- How many customers were affected?
  - It is difficult to say exactly how many users were affected.
  - Instead, we quantified the impact at the project level: 29,441 projects were affected.
What were the root causes?
Root cause
- One runner-manager VM failed; when this happens, we usually end up hammering GCP's API.
Co-factors
- Over-optimized autoscaling configuration, which caused a hidden issue of intermittent quota breaches in GCP (see the quota-check sketch below).
- Exacerbated by a deployment.
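As a reference for the quota co-factor, regional and project-level Compute Engine quota usage can be inspected from the CLI. This is a minimal sketch; the region below is a placeholder, not necessarily the one this fleet scales in.

```shell
# Regional Compute Engine quotas (usage vs. limit); us-east1 is a placeholder region.
gcloud compute regions describe us-east1 --format="yaml(quotas)"

# Project-wide quotas (CPUs, in-use addresses, etc.) live on the project resource.
gcloud compute project-info describe --format="yaml(quotas)"
```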
Incident Response Analysis
- How was the incident detected?
  - The EOC (engineer on call) happened to have a stuck job: https://gitlab.com/gitlab-cookbooks/gitlab-server/-/jobs/4931806786
  - The EOC manually checked the metrics, and it then became clear there was wider impact.
- How could detection time be improved?
  - An alert on a growing pending jobs queue (see the sketch below).
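As a rough illustration of such an alert, a Prometheus alerting rule could watch the pending queue size and its trend. The metric name `ci_pending_builds`, the `shard` label, and the thresholds below are hypothetical placeholders rather than the metrics this fleet actually exposes.

```yaml
# Hypothetical alerting rule: fire when the pending jobs queue is large
# and still growing. Metric name, labels, and thresholds are placeholders.
groups:
  - name: ci-runners-queue
    rules:
      - alert: PendingJobsQueueGrowing
        expr: |
          sum by (shard) (ci_pending_builds) > 1000
          and
          deriv(sum by (shard) (ci_pending_builds)[15m:1m]) > 0
        for: 15m
        labels:
          severity: s2
        annotations:
          summary: "Pending CI jobs queue on shard {{ $labels.shard }} is growing"
```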
- How was the root cause diagnosed?
  - By SSHing into the enabled fleet and observing the errors (see the sketch below).
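For context, a minimal sketch of what observing those errors can look like, assuming the runner-manager VMs run gitlab-runner as a systemd service (the service name and log layout on this fleet may differ):

```shell
# On a runner-manager VM: follow the runner logs and filter for GCP quota errors.
# The unit name "gitlab-runner" is an assumption about this fleet's setup.
sudo journalctl -u gitlab-runner -f | grep -i quota
```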
- How could time to diagnosis be improved?
  - We were able to figure out the root cause quickly, so there is not much to improve here.
- How did we reach the point where we knew how to mitigate the impact?
  - We followed different paths of trial and error.
  - First, we ruled out the recent deployment as the cause.
  - Second, we brought the whole fleet to a halt to stop the quota-breach errors.
  - Third, we adjusted the autoscaling configuration to slow down the scale-up of the fleet (see the sketch after this list).
  - Only then did we start to slowly recover, with no quota-breach errors.
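For illustration, slowing down the scale-up of a docker+machine fleet is typically done through the autoscaling settings in the runner's config.toml. The values below are placeholders, not the settings applied during the incident.

```toml
# Sketch of the docker+machine autoscaling section (sits under a [[runners]] entry).
# All values are illustrative placeholders.
[runners.machine]
  # Cap how many new VMs are created in parallel, so a burst of pending jobs
  # does not turn into a burst of GCP API calls.
  MaxGrowthRate = 10
  # Keep a small pool of idle machines instead of creating everything on demand.
  IdleCount = 20
  # How long an idle machine is kept before removal, in seconds.
  IdleTime = 1800
  MachineDriver = "google"
  MachineName = "runner-%s"
```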
- How could time to mitigation be improved?
  - Having two separate GCP projects for each fleet (see the sketch below).
  - Currently each fleet has only one GCP project: its two runner-manager VMs share a single GCP project.
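A minimal sketch of what one-GCP-project-per-manager could look like with the docker+machine Google driver; the project ID, zone, and machine type are hypothetical placeholders.

```toml
# Runner-manager A points at its own project; manager B would use a second
# project ID, so a quota breach in one project does not block the other.
[runners.machine]
  MachineDriver = "google"
  MachineName = "runner-a-%s"
  MachineOptions = [
    "google-project=runners-fleet-a",     # placeholder project ID
    "google-zone=us-east1-c",             # placeholder zone
    "google-machine-type=n2d-standard-2", # placeholder machine type
  ]
```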
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Possibly, but at least not in the recent past.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No, but corrective actions were created.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, a deployment: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3910
What went well?
- Subject-matter experts were present from the beginning to the end of the incident, thanks @tmaczukin!
- We have learned a bit more about the product and how it interacts with the infrastructure.
- This affected only 1 shard out of 5.