Degraded performance on shared CI runners. Jobs might take a long time to start up.
Timeline
All times UTC.
2020-05-15
11:12 - Incident declared by t4cc0re in Slack via /incident declare command.
13:30 - Decision to shift shared-runners-manager-3 and -5 to an alternate availability zone
14:20 - shared-runners-manager-3 starts using us-east1-c; still waiting on jobs on shared-runners-manager-5 to finish in us-east1-d before it can shift
Incident Review
Service(s) affected:
Team attribution:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (e.g. external customers, internal customers)
What was the customer experience during the incident? (e.g. preventing them from doing X, incorrect display of Y, ...)
How many customers were affected?
If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
How was the event detected?
How could detection time be improved?
How did we reach the point where we knew how to mitigate the impact?
How could time to mitigation be improved?
Post Incident Analysis
How was the root cause diagnosed?
How could time to diagnosis be improved?
Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
It was already caught by the service Apdex alert. This is a good candidate to put into the runbook: check how these graphs look when pipeline queue timings behave in a strange or unknown way. Autoscaling problems directly affect our ability to take jobs from the queue, so looking here and being able to detect the known problems is useful.
Also take a look at the Machines creation timing graph in the row below. It is another good indicator that something is going wrong with our autoscaling.
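For reference, a rough way to spot-check the same signals directly on a runner manager, assuming the runner's built-in Prometheus metrics listener is enabled in config.toml; the port and the exact metric names below are assumptions based on the standard GitLab Runner autoscaling setup, not copied from our dashboards:

```shell
# Machine creation timing: slow or stalled creations point at a zone/quota problem.
curl -s http://localhost:9252/metrics \
  | grep gitlab_runner_autoscaling_machine_creation_duration_seconds

# Machine states: many machines stuck in creating/removing with few idle machines
# tells the same story as the dashboard graphs.
curl -s http://localhost:9252/metrics \
  | grep gitlab_runner_autoscaling_machine_states
```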
I've stopped chef-client and manually reconfigured shared-runners-manager-3.gitlab.com and shared-runners-manager-5.gitlab.com to use the us-east1-c zone for machine scheduling.
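For context, a minimal sketch of what that manual change looks like, assuming the standard docker+machine autoscaling setup; the file path, the option name, and the way chef-client is stopped are assumptions, not the exact commands run on the managers:

```shell
# Keep Chef from reverting the manual edit (however chef-client is scheduled on the host).
sudo systemctl stop chef-client

# The zone for autoscaled machines is a docker-machine GCP driver flag passed
# through MachineOptions in config.toml, roughly:
#   [runners.machine]
#     MachineOptions = ["google-project=...", "google-zone=us-east1-d", ...]
sudo sed -i 's/google-zone=us-east1-d/google-zone=us-east1-c/' /etc/gitlab-runner/config.toml

# GitLab Runner reloads config.toml when it changes, so machines created from now on
# are scheduled in us-east1-c; machines that already exist are unaffected.
```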
We still have a lot of Idle machines created by the previous configuration.
Both srm3 and srm5 are now paused on GitLab.com. We should execute /root/runner_upgrade.sh on both of them and unpause the runners in GitLab when it is done.
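For completeness, the pause/unpause part can also be done through the Runners API instead of the UI; a hedged sketch with a placeholder runner ID and admin token, and without showing the contents of /root/runner_upgrade.sh:

```shell
# Pause the runner so it stops accepting new jobs while it is restarted.
curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
  --form "active=false" "https://gitlab.com/api/v4/runners/<runner-id>"

# ... graceful restart via /root/runner_upgrade.sh runs here ...

# Unpause it once the restart has finished so it picks up jobs again.
curl --request PUT --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
  --form "active=true" "https://gitlab.com/api/v4/runners/<runner-id>"
```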
@steveazz is running /root/runner_upgrade.sh on both srm3 and srm5, and they are now unpaused on GitLab.com. When the graceful restart is done, the runners will start picking up new jobs again, using us-east1-c.
We are now waiting for the shared runner machines in the troubled zone to finish being destroyed. As they finish and get cleaned up, jobs will be picked up in us-east1-c and things should start returning to a better state.
@T4cC0re gitlab-shared-runners-manager-3.gitlab.com is still spawning machines in us-east1-d; we should pause it for now to stop it from creating machines.
@T4cC0re shared-runners-manager-3.gitlab.com is drained and all new machines will be created in us-east1-c. If you can unpause it in the GitLab UI, we should start to see some jobs being scheduled to it.
We are now beginning to see latency issues again with machine removal in the zone to which we had moved machines. We plan to move 50% of machine creation to us-east1-b to see whether shifting part of the load out of us-east1-c helps.
One of my pipelines is stuck on the pages deploy job. I canceled the previous job and retried the deploy, but the job still doesn't seem to complete. Could this be tied to the incident?
pages:deploy is a "virtual" job that doesn't use runners at all; it's handled by the GitLab Pages daemon. So while there may be some problem with it, it's definitely not caused by this incident.
@steveazz - no, I'm going to schedule a sync review, hopefully for next week. Another IR took the slot this week. I think @T4cC0re was going to start an incident review on this issue soon.
This incident was closed before the IncidentReview-Completed label was applied. The issue is being reopened so that it will appear on the Production Incidents board and can be moved through the entire incident management workflow. Please review the Incident Workflow section on the Incident Management handbook page for more information.