2020-10-28: (RESOLVED) gitlab-docker-shared-runners-manager-X not available since yesterday
Summary
THE PROBLEM IS ALREADY RESOLVED
Adding this issue to leave a record of the reported incident.
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11391 we've patched Docker Machine to be able to set specific tags and labels for all autoscaled VMs created by our Runner Managers. Yesterday, with the chef-repo merge request https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4341, we've changed the configuration of the `gitlab-shared-runners-manager-X` and `gitlab-docker-shared-runners-manager-X` managers to start using this new setting.
Unfortunately, while for `gsrmX` this worked without a problem, it didn't for the `gdsrmX` managers. The new managers are running in a separate GCP project where a different set of permissions was added to the service account that the managers use. This ended with VM creation failures happening all of the time, with the following error being raised:
Error creating machine: Error in driver during machine creation: googleapi: Error 403: Required 'compute.instances.setLabels' permission for 'projects/gitlab-org-ci-0d24e2/zones/us-east1-c/instances/runner-pvr9xbdq-org-ci-1603895319-3a720557', forbidden
The error started to be raised just after the new configuration was applied by `chef-client` on the `gdsrmX` hosts (in the middle of the day yesterday).
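One way to catch this kind of mismatch earlier is to check, for the exact project named in the error, whether the credentials the Runner Managers use actually hold the permissions the new setting requires. Below is a minimal sketch of such a check, assuming the `google-api-python-client` package and that the script runs with the managers' service account credentials; the permission list is an assumption derived from the error above.

```python
# Minimal sketch: verify that the credentials used by a Runner Manager hold
# the Compute Engine permissions needed for labeling autoscaled VMs.
# Assumes google-api-python-client and that this runs with the managers'
# service account credentials (e.g. application default credentials on a
# gdsrmX host). The permission list is an assumption based on the 403 error.
from googleapiclient import discovery

PROJECT_ID = "gitlab-org-ci-0d24e2"  # project named in the 403 error
REQUIRED_PERMISSIONS = [
    "compute.instances.create",
    "compute.instances.setLabels",
]

crm = discovery.build("cloudresourcemanager", "v1")
response = crm.projects().testIamPermissions(
    resource=PROJECT_ID,
    body={"permissions": REQUIRED_PERMISSIONS},
).execute()

granted = set(response.get("permissions", []))
missing = sorted(set(REQUIRED_PERMISSIONS) - granted)
if missing:
    print(f"missing permissions in {PROJECT_ID}: {', '.join(missing)}")
else:
    print(f"all required permissions are granted in {PROJECT_ID}")
```

Running a check like this as part of the rollout (before enabling a new Docker Machine option in a different project) could have surfaced the missing `compute.instances.setLabels` permission before any jobs were affected.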
Please note that the purpose of the `gdsrmX` Runner Managers is to handle Docker-in-Docker jobs for the community forks of GitLab products. So while the service was down and - as these are instance-level runners - the managers were not available for users, the incident did not affect the regular Shared Runners service that we provide with GitLab.com.
The strange behavior of the `gdsrmX` managers (not picking up any jobs) was noticed today and reported around 14:00 UTC, and it was fixed within 1.5 hours by reverting the whole MR (even though the change for `gsrmX` was fully working) and additionally restarting GitLab Runner on the `gdsrmX` hosts to stop and clean up any existing Docker Machine VM creations.
Further actions, especially:
- how to fix the permission problem,
- how to re-enable the configuration,
- how to add alerting for similar cases,
are being discussed outside of this issue.
Timeline
All times UTC.
2020-10-27

- 16:51 - https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4341 is merged
- 16:51 to 16:59 - CI jobs applying all changed roles to the Chef Server are executed
- somewhere after 16:51 (within up to 30 minutes) - `chef-client` updates the configuration of the `gdsrmX` Runner Managers and within the next minute the GitLab Runner process re-reads the configuration file. From this point on, every new VM creation attempt fails because of the missing permission
2020-10-28

- 14:04 - the strange behavior of the `gdsrmX` managers is reported at https://gitlab.slack.com/archives/CB3LSMEJV/p1603893860092100, then raised in the Verify EM Slack channel (https://gitlab.slack.com/archives/C014UTH4W02/p1603893881000300)
- 14:16 - the issue is brought to the GitLab Runner group Slack channel (https://gitlab.slack.com/archives/CBQ76ND6W/p1603894563239700)
- 14:34 - the root cause is found in the logs
- 14:43 - the MR that reverts the problematic changes is created: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4467
- 14:44 - the revert MR is merged; Chef roles of the problematic hosts are updated on the Chef Server
- 14:45 to 15:27 - configuration on all four hosts is updated and the GitLab Runner process is restarted
- 15:27 - the problem is resolved; all four problematic `gdsrmX` Runner Managers are picking up jobs again
Incident Review
Summary
For ~24 hours part of our Shared Runners fleet - the `gitlab-docker-shared-runners-manager-X` managers provided to handle Docker-in-Docker jobs of community forks of GitLab products - was not picking up jobs because of a new configuration and missing GCP permissions.
- Service(s) affected: CI/CD Runners (the `gitlab-docker-shared-runners-manager-X` managers)
- Team attribution: GitLab Runner team, SREs
- Minutes downtime or degradation: ~24-30 hours -> 80k-100k minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  Internal customers (we use these Runner Managers ourselves in our jobs), GitLab.com users working on community contributions to our products, and other GitLab.com users that could use these Runners.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  For internal customers, reduced CI/CD capacity for a limited number of jobs (those using Docker-in-Docker) - apart from these instance runners we also use group runners at the `gitlab-org` group level, and these were not affected. For external users (both community contributors and others who happened to use these managers), there was no shared runner that could handle this specific type of job, which most probably blocked whole pipelines on such jobs.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
  Reported by a team member at https://gitlab.slack.com/archives/CB3LSMEJV/p1603893860092100, who spotted some support requests reporting this and a community contribution MR that was blocked by a stalled pipeline (with a job stuck because of the missing runner).
- How could detection time be improved?
  We're discussing a way to detect problems with Docker Machine autoscaled VM creation that we could measure and create alerting rules for (see the sketch after this list).
- How did we reach the point where we knew how to mitigate the impact?
  By checking the logs and understanding what the error means. The way to fix it became obvious after that 🙂
- How could time to mitigation be improved?
  IMHO the problem was mitigated in the right time. In this specific case nothing could have been done faster.
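As an illustration of the kind of detection signal being discussed, the sketch below tails the Runner log, counts lines containing the machine-creation error, and exposes the count as a Prometheus counter that an alerting rule could fire on. The log path, metric name, and port are assumptions made for the example, not the actual setup; the real implementation will likely build on metrics that GitLab Runner already exposes.

```python
# Minimal sketch of a detection signal for Docker Machine VM creation
# failures: follow the Runner log and export a Prometheus counter.
# The log path, port, and metric name below are assumptions for the example.
import time

from prometheus_client import Counter, start_http_server

LOG_PATH = "/var/log/gitlab-runner/runner.log"   # hypothetical log location
FAILURE_MARKER = "Error creating machine"        # string seen in this incident

machine_creation_failures = Counter(
    "docker_machine_creation_failures_total",
    "Number of failed Docker Machine VM creation attempts",
)

def follow(path):
    """Yield lines appended to the file, similar to `tail -f`."""
    with open(path) as log:
        log.seek(0, 2)  # start from the current end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

if __name__ == "__main__":
    start_http_server(9402)  # scrape endpoint on a hypothetical port
    for line in follow(LOG_PATH):
        if FAILURE_MARKER in line:
            machine_creation_failures.inc()
```

An alerting rule could then fire when, for example, `rate(docker_machine_creation_failures_total[5m])` stays above zero on any manager for more than a few minutes.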
Post Incident Analysis
- How was the root cause diagnosed?
  - checking the GitLab.com admin area and confirming that the Runner Managers hadn't been in contact since yesterday
  - checking the metrics and confirming that there was no strange measurement of the uptime (the first idea was a panic failure loop, but the metrics rejected this idea)
  - checking the logs to see what was happening with the Runner process on the reported Runner Managers
  After the log entry mentioning the missing GCP permission was found, it became obvious what caused the issue.
- How could time to diagnosis be improved?
  With the metric mentioned above we could skip the first two steps and immediately know that we have a problem with autoscaling. However, this would not be a huge improvement (checking the status of the Runners in the GitLab.com admin panel and checking the uptime metric took maybe a minute). Anyway, finding a good way to alert about VM autoscaling problems is definitely a good thing to have.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  No. We should create an issue for that.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
  Yes. The connected issue and merge request are linked above.
Timeline
- See the detailed timeline in the Timeline section at the top of this issue.
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)