2020-10-28: (RESOLVED) gitlab-docker-shared-runners-manager-X not available since yesterday
Summary
THE PROBLEM IS ALREADY RESOLVED
Adding this issue to leave a record of the reported incident.
As part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11391 we've patched Docker Machine to be able to set specific tags and labels for all autoscaled VMs created by our Runner Managers. Yesterday, with the chef-repo merge request https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4341, we've changed the configuration of the `gitlab-shared-runners-manager-X` and `gitlab-docker-shared-runners-manager-X` managers to start using this new setting.
Unfortunately, while for `gsrmX` this worked without a problem, it didn't for the `gdsrmX` managers. The new managers are running in a separate GCP project where a different set of permissions was added to the service account that the managers use. This ended with VM creation failures happening all of the time, with the following error being raised:
Error creating machine: Error in driver during machine creation: googleapi: Error 403: Required 'compute.instances.setLabels' permission for 'projects/gitlab-org-ci-0d24e2/zones/us-east1-c/instances/runner-pvr9xbdq-org-ci-1603895319-3a720557', forbidden
The error started to be raised just after the new configuration was applied by `chef-client` on the `gdsrmX` hosts (in the middle of the day yesterday).
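One way to catch this kind of mismatch earlier is to check, for the exact project named in the error, whether the credentials the Runner Managers use actually hold the permissions the new setting requires. Below is a minimal sketch of such a check, assuming the `google-api-python-client` package and that the script runs with the managers' service account credentials; the permission list is an assumption derived from the error above.

```python
# Minimal sketch: verify that the credentials used by a Runner Manager hold
# the Compute Engine permissions needed for labeling autoscaled VMs.
# Assumes google-api-python-client and that this runs with the managers'
# service account credentials (e.g. application default credentials on a
# gdsrmX host). The permission list is an assumption based on the 403 error.
from googleapiclient import discovery

PROJECT_ID = "gitlab-org-ci-0d24e2"  # project named in the 403 error
REQUIRED_PERMISSIONS = [
    "compute.instances.create",
    "compute.instances.setLabels",
]

crm = discovery.build("cloudresourcemanager", "v1")
response = crm.projects().testIamPermissions(
    resource=PROJECT_ID,
    body={"permissions": REQUIRED_PERMISSIONS},
).execute()

granted = set(response.get("permissions", []))
missing = sorted(set(REQUIRED_PERMISSIONS) - granted)
if missing:
    print(f"missing permissions in {PROJECT_ID}: {', '.join(missing)}")
else:
    print(f"all required permissions are granted in {PROJECT_ID}")
```

Running a check like this as part of the rollout (before enabling a new Docker Machine option in a different project) could have surfaced the missing `compute.instances.setLabels` permission before any jobs were affected.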
Please note that the purpose of the `gdsrmX` Runner Managers is to handle Docker-in-Docker jobs for the community forks of GitLab products. So while the service was down and - as these are instance-level runners - the managers were not available for users, the incident did not affect the regular Shared Runners service that we provide with GitLab.com.
The strange behavior of the `gdsrmX` managers (not picking up any jobs) was noticed today and reported around 14:00 UTC, and it was fixed within 1.5 hours by reverting the whole MR (even though the change for `gsrmX` was fully working) and additionally restarting GitLab Runner on the `gdsrmX` hosts to stop and clean up any existing Docker Machine VM creations.
Further actions, especially:
- how to fix the permission problem,
- how to re-enable the configuration,
- how to add alerting for similar cases,
are being discussed outside of this issue.
Timeline
All times UTC.
2020-10-27

- 16:51 - https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4341 is merged
- 16:51 to 16:59 - CI jobs applying all changed roles to the Chef Server are executed
- somewhere after 16:51 (within up to 30 minutes) - `chef-client` updates the configuration of the `gdsrmX` Runner Managers and within the next minute the GitLab Runner process re-reads the configuration file. From this point on, every new VM creation attempt fails because of the missing permission
2020-10-28

- 14:04 - the strange behavior of the `gdsrmX` managers is reported at https://gitlab.slack.com/archives/CB3LSMEJV/p1603893860092100, then raised in the Verify EM Slack channel (https://gitlab.slack.com/archives/C014UTH4W02/p1603893881000300)
- 14:16 - the issue is brought to the GitLab Runner group Slack channel (https://gitlab.slack.com/archives/CBQ76ND6W/p1603894563239700)
- 14:34 - the root cause is found in the logs
- 14:43 - the MR that reverts the problematic changes is created: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/4467
- 14:44 - the revert MR is merged; Chef roles of the problematic hosts are updated on the Chef Server
- 14:45 to 15:27 - configuration on all four hosts is updated and the GitLab Runner process is restarted
- 15:27 - the problem is resolved; all four problematic `gdsrmX` Runner Managers are picking up jobs again
Incident Review
Summary
For ~24 hours part of our Shared Runners fleet - the `gitlab-docker-shared-runners-manager-X` managers provided to handle Docker-in-Docker jobs of community forks of GitLab products - was not picking up jobs because of a new configuration and missing GCP permissions.
- Service(s) affected: CI/CD Runners (the `gitlab-docker-shared-runners-manager-X` managers)
- Team attribution: GitLab Runner team, SREs
- Minutes downtime or degradation: ~24-30 hours -> 80k-100k minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  Internal customers (we use these Runner Managers ourselves in our jobs), GitLab.com users working on community contributions to our products, and other GitLab.com users that could use these Runners.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  For internal customers, reduced CI/CD capacity for a limited number of jobs (those using Docker-in-Docker) - apart from these instance runners we also use group runners at the `gitlab-org` group level, and these were not affected. For external users (both community contributors and others who happened to use these managers), there was no shared runner that could handle this specific type of job, which most probably blocked whole pipelines on such jobs.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
  Reported by a team member at https://gitlab.slack.com/archives/CB3LSMEJV/p1603893860092100, who spotted some support requests reporting this and a community contribution MR that was blocked by a stalled pipeline (with a job stuck because of the missing runner).
- How could detection time be improved?
  We're discussing a way to detect problems with Docker Machine autoscaled VM creation that we could measure and create alerting rules for (see the sketch after this list).
- How did we reach the point where we knew how to mitigate the impact?
  By checking the logs and understanding what the error means. The way to fix it became obvious after that 🙂
- How could time to mitigation be improved?
  IMHO the problem was mitigated in the right time. In this specific case nothing could have been done faster.
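As an illustration of the kind of detection signal being discussed, the sketch below tails the Runner log, counts lines containing the machine-creation error, and exposes the count as a Prometheus counter that an alerting rule could fire on. The log path, metric name, and port are assumptions made for the example, not the actual setup; the real implementation will likely build on metrics that GitLab Runner already exposes.

```python
# Minimal sketch of a detection signal for Docker Machine VM creation
# failures: follow the Runner log and export a Prometheus counter.
# The log path, port, and metric name below are assumptions for the example.
import time

from prometheus_client import Counter, start_http_server

LOG_PATH = "/var/log/gitlab-runner/runner.log"   # hypothetical log location
FAILURE_MARKER = "Error creating machine"        # string seen in this incident

machine_creation_failures = Counter(
    "docker_machine_creation_failures_total",
    "Number of failed Docker Machine VM creation attempts",
)

def follow(path):
    """Yield lines appended to the file, similar to `tail -f`."""
    with open(path) as log:
        log.seek(0, 2)  # start from the current end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

if __name__ == "__main__":
    start_http_server(9402)  # scrape endpoint on a hypothetical port
    for line in follow(LOG_PATH):
        if FAILURE_MARKER in line:
            machine_creation_failures.inc()
```

An alerting rule could then fire when, for example, `rate(docker_machine_creation_failures_total[5m])` stays above zero on any manager for more than a few minutes.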
Post Incident Analysis
- How was the root cause diagnosed?
  - checking the GitLab.com admin area and confirming that the Runner Managers hadn't been in contact since yesterday
  - checking the metrics and confirming that there was no strange measurement of the uptime (the first idea was a panic failure loop, but the metrics rejected this idea)
  - checking the logs to see what was happening with the Runner process on the reported Runner Managers
  After the log entry mentioning the missing GCP permission was found, it became obvious what caused the issue.
- How could time to diagnosis be improved?
  With the metric mentioned above we could skip the first two steps and immediately know that we have a problem with autoscaling. However, this would not be a huge improvement (checking the status of the Runners in the GitLab.com admin panel and checking the uptime metric took maybe a minute). Anyway, finding a good way to alert about VM autoscaling problems is definitely a good thing to have.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  No. We should create an issue for that.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
  Yes. The connected issue and merge request are linked above.
Timeline
- See the detailed timeline in the Timeline section at the top of this issue.
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)