Migrate docker-shared-runners-manager-X from CoreOS to Google COS

Production Change

Change Summary

We're continuing the rollout of Google Container-Optimized OS (COS) as the replacement for the CoreOS image used by our runners. The reasons for the migration are described in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504.

This issue covers migrating the gitlab-docker-shared-runners-manager-X runners, which completes the rollout.

Change Details

Services Impacted - Service::CI Runners
Change Technician - @tmaczukin
Change Reviewer - @steveazz
Time tracking - 16 minutes
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 1

Set label change::in-progress on this issue

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5

Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/386 and wait for the pipeline to apply changes in chef
Run knife ssh -C1 -afqdn 'roles:org-ci-base-runner' -- sudo chef-client to force configuration update

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5

Confirm that there is no increase in failures on the runners (see the Monitoring section below)

Log in to one of the runner managers, switch to the root account, list the Docker Machine VMs to find a recently created one, SSH to it, and confirm that it is running Google COS:

sudo -i
ls -lsctr ~/.docker/machine/machines
docker-machine ssh [...] cat /etc/os-release

Expected entries:

NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"

Unexpected entries:

NAME="Container Linux by CoreOS"
ID=coreos

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

(handle in this order!)

Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/387
Wait for the pipeline for the last revert to apply the changes in Chef, then run
```
knife ssh -C1 -afqdn 'roles:org-ci-base-runner' -- sudo chef-client
```
to force configuration update

Monitoring

Key metrics to observe

Metric: Failures on GitLab Inc. runners
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=23&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared-gitlab-org&var-runner_manager=gitlab-docker-shared-runners-manager-01.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-02.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-03.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-04.gitlab.com.&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
- What changes to this metric should prompt a rollback: A significant increase in the number of failures, especially system_runner_failure and unknown_failure
Metric: Autoscaled VMs states, Autoscaling VM operation rate, Autoscaling VMs creation timing
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-autoscaling/ci-runners-incident-support-autoscaling?orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared-gitlab-org&var-runner_manager=gitlab-docker-shared-runners-manager-04.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-03.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-02.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-01.gitlab.com.&var-jobs_running_for_project=0&var-gcp_exporter=shared-runners-manager-3.gitlab.com:9393&var-gcp_project=All&var-gcp_region=All
- What changes to this metric should prompt a rollback: abnormalities in the autoscaling handling patterns
Logs: Error rates on a specific host
- Location: https://log.gprd.gitlab.net/goto/ff8e6844dec5f0b302d98741b7d7cc9b
- What changes to this metric should prompt a rollback: Large spike in failures
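
When comparing failure counts read off the dashboards before and after the change, a relative increase is easier to judge than raw numbers. The following is a small sketch under stated assumptions: `failure_increase_pct` is a hypothetical helper, the two counts are entered by hand from the panels above, and no rollback threshold is implied, since this issue does not define what counts as "significant".

```shell
# Hypothetical helper: integer percentage change from a baseline failure
# count to a current one. Both arguments are whole numbers read manually
# from the dashboard; nothing is queried automatically.
failure_increase_pct() {
  baseline=$1
  current=$2
  # A zero baseline has no meaningful relative change.
  if [ "$baseline" -le 0 ]; then
    echo "n/a"
    return 1
  fi
  # Integer arithmetic: the result is truncated toward zero.
  echo $(( (current - baseline) * 100 / baseline ))
}

# Example: 40 failures in the window before the change, 55 after.
#   failure_increase_pct 40 55
```

Whether a given percentage warrants a rollback remains the change technician's call.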

Summary of infrastructure changes

~~Does this change introduce new compute instances?~~
~~Does this change re-size any existing compute instances?~~
~~Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?~~

Changes checklist

Edited Aug 17, 2021 by Steve Xuereb