Migrate docker-shared-runners-manager-X from CoreOS to Google COS
Production Change
Change Summary
We're continuing the rollout of Google Container-Optimized OS (COS) as the replacement for the CoreOS image used by our runners. A detailed description of why this is required can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504.
This issue covers migrating the gitlab-docker-shared-runners-manager-X runners, which finalizes the rollout.
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @tmaczukin
- Change Reviewer - @steveazz
- Time tracking - 16 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1
- Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/386 and wait for the pipeline to apply the changes in Chef
- Run knife ssh -C1 -afqdn 'roles:org-ci-base-runner' -- sudo chef-client to force a configuration update
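The forced Chef run can be wrapped so that the exact command is printed before it is executed. This is only a sketch: the function name force_chef_run and the DRY_RUN variable are hypothetical and not part of the existing tooling; the knife invocation itself is copied from the step above.

```shell
# Hypothetical helper: run chef-client on every node with the given role,
# one node at a time (-C1), connecting via the node's fqdn attribute (-afqdn).
force_chef_run() {
  role="${1:-org-ci-base-runner}"
  # Build the same command line used in the change steps.
  set -- knife ssh -C1 -afqdn "roles:${role}" -- sudo chef-client
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only show what would be executed.
    echo "$*"
  else
    "$@"
  fi
}
```

Running it once with DRY_RUN=1 allows the technician and reviewer to confirm the role query before the command touches any runner manager.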
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5
- Confirm that there is no increase in failures on the runners (see the Monitoring section below)
- Log in to one of the runner managers, switch to the root account, list the recently created Docker Machine VMs, SSH to one of them and confirm that it is using Google COS:
sudo -i
ls -lsctr ~/.docker/machine/machines
docker-machine ssh [...] cat /etc/os-release
Expected entries:
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
Unexpected entries:
NAME="Container Linux by CoreOS"
ID=coreos
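The manual comparison of /etc/os-release entries can also be scripted. The following is a minimal sketch, assuming the os-release content is piped in on stdin; the function name is_cos is hypothetical and only the ID field is checked.

```shell
# Hypothetical helper: read os-release content on stdin and succeed
# only when the ID field is "cos" (Container-Optimized OS).
is_cos() {
  # Print the value of the line that starts with "ID=", drop optional quotes,
  # and keep only the first match.
  id=$(sed -n 's/^ID=//p' | tr -d '"' | head -n 1)
  [ "$id" = "cos" ]
}
```

On a runner manager, the output of the docker-machine ssh [...] cat /etc/os-release command from the step above could be piped into is_cos; a non-zero exit status would indicate that the VM is still running CoreOS or another image.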
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
(handle in this order!)
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/387
- Wait for the pipeline for the revert to apply the changes in Chef, then run knife ssh -C1 -afqdn 'roles:org-ci-base-runner' -- sudo chef-client to force a configuration update
Monitoring
Key metrics to observe
- Metric: Failures on GitLab Inc. runners
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=23&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared-gitlab-org&var-runner_manager=gitlab-docker-shared-runners-manager-01.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-02.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-03.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-04.gitlab.com.&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
- What changes to this metric should prompt a rollback: A significant increase in the number of failures, especially system_runner_failure and unknown_failure
- Metric: Autoscaled VMs states, Autoscaling VM operation rate, Autoscaling VMs creation timing
- Location: https://dashboards.gitlab.net/d/ci-runners-incident-autoscaling/ci-runners-incident-support-autoscaling?orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared-gitlab-org&var-runner_manager=gitlab-docker-shared-runners-manager-04.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-03.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-02.gitlab.com.&var-runner_manager=gitlab-docker-shared-runners-manager-01.gitlab.com.&var-jobs_running_for_project=0&var-gcp_exporter=shared-runners-manager-3.gitlab.com:9393&var-gcp_project=All&var-gcp_region=All
- What changes to this metric should prompt a rollback: Abnormalities in the autoscaling patterns
- Logs: Error rates on a specific host
- Location: https://log.gprd.gitlab.net/goto/ff8e6844dec5f0b302d98741b7d7cc9b
- What changes to this metric should prompt a rollback: Large spike in failures
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.