Migrate shared-runners-manager-X from CoreOS to Google COS
Production Change
Change Summary
We're continuing the rollout of Google Container-Optimized OS (COS) as the replacement for the CoreOS image used by our runners. A detailed description of why this is required can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504.
This issue covers the process of migrating the shared-runners-manager-X runners.
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @tmaczukin
- Change Reviewer - @steveazz
- Time tracking - 5 minutes + 5 * 10 minutes (as main steps will be executed in three subsequent days)
- Downtime Component - no downtime expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5
- Set label change::in-progress on this issue
- Merge the srm role stack cleanup merge request: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/347
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 5 (each time)
Migrate shared-runners-manager-3 runner manager
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/330 and wait for the pipeline to apply changes in chef
- Run knife ssh -C1 -afqdn 'roles:gitlab-runner-srm3' -- sudo chef-client to force configuration update
- Go through post-change check steps
- Set label change::scheduled on this issue
Migrate shared-runners-manager-4 and shared-runners-manager-5 runner managers
- Set label change::in-progress on this issue
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/348 and wait for the pipeline to apply changes in chef
- Run knife ssh -C1 -afqdn 'roles:gitlab-runner-srm4 OR roles:gitlab-runner-srm5' -- sudo chef-client to force configuration update
- Go through post-change check steps
- Set label change::scheduled on this issue
Migrate shared-runners-manager-6 and shared-runners-manager-7 runner managers
- Set label change::in-progress on this issue
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/349 and wait for the pipeline to apply changes in chef
- Run knife ssh -C1 -afqdn 'roles:gitlab-runner-srm6 OR roles:gitlab-runner-srm7' -- sudo chef-client to force configuration update
- Go through post-change check steps
- Set label change::scheduled on this issue
Migrate shared-runners-manager-X.staging runner managers
- Set label change::in-progress on this issue
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/350 and wait for the pipeline to apply changes in chef
- Run knife ssh -C1 -afqdn 'roles:gitlab-runner-stg-srm' -- sudo chef-client to force configuration update
- Go through post-change check steps
- Set label change::scheduled on this issue
Cleanup roles stack
- Set label change::in-progress on this issue
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/351 and wait for the pipeline to apply changes in chef
- Run knife ssh -C1 -afqdn 'roles:gitlab-runner-srm' -- sudo chef-client to force configuration update
- Go through post-change check steps
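Each batch above repeats the same knife ssh invocation with a different role query. As a rough sketch only, the query could be assembled from role names and the resulting command printed for review before it is run by hand. The helper names build_role_query and print_chef_run are hypothetical and not part of chef-repo or knife; the command itself is copied from the steps above.

```shell
# Hypothetical helper: joins role names into a knife search query,
# e.g. "roles:a OR roles:b". Not part of knife itself.
build_role_query() {
  query=""
  for role in "$@"; do
    if [ -n "$query" ]; then
      query="$query OR "
    fi
    query="${query}roles:${role}"
  done
  printf '%s\n' "$query"
}

# Prints (does not execute) the chef-client run for one batch,
# so the operator can review it before pasting it into a shell.
print_chef_run() {
  printf "knife ssh -C1 -afqdn '%s' -- sudo chef-client\n" "$(build_role_query "$@")"
}
```

For example, print_chef_run gitlab-runner-srm4 gitlab-runner-srm5 prints the command used for the second batch. Printing instead of executing keeps the operator in control of when the configuration update is forced.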
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 (each time)
- Confirm that there is no increase in failures on the runners (see the Monitoring section below)
- Log in to one of the shared runner managers, switch to the root account, list the recently created Docker Machine VMs, SSH to one of them, and confirm that it's using Google COS:
    sudo -i
    ls -lsctr ~/.docker/machine/machines
    docker-machine ssh [...] cat /etc/os-release
  Expected entries:
    NAME="Container-Optimized OS"
    ID=cos
    PRETTY_NAME="Container-Optimized OS from Google"
  Unexpected entries:
    NAME="Container Linux by CoreOS"
    ID=coreos
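The manual check above compares the ID field of /etc/os-release against the expected and unexpected values. A minimal sketch of the same comparison as a shell function follows. The function name classify_os_release and its output words are assumptions made for illustration; only the ID values cos and coreos come from the entries listed above.

```shell
# Hypothetical helper: reads os-release content on stdin and
# classifies the VM by its ID= field.
classify_os_release() {
  # "^ID=" is anchored to the line start, so VERSION_ID= and
  # BUILD_ID= lines are ignored. Quotes around the value are stripped.
  id=$(sed -n 's/^ID=//p' | tr -d '"' | head -n 1)
  case "$id" in
    cos)    echo "migrated" ;;      # Container-Optimized OS
    coreos) echo "not-migrated" ;;  # still on CoreOS
    *)      echo "unknown" ;;       # anything else needs a manual look
  esac
}
```

One way to use it is to pipe the output of the last command from the check into the function: docker-machine ssh [...] cat /etc/os-release | classify_os_release.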
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
(handle in this order!)
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/351
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/350
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/349
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/348
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/330
- Wait for the pipeline of the last revert to apply the changes in chef, then run knife ssh -C1 -afqdn 'roles:gitlab-runner-srm' -- sudo chef-client to force a configuration update
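The revert list above is the merge order of the change steps, reversed. As a small illustration, the order can be derived mechanically from the merge requests that were actually merged. The function name rollback_order is hypothetical; it only reverses its arguments and does not talk to GitLab.

```shell
# Hypothetical helper: prints its arguments in reverse order,
# i.e. the order in which merged MRs should be reverted.
rollback_order() {
  reversed=""
  for mr in "$@"; do
    # Prepend each MR so the most recently merged one comes first.
    reversed="$mr${reversed:+ $reversed}"
  done
  printf '%s\n' "$reversed"
}
```

If the change was stopped part-way, passing only the merge requests merged so far yields the shorter revert list for that state.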
Monitoring
Key metrics to observe
- Metric: Failures on GitLab Inc. runners
  - Location: https://dashboards.gitlab.net/d/ci-runners-incident-runner-manager/ci-runners-incident-support-runner-manager?viewPanel=23&orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-runner_job_failure_reason=All
  - What changes to this metric should prompt a rollback: Significant increase in the number of failures, especially system_runner_failure and unknown_failure
- Metric: Autoscaled VMs states, Autoscaling VM operation rate, Autoscaling VMs creation timing
  - Location: https://dashboards.gitlab.net/d/ci-runners-incident-autoscaling/ci-runners-incident-support-autoscaling?orgId=1&refresh=1m&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-shard=shared&var-runner_manager=All&var-jobs_running_for_project=0&var-gcp_exporter=shared-runners-manager-3.gitlab.com:9393&var-gcp_project=All&var-gcp_region=All
  - What changes to this metric should prompt a rollback: abnormalities in the autoscaling handling patterns
- Logs: Error rates on a specific host
  - Location: https://log.gprd.gitlab.net/goto/45b156fe54ac4986fc1ea50c90f4bf47
  - What changes to this metric should prompt a rollback: Large spike in failures
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents.