Migrate shared-runners-manager-X from CoreOS to Google COS

Production Change

Change Summary

We're continuing the rollout of Google Container-Optimized OS (COS) as the replacement for the CoreOS image used by our runners. A detailed description of why this is required can be found at https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12504.

This issue will cover the process of migrating shared-runners-manager-X runners.

Change Details

  1. Services Impacted - Service::CI Runners
  2. Change Technician - @tmaczukin
  3. Change Reviewer - @steveazz
  4. Time tracking - 5 minutes + 5 * 10 minutes (as the main steps will be executed on three consecutive days)
  5. Downtime Component - no downtime expected

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 5 (each time)

Migrate shared-runners-manager-3 runner manager

Migrate shared-runners-manager-4 and shared-runners-manager-5 runner managers

Migrate shared-runners-manager-6 and shared-runners-manager-7 runner managers

Migrate shared-runners-manager-X.staging runner managers

Cleanup roles stack

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 (each time)

  • Confirm that there is no increase in failures on the Runners (see the Monitoring section below)

  • Log in to one of the shared runner managers, switch to the root account, list the recently created Docker Machine VMs, ssh to one of them, and confirm that it's using Google COS:

    sudo -i
    ls -lsctr ~/.docker/machine/machines
    docker-machine ssh [...] cat /etc/os-release

    Expected entries:

    NAME="Container-Optimized OS"
    ID=cos
    PRETTY_NAME="Container-Optimized OS from Google"

    Unexpected entries:

    NAME="Container Linux by CoreOS"
    ID=coreos
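
The manual comparison above could be scripted. The following is a minimal sketch, not part of the original procedure: `check_os_release` is a hypothetical helper that reads the content of `/etc/os-release` from standard input and classifies it by its `ID=` line, using the expected (`cos`) and unexpected (`coreos`) values listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not part of the original change plan).
# Reads /etc/os-release content from stdin and classifies it by the ID= line.
# Prints "cos", "coreos", or "unknown"; returns 0 only for Google COS.
check_os_release() {
  local id
  # Take the first line starting with ID= (VERSION_ID= etc. do not match)
  # and strip optional double quotes around the value.
  id="$(sed -n 's/^ID=//p' | tr -d '"' | head -n 1)"
  case "$id" in
    cos)    echo "cos";     return 0 ;;
    coreos) echo "coreos";  return 1 ;;
    *)      echo "unknown"; return 2 ;;
  esac
}

# Possible usage on a runner manager, as root. "$machine" is a placeholder
# for a machine name taken from the listing step:
#   docker-machine ssh "$machine" cat /etc/os-release | check_os_release
```

A non-zero exit status makes the check easy to use in a loop over several machines, but the thresholds for acting on a `coreos` or `unknown` result are left to the change technician.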

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

(handle in this order!)

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  • There are currently no active incidents.
Edited by Steve Xuereb