Upgrade base image for private runners to COS 93 LTS
Production Change
Change Summary
We currently use Google Container-Optimized OS 85 LTS, whose support ends in December.
As per https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13905, we're going to migrate to the newest long-term support version, 93 LTS, which will be supported until October 2023 and so gives us almost two years of support.
As the first step, we will upgrade the private shard of our ServiceCI Runners fleet. If problems occur, they will affect only us. Further steps will be handled in another change management issue.
The upgrade for private will be done by merging the merge request that updates the COS version and builds a new beta version of our base VM image. As our private runners are configured to use this base image, they will automatically start using the new image as soon as it is built and published.
Change Details
- Services Impacted - ServiceCI Runners
- Change Technician - @tmaczukin
- Change Reviewer - @akohlbecker
- Time tracking - 65 min
- Downtime Component - No downtime expected
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 1 min
- Set label changein-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 30 min
- Merge https://dev.gitlab.org/cookbooks/packer-runner-machines/-/merge_requests/46
- Wait for the image to be built and published
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 4 min
- Wait for a new ephemeral runner to be created using the new image
- SSH to the ephemeral runner and check that all needed services are working
- Confirm that we don't see any unexpected errors on the dashboards or in the logs
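As part of the SSH check, it may help to confirm that the ephemeral runner actually booted from the new COS release. The sketch below is not part of the existing tooling: it assumes the image exposes a standard `/etc/os-release` file with `ID=cos` and `VERSION_ID` set to the milestone number, and that its contents are captured over SSH and evaluated on the operator's workstation. The function names `parse_os_release` and `is_expected_cos` are hypothetical.

```python
def parse_os_release(text: str) -> dict:
    """Parse os-release style KEY=VALUE lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in single or double quotes.
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields


def is_expected_cos(text: str, milestone: str = "93") -> bool:
    """Return True if the os-release text describes COS at the given milestone.

    Assumption: the image reports ID=cos and VERSION_ID=<milestone>.
    """
    fields = parse_os_release(text)
    return fields.get("ID") == "cos" and fields.get("VERSION_ID") == milestone
```

For example, the output of `cat /etc/os-release` from the runner can be passed to `is_expected_cos(...)`; a `False` result would suggest the runner is still using the old base image.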
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 30 min
Monitoring
Key metrics to observe
- Metric:
- Location: CI Runners Incident Support: autoscaling
- What changes to this metric should prompt a rollback:
- Suspicious graphs of the autoscaling panels
- Metric:
- Location: CI Runners Incident Support: runner manager
- What changes to this metric should prompt a rollback:
- Elevated number of detected job failures
- Suspicious graph of system resources
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgement.)
- There are currently no active incidents.