2025-08-27: macos runners update to 18.3.0~pre.128.g86b6c639-1 and host images upgrade
Production Change
Change Summary
This upgrades the runner version for inactive macOS runners to 18.3.0~pre.128.g86b6c639-1. The macOS host image is also upgraded as part of this change; that upgrade has already been merged via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/11856.
MR to be reviewed: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/6353+
We previously tried 18.4, but that attempt was aborted because of incompatibilities in the caching command. See also #20430 (comment 2712197274)
See also https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27408+
This change upgrades the following inactive macOS runner shards:
Shard Name | Inactive deployment color |
---|---|
saas-macos-staging | green |
saas-macos-medium-m1 | blue |
saas-macos-large-m2pro | green |
Change Details
- Services Impacted - Service::CI Runners
- Change Technician - @joe-shaw
- Change Reviewer - @rehab
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-08-27 19:00
- Time tracking - 90 minutes
- Downtime Component - none
Set Maintenance Mode in GitLab
If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
Detailed steps for the change
Pre-execution steps
- Make sure all tasks in Change Technician checklist are done
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - The SRE on-call provided approval with the eoc_approved label on the issue.
- For C1, C2, or blocks deployments change issues, Release managers have been informed prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity::1 or severity::2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Change steps - steps to take to execute the change
- Set label change::in-progress: `/label ~change::in-progress`
- Check that the m2pro AWS environment (acc 730335264460) has enough dedicated hosts to scale into. Ideally there are enough spare hosts to handle the current job load.
  - You should already have access to the AWS account for macOS runners: navigate to us-east-1 -> EC2 -> Dedicated Hosts. There should be at least 4 "empty" hosts there ready to be used.
- Merge MR to upgrade the gitlab-runner version on the inactive shards: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/6353
- Shift the traffic to the inactive fleet.
  - Inform the EOC via #production: @sre-oncall I'm going to shift traffic to the upgraded deployments for the Macos Runners shards `saas-macos-large-m2pro`, `saas-macos-medium-m1` and `saas-macos-staging`. Details in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20430.
  - saas-macos-staging/green
    - Start saas-macos-staging/green via #production: `/runner run start saas-macos-staging green`
    - Wait for new deployment to become active (check the dashboard).
    - Run a staging pipeline to verify the runner version works properly - https://gitlab.com/gitlab-org/ci-cd/tests/saas-runners-tests/macos-platform/saas-macos-staging-basic-test
    - Confirm saas-macos-staging/green started accepting new jobs
    - Stop saas-macos-staging/blue (double check the previously active color) via #production: `/runner run stop saas-macos-staging blue`
  - saas-macos-medium-m1/blue
    - Start saas-macos-medium-m1/blue via #production: `/runner run start saas-macos-medium-m1 blue`
    - Wait for new deployment to start executing jobs (check the dashboard).
    - Confirm saas-macos-medium-m1/blue started accepting new jobs
    - Stop saas-macos-medium-m1/green (double check the previously active color) via #production: `/runner run stop saas-macos-medium-m1 green`
  - saas-macos-large-m2pro/green
    - Start saas-macos-large-m2pro/green via #production: `/runner run start saas-macos-large-m2pro green`
    - Wait for new deployment to start executing jobs (check the dashboard).
    - Confirm saas-macos-large-m2pro/green started accepting new jobs
    - Stop saas-macos-large-m2pro/blue (double check the previously active color) via #production: `/runner run stop saas-macos-large-m2pro blue`
- Inform EOC that the procedure is finished. In the thread started in #production with the first message, post: @sre-oncall This has been completed. New colors are taking the jobs. Previous colors are in the draining state and will be completely drained within the next 2-3 hours.
- Set label change::complete: `/label ~change::complete`
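The dedicated-host check in the steps above is done by hand in the AWS console. As a rough illustration only, the same rule ("at least 4 empty hosts") can be expressed over data shaped like an EC2 `describe_hosts` response. This sketch assumes that "empty" means a host in the `available` state with no instances on it; that definition, the host IDs, and the sample data are all hypothetical and not taken from the real account.

```python
from typing import Any

# Threshold from the change step: at least 4 "empty" dedicated hosts.
MIN_EMPTY_HOSTS = 4


def empty_hosts(response: dict[str, Any]) -> list[str]:
    """Return IDs of hosts that are available and run no instances.

    Assumption: "empty" = State is "available" and Instances is empty.
    """
    return [
        host["HostId"]
        for host in response.get("Hosts", [])
        if host.get("State") == "available" and not host.get("Instances")
    ]


def enough_capacity(response: dict[str, Any], minimum: int = MIN_EMPTY_HOSTS) -> bool:
    """True when the number of empty hosts meets the required minimum."""
    return len(empty_hosts(response)) >= minimum


# Hypothetical sample shaped like a describe_hosts response (made-up IDs).
SAMPLE = {
    "Hosts": [
        {"HostId": "h-example-1", "State": "available", "Instances": []},
        {"HostId": "h-example-2", "State": "available", "Instances": []},
        {"HostId": "h-example-3", "State": "available",
         "Instances": [{"InstanceId": "i-example-a"}]},
        {"HostId": "h-example-4", "State": "pending", "Instances": []},
        {"HostId": "h-example-5", "State": "available", "Instances": []},
    ]
}

if __name__ == "__main__":
    print("empty hosts:", empty_hosts(SAMPLE))
    print("enough capacity:", enough_capacity(SAMPLE))
```

With real data, the input would presumably come from something like `boto3.client("ec2", region_name="us-east-1").describe_hosts()`, keeping in mind that the response can be paginated. The console check described in the steps remains the documented procedure.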
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (60 mins)
- Revert linked MRs
- Reverse the rollout steps above. If necessary, start any stopped deployments and stop newly started deployments.
- Set label change::aborted: `/label ~change::aborted`
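To make "reverse the rollout steps" concrete, the following sketch derives the rollback ChatOps commands from the rollout plan. The `/runner run start|stop <shard> <color>` syntax and the shard/color pairs come from the change steps; the `ShardSwitch` type, the function names, and the choice to roll back shards in reverse order are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSwitch:
    """One blue/green switch: new_color is started, old_color is stopped."""
    shard: str
    new_color: str
    old_color: str


# Shard and color pairs as listed in the change steps.
ROLLOUT = [
    ShardSwitch("saas-macos-staging", "green", "blue"),
    ShardSwitch("saas-macos-medium-m1", "blue", "green"),
    ShardSwitch("saas-macos-large-m2pro", "green", "blue"),
]


def rollout_commands(switches: list[ShardSwitch]) -> list[str]:
    """Start the upgraded color, then stop the previously active one."""
    commands = []
    for s in switches:
        commands.append(f"/runner run start {s.shard} {s.new_color}")
        commands.append(f"/runner run stop {s.shard} {s.old_color}")
    return commands


def rollback_commands(switches: list[ShardSwitch]) -> list[str]:
    """Mirror of the rollout: restart the old color, stop the new one.

    Assumption: shards are rolled back in reverse rollout order.
    """
    commands = []
    for s in reversed(switches):
        commands.append(f"/runner run start {s.shard} {s.old_color}")
        commands.append(f"/runner run stop {s.shard} {s.new_color}")
    return commands


if __name__ == "__main__":
    for line in rollback_commands(ROLLOUT):
        print(line)
```

In practice only the shards that were actually switched need rolling back, so the list passed to `rollback_commands` would be the subset of `ROLLOUT` completed before the abort.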
Monitoring
Key metrics to observe
Metric: CI Runners Overview
- Location: CI runners deployment overviews for the macOS shards.
- Should show new runners coming online and starting to run jobs before the older deployments are stopped.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
- For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #production channel, mention @release-managers and this issue and await their acknowledgment.)