Resize shared-runners-manager-X GCP instances
Production Change
Change Summary
The shared-runners-manager-X instances frequently saturate their local resources, especially the CPU. This limits how many concurrent jobs the managers can handle.
To address this, we will resize the instances from the current mix of custom machine types to e2-highcpu-16, used consistently across all srmX Runner Manager instances.
As this change requires an instance shutdown, the operation must follow the Graceful Shutdown procedure and be done instance by instance, with only one instance stopped at a time.
Change Details
- Services Impacted - ServiceCI Runners
- Change Technician - @tmaczukin, @igorwwwwwwwwwwwwwwwwwwww
- Change Criticality - C3
- Change Type - changescheduled
- Change Reviewer - @steveazz
- Due Date - 2021-03-16, starting at 10:00 UTC
- Time tracking - Rough estimate is ~10-12 hours for all 5 srmX machines. The change doesn't assume rollback steps per se (described below).
- Downtime Component - Every srmX instance will be down for roughly 2-2.5 hours (time needed for Graceful Shutdown, for adjusting instance settings, and for restarting the VM). ServiceCI Runners should not be disrupted during this time, as only one manager will be taken out of the pool at a time.
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
- Make sure that you meet Administrator prerequisites before you start any work.
- Not in a PCL time window.
- Check that you have administrative access to the gitlab-ci project in GCP console.
- Check https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&from=now-3h&to=now and confirm that the ci-runners Service Apdex score is above the SLO limit before starting the change rollout.
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120-150 minutes (2-2.5 hours) per instance, 600-750 minutes (10-12.5 hours) in total (with 5 instances being updated one by one).
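The total follows directly from the per-instance estimate. As a quick check of the arithmetic, a minimal sketch (the function name is illustrative only and not part of any tooling):

```python
def total_estimate(per_instance_min, instances):
    """Return (low, high) totals in minutes for a one-by-one rollout."""
    low, high = per_instance_min
    return low * instances, high * instances


low, high = total_estimate((120, 150), instances=5)
# Prints: 600-750 minutes, i.e. 10-12.5 hours
print(f"{low}-{high} minutes, i.e. {low / 60:g}-{high / 60:g} hours")
```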
- shared-runners-manager-3
  - Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
  - When the instance is stopped, go to GCP console and change the specification of the shared-runners-manager-3 instance to use the e2-highcpu-16 machine type.
  - When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
- shared-runners-manager-4
  - Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
  - When the instance is stopped, go to GCP console and change the specification of the shared-runners-manager-4 instance to use the e2-highcpu-16 machine type.
  - When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
- shared-runners-manager-5
  - Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
  - When the instance is stopped, go to GCP console and change the specification of the shared-runners-manager-5 instance to use the e2-highcpu-16 machine type.
  - When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
- shared-runners-manager-6
  - Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
  - When the instance is stopped, go to GCP console and change the specification of the shared-runners-manager-6 instance to use the e2-highcpu-16 machine type.
  - When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
- shared-runners-manager-7
  - Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
  - When the instance is stopped, go to GCP console and change the specification of the shared-runners-manager-7 instance to use the e2-highcpu-16 machine type.
  - When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
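The steps above use GCP console for the resize. For reference, the same resize could likely be expressed with the gcloud CLI. The sketch below only builds and prints the commands and does not run them. It assumes the instances live in the gitlab-ci project, and the zone is a placeholder because it is not stated in this issue. The helper names are illustrative. The Graceful Shutdown procedure still has to be followed before and after the resize.

```python
import shlex

PROJECT = "gitlab-ci"           # GCP project named in the pre-change steps
ZONE = "ZONE_PLACEHOLDER"       # assumption: the real zone is not given in this issue
MACHINE_TYPE = "e2-highcpu-16"
MANAGERS = [f"shared-runners-manager-{n}" for n in (3, 4, 5, 6, 7)]


def status_command(instance):
    """Command that prints the VM status (expect TERMINATED before resizing)."""
    return [
        "gcloud", "compute", "instances", "describe", instance,
        f"--project={PROJECT}", f"--zone={ZONE}", "--format=value(status)",
    ]


def resize_command(instance):
    """Command that changes the machine type of an already stopped VM."""
    return [
        "gcloud", "compute", "instances", "set-machine-type", instance,
        f"--project={PROJECT}", f"--zone={ZONE}",
        f"--machine-type={MACHINE_TYPE}",
    ]


# Print the commands one instance at a time; nothing is executed here.
for name in MANAGERS:
    print(shlex.join(status_command(name)))
    print(shlex.join(resize_command(name)))
```

Printing rather than executing keeps the operator in control of the one-instance-at-a-time ordering that this change requires.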
Post-Change Steps - steps to take to verify the change
Monitor https://dashboards.gitlab.net/d/alerts-sat_single_node_cpu/alerts-single_node_cpu-saturation-detail?orgId=1&from=now-24h&to=now&panelId=57960&tz=UTC&var-environment=gprd&var-type=ci-runners&var-stage=main and confirm that the CPU saturation of the shared-runners-manager-X instances stays below 90% at peak load times.
Rollback
The change doesn't define a rollback per se, as it mostly consists of a Graceful Shutdown: a process in which Runner has already been instructed to shut down, and we need to wait until it does so in order not to interrupt users' existing jobs.
Resizing the instance itself doesn't bring any negative effects.
If an incident is caused by this change while it is in progress, we should stop the procedure immediately after the resize of the ongoing instance is done and not resume it until the metrics are back to normal.
Rollback steps - steps to be taken in the event of a need to rollback this change
- Ensure that the termination of Runner has finished properly.
- Resize the GCE instance to e2-highcpu-16, as the VM is already shut down.
- Restore the resized instance to the pool.
- Stop proceeding with the change rollout until the metrics are back to normal.
Monitoring
Key metrics to observe
- Metric: ci-runners Service Apdex
  - Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?orgId=1&from=now-3h&to=now
  - What changes to this metric should prompt a rollback: If the apdex value drops below the defined SLO, we should finish the resize of the instance that is being handled at the moment (we already need to wait until Graceful Shutdown properly exits Runner's service), but we should not start the resize of another instance until the value is back above the SLO limit.
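The rule above can be summarized as a small decision function. This is only a sketch of the reasoning, not existing tooling: the function name, the return strings, and the example SLO value are hypothetical, and treating an apdex exactly equal to the SLO as "hold" is a conservative assumption.

```python
def next_step(apdex, slo, resize_in_progress):
    """Decide how to proceed given the current apdex and the SLO limit.

    An in-flight resize is always finished, because Graceful Shutdown has
    already started. A new instance is only started while apdex is above
    the SLO (equality is treated as "hold", which is an assumption).
    """
    if resize_in_progress:
        return "finish current instance"
    if apdex > slo:
        return "start next instance"
    return "hold until apdex recovers"


# Hypothetical values for illustration only; the real SLO is on the dashboard.
print(next_step(apdex=0.95, slo=0.97, resize_in_progress=True))
print(next_step(apdex=0.95, slo=0.97, resize_in_progress=False))
print(next_step(apdex=0.99, slo=0.97, resize_in_progress=False))
```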
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
All five srmX instances will be resized from custom machine types (4x 12 vCPU 16GB RAM and 1x 10 vCPU 16GB RAM right now) to the e2-highcpu-16 instance type.
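To make the capacity change concrete, the sketch below totals the vCPUs before and after the resize. The current counts come from the sentence above. The target assumes the standard e2-highcpu-16 shape of 16 vCPUs and 16 GB of memory per instance, so memory per instance stays the same. The variable and function names are illustrative.

```python
# Current custom machine types as (instance count, vCPUs per instance).
current = [(4, 12), (1, 10)]
# Target: five e2-highcpu-16 instances with 16 vCPUs each.
target = [(5, 16)]


def total_vcpus(fleet):
    """Sum vCPUs across a fleet described as (count, vcpus) pairs."""
    return sum(count * vcpus for count, vcpus in fleet)


before, after = total_vcpus(current), total_vcpus(target)
# Prints: vCPUs: 58 -> 80 (+22)
print(f"vCPUs: {before} -> {after} (+{after - before})")
```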
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.