# Increase CPU capacity of srmX runner managers

**Production Change**

## Change Summary
The `shared-runners-manager-X` instances are still saturating their CPUs, which limits how many concurrent jobs the managers can handle. We have therefore decided to resize the instances from the currently used `e2-highcpu-16` machine type to `c2-standard-30`.

Since we will already be performing a 1-by-1 graceful shutdown of the srmX instances, we will also use the opportunity to increase the file descriptors limit, as requested in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13220.

As this change requires an instance shutdown, the operation needs to be done with the Graceful Shutdown procedure, instance by instance, with only one instance being terminated at a time.
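The file descriptors limit increase can be applied with a systemd drop-in for the `gitlab-runner` service. A minimal sketch, assuming the service runs under systemd - the `LimitNOFILE` value below is a placeholder, not the value mandated by the linked issue:

```
# /etc/systemd/system/gitlab-runner.service.d/override.conf
# NOTE: 65536 is an assumed example value - confirm the target limit
# in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13220
[Service]
LimitNOFILE=65536
```

After creating the drop-in, `sudo systemctl daemon-reload` must be run before the service starts again; since each srmX VM is restarted during the resize anyway, the new limit will take effect on boot.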
## Change Details
- **Services Impacted** - Service::CI Runners
- **Change Technician** - @tmaczukin
- **Change Criticality** - C3
- **Change Type** - change::scheduled
- **Change Reviewer** - @steveazz
- **Due Date** - 2021-04-28, starting at 10:00 UTC
- **Time tracking** - Rough estimate is ~10-12 hours for all 5 srmX machines. The change doesn't assume rollback steps per se (described below).
- **Downtime Component** - Every srmX instance will be down for roughly 2-2.5 hours (the time needed for the Graceful Shutdown, adjusting the instance settings, and restarting the VM). The CI Runners service should not be disrupted within this time, as only one manager will be taken out of the pool at a time.
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change
- [ ] Make sure that you meet the Administrator prerequisites before you start any work.
- [ ] Confirm that we are not in a PCL time window.
- [ ] Check that you have administrative access to the `gitlab-ci` project in the GCP console.
- [ ] Check https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&from=now-3h&to=now and confirm that the ci-runners Service Apdex score is above the SLO limit before starting the change rollout.
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120-150 minutes (2-2.5 hours) per instance; 600-750 minutes (10-12.5 hours) in total (with 5 instances being updated 1-by-1).

As our upgrading script hasn't yet been fixed to handle updates when the Runner version is not changed, we will most probably see failures related to system service management. This can be fixed by executing:

```
sudo apt-get install --reinstall gitlab-runner=13.11.0-rc1
```
#### Bastion preparation

- [ ] Ensure that you have auth agent forwarding set for the ci bastion in your SSH configuration:

  ```
  Host lb-bastion.ci.gitlab.com
      ForwardAgent yes
  ```

- [ ] Log in to the bastion server: `ssh lb-bastion.ci.gitlab.com`
- [ ] Ensure that your chef key file and knife configuration file are present at `~/.chef/`:

  ```
  $ ls ~/.chef/
  knife.rb  tmaczukin.pem
  ```

- [ ] Enter a `screen` or `tmux` session (whichever is most convenient for you).

**ALL KNIFE COMMANDS SHOULD BE EXECUTED FROM THIS SESSION.**
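Before touching production, it may be worth verifying from this session that knife can reach the whole srmX fleet. A harmless read-only sketch, using the same role query as the commands below:

```
knife ssh -a fqdn 'roles:gitlab-runner-srm' -- hostname
```

Each of the five managers should report its hostname; if any is missing, fix the knife setup before proceeding.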
#### Preparation

- [ ] Disable chef-client on srmX nodes:

  ```
  knife ssh -a fqdn 'roles:gitlab-runner-srm' -- sudo /root/runner_upgrade.sh stop_chef
  ```

- [ ] Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5407
- [ ] Run the `apply_to_prod` job from the merge pipeline
- [ ] Handle the update on staging runners:

  ```
  knife ssh -C1 -a fqdn 'roles:gitlab-runner-stg-srm' -- sudo /root/runner_upgrade.sh
  ```
#### srm3

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-3` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
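The GCP console steps above can also be done from the CLI once the Graceful Shutdown has stopped the VM. A sketch with `gcloud` - the zone value is an assumed example, verify the real one first:

```
# Verify the instance's zone first (the value below is an assumption):
# gcloud compute instances list --project=gitlab-ci --filter="name=shared-runners-manager-3"
ZONE=us-east1-c

# Change the machine type; this only works while the VM is stopped.
gcloud compute instances set-machine-type shared-runners-manager-3 \
  --project=gitlab-ci --zone="$ZONE" --machine-type=c2-standard-30

# Start the VM again, then resume the Graceful Shutdown procedure.
gcloud compute instances start shared-runners-manager-3 \
  --project=gitlab-ci --zone="$ZONE"
```

The same sequence applies to srm4-srm7 with the instance name changed.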
#### srm4

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-4` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm5

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-5` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm6

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-6` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm7

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-7` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
### Post-Change Steps - steps to take to verify the change

- [ ] Monitor https://dashboards.gitlab.net/d/alerts-sat_single_node_cpu/alerts-single_node_cpu-saturation-detail?orgId=1&from=now-24h&to=now&panelId=57960&tz=UTC&var-environment=gprd&var-type=ci-runners&var-stage=main and confirm that the CPU saturation of `shared-runners-manager-X` instances stays below 90% at peak load times.
## Rollback

The change doesn't define a rollback per se, as it mostly consists of a Graceful Shutdown - a process within which Runner has already been instructed to shut down, and we need to wait until it does so in order not to interrupt existing users' jobs. The resize of the instance itself doesn't bring any negative effects.

In case of incidents caused by this change while it is being worked on, we should stop the procedure immediately after the resize of the instance in progress is done, and not resume it until the metrics are back to normal.
### Rollback steps - steps to be taken in the event of a need to rollback this change

- [ ] Ensure that the termination of Runner has finished properly.
- [ ] Resize the GCE instance to `c2-standard-30`, as the VM is already shut down.
- [ ] Restore the resized instance to the pool.
- [ ] Stop proceeding with the change rollout until the metrics are back to normal.
## Monitoring

### Key metrics to observe

- **Metric**: ci-runners Service Apdex
- **What changes to this metric should prompt a rollback**: If the apdex value drops below the defined SLO, we should finish the resize of the instance currently being handled (we already need to wait until the Graceful Shutdown properly exits Runner's service), but we should not start the resize of another instance until the value is back above the SLO limit.
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [x] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

All five srmX instances will be resized from the `e2-highcpu-16` to the `c2-standard-30` machine type.
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.