# Increase CPU capacity of srmX runner managers

**Production Change**

## Change Summary
The `shared-runners-manager-X` instances are still saturating their CPUs, which limits how many concurrent jobs the managers can handle. We have therefore decided to resize the instances from the currently used `e2-highcpu-16` machine type to `c2-standard-30`.

Since we will already be performing a 1-by-1 graceful shutdown of the srmX instances, we will also use the opportunity to increase the file descriptors limit, as requested in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13220.

As this change requires an instance shutdown, the operation needs to be done with the Graceful Shutdown procedure, instance by instance, with only one instance being terminated at a time.
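The file descriptors limit increase can be applied with a systemd drop-in for the `gitlab-runner` service. A minimal sketch, assuming the service runs under systemd - the `LimitNOFILE` value below is a placeholder, not the value mandated by the linked issue:

```
# /etc/systemd/system/gitlab-runner.service.d/override.conf
# NOTE: 65536 is an assumed example value - confirm the target limit
# in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13220
[Service]
LimitNOFILE=65536
```

After creating the drop-in, `sudo systemctl daemon-reload` must be run before the service starts again; since each srmX VM is restarted during the resize anyway, the new limit will take effect on boot.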
## Change Details
- **Services Impacted** - Service::CI Runners
- **Change Technician** - @tmaczukin
- **Change Criticality** - C3
- **Change Type** - change::scheduled
- **Change Reviewer** - @steveazz
- **Due Date** - 2021-04-28, starting at 10:00 UTC
- **Time tracking** - Rough estimate is ~10-12 hours for all 5 srmX machines. The change doesn't assume rollback steps per se (described below).
- **Downtime Component** - Every srmX instance will be down for roughly 2-2.5 hours (the time needed for the Graceful Shutdown, adjusting the instance settings, and restarting the VM). The CI Runners service should not be disrupted within this time, as only one manager will be taken out of the pool at a time.
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change
- [ ] Make sure that you meet the Administrator prerequisites before you start any work.
- [ ] Confirm that we are not in a PCL time window.
- [ ] Check that you have administrative access to the `gitlab-ci` project in the GCP console.
- [ ] Check https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?viewPanel=79474957&orgId=1&from=now-3h&to=now and confirm that the ci-runners Service Apdex score is above the SLO limit before starting the change rollout.
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120-150 minutes (2-2.5 hours) per instance; 600-750 minutes (10-12.5 hours) in total (with 5 instances being updated 1-by-1).

As our upgrading script hasn't yet been fixed to handle updates when the Runner version is not changed, we will most probably see failures related to system service management. This can be fixed by executing:

```
sudo apt-get install --reinstall gitlab-runner=13.11.0-rc1
```
#### Bastion preparation

- [ ] Ensure that you have auth agent forwarding set for the ci bastion in your SSH configuration:

  ```
  Host lb-bastion.ci.gitlab.com
      ForwardAgent yes
  ```

- [ ] Log in to the bastion server: `ssh lb-bastion.ci.gitlab.com`
- [ ] Ensure that your chef key file and knife configuration file are present at `~/.chef/`:

  ```
  $ ls ~/.chef/
  knife.rb  tmaczukin.pem
  ```

- [ ] Enter a `screen` or `tmux` session (whichever is most convenient for you).

**ALL KNIFE COMMANDS SHOULD BE EXECUTED FROM THIS SESSION.**
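Before touching production, it may be worth verifying from this session that knife can reach the whole srmX fleet. A harmless read-only sketch, using the same role query as the commands below:

```
knife ssh -a fqdn 'roles:gitlab-runner-srm' -- hostname
```

Each of the five managers should report its hostname; if any is missing, fix the knife setup before proceeding.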
#### Preparation

- [ ] Disable chef-client on srmX nodes:

  ```
  knife ssh -a fqdn 'roles:gitlab-runner-srm' -- sudo /root/runner_upgrade.sh stop_chef
  ```

- [ ] Merge https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/5407
- [ ] Run the `apply_to_prod` job from the merge pipeline
- [ ] Handle the update on staging runners:

  ```
  knife ssh -C1 -a fqdn 'roles:gitlab-runner-stg-srm' -- sudo /root/runner_upgrade.sh
  ```
#### srm3

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-3` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
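The GCP console steps above can also be done from the CLI once the Graceful Shutdown has stopped the VM. A sketch with `gcloud` - the zone value is an assumed example, verify the real one first:

```
# Verify the instance's zone first (the value below is an assumption):
# gcloud compute instances list --project=gitlab-ci --filter="name=shared-runners-manager-3"
ZONE=us-east1-c

# Change the machine type; this only works while the VM is stopped.
gcloud compute instances set-machine-type shared-runners-manager-3 \
  --project=gitlab-ci --zone="$ZONE" --machine-type=c2-standard-30

# Start the VM again, then resume the Graceful Shutdown procedure.
gcloud compute instances start shared-runners-manager-3 \
  --project=gitlab-ci --zone="$ZONE"
```

The same sequence applies to srm4-srm7 with the instance name changed.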
#### srm4

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-4` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm5

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-5` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm6

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-6` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
#### srm7

- [ ] Follow the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure, taking the "If you want to stop the VM" path, until the "Do whatever you needed to do with Runner's VM terminated" step.
- [ ] When the instance is stopped, go to the GCP console and change the specification of the `shared-runners-manager-7` instance to use the `c2-standard-30` machine type.
- [ ] When the specification is saved, go back to the "How to stop or restart Runner Manager's VM with Graceful Shutdown" procedure and continue until it's fully done.
### Post-Change Steps - steps to take to verify the change

- [ ] Monitor https://dashboards.gitlab.net/d/alerts-sat_single_node_cpu/alerts-single_node_cpu-saturation-detail?orgId=1&from=now-24h&to=now&panelId=57960&tz=UTC&var-environment=gprd&var-type=ci-runners&var-stage=main and confirm that the CPU saturation of `shared-runners-manager-X` instances stays below 90% at peak load times.
## Rollback

The change doesn't define a rollback per se, as it mostly consists of a Graceful Shutdown - a process within which Runner has already been instructed to shut down, and we need to wait until it does so in order not to interrupt existing users' jobs. The resize of the instance itself doesn't bring any negative effects.

In case of incidents caused by this change while it is being worked on, we should stop the procedure immediately after the resize of the instance in progress is done, and not resume it until the metrics are back to normal.
### Rollback steps - steps to be taken in the event of a need to rollback this change

- [ ] Ensure that the termination of Runner has finished properly.
- [ ] Resize the GCE instance to `c2-standard-30`, as the VM is already shut down.
- [ ] Restore the resized instance to the pool.
- [ ] Stop proceeding with the change rollout until the metrics are back to normal.
## Monitoring

### Key metrics to observe

- **Metric**: ci-runners Service Apdex
- **What changes to this metric should prompt a rollback**: If the apdex value drops below the defined SLO, we should finish the resize of the instance currently being handled (we already need to wait until the Graceful Shutdown properly exits Runner's service), but we should not start the resize of another instance until the value is back above the SLO limit.
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [x] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

All five srmX instances will be resized from the `e2-highcpu-16` to the `c2-standard-30` machine type.
## Changes checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] There are currently no active incidents.