Resize shared-runners-manager-X GCP instances

Production Change

Change Summary

The shared-runners-manager-X instances frequently saturate their local resources, especially the CPU. This limits how many concurrent jobs the managers can handle.

To address this, we will resize the instances from the current mix of custom machine types to e2-highcpu-16, used consistently across all srmX Runner Manager instances.

As this change requires an instance shutdown, the operation must follow the Graceful Shutdown procedure and be done instance by instance, with only one instance taken down at a time.
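The one-at-a-time procedure can be sketched as a small Python script that builds the command sequence for each instance. This is only an illustration under stated assumptions: the instance names and the zone are placeholders, and the Graceful Shutdown of the Runner process is not modelled because the exact command used on the managers is not given here. The `gcloud compute instances stop`, `set-machine-type` and `start` subcommands are the usual way to change the machine type of a stopped VM.

```python
from typing import List

TARGET_MACHINE_TYPE = "e2-highcpu-16"


def resize_plan(instance: str, zone: str) -> List[List[str]]:
    """Return the ordered gcloud commands for resizing one manager.

    The Graceful Shutdown of the Runner process must have finished before
    the first command runs; that step is intentionally not modelled here.
    """
    return [
        ["gcloud", "compute", "instances", "stop", instance, "--zone", zone],
        ["gcloud", "compute", "instances", "set-machine-type", instance,
         "--machine-type", TARGET_MACHINE_TYPE, "--zone", zone],
        ["gcloud", "compute", "instances", "start", instance, "--zone", zone],
    ]


def rolling_plan(instances: List[str], zone: str) -> List[List[str]]:
    """Concatenate per-instance plans so only one instance is down at a time."""
    plan: List[List[str]] = []
    for instance in instances:
        plan.extend(resize_plan(instance, zone))
    return plan


if __name__ == "__main__":
    # Hypothetical instance names and zone, for illustration only.
    for command in rolling_plan(["srm-example-1", "srm-example-2"], "example-zone"):
        print(" ".join(command))
```

The script only prints the commands. Each instance's three commands are emitted as one contiguous group, which mirrors the requirement that the next instance is not touched until the previous one has been started again.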

Change Details

  1. Services Impacted - ServiceCI Runners
  2. Change Technician - @tmaczukin, @igorwwwwwwwwwwwwwwwwwwww
  3. Change Criticality - C3
  4. Change Type - changescheduled
  5. Change Reviewer - @steveazz
  6. Due Date - 2021-03-16, starting at 10:00 UTC
  7. Time tracking - Rough estimate is ~10-12 hours for all 5 srmX machines. The change does not define rollback steps per se (described below).
  8. Downtime Component - Each srmX instance will be down for roughly 2-2.5 hours (the time needed for the Graceful Shutdown, adjusting the instance settings, and restarting the VM). ServiceCI Runners should not be disrupted during this time, as only one manager will be taken out of the pool at a time.

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 120-150 minutes (2-2.5 hours) per instance; 600-750 minutes (10-12.5 hours) in total, with the 5 instances updated one by one.
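The total follows directly from the per-instance estimate. A minimal check of the arithmetic, using only the numbers stated above:

```python
INSTANCES = 5
PER_INSTANCE_MINUTES = (120, 150)  # lower and upper estimate per instance

# Instances are handled strictly one after another, so the times add up.
total_minutes = tuple(INSTANCES * m for m in PER_INSTANCE_MINUTES)
total_hours = tuple(m / 60 for m in total_minutes)

print(total_minutes)  # (600, 750)
print(total_hours)    # (10.0, 12.5)
```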

Post-Change Steps - steps to take to verify the change

Monitor https://dashboards.gitlab.net/d/alerts-sat_single_node_cpu/alerts-single_node_cpu-saturation-detail?orgId=1&from=now-24h&to=now&panelId=57960&tz=UTC&var-environment=gprd&var-type=ci-runners&var-stage=main and confirm that the CPU saturation of the shared-runners-manager-X instances stays below 90% at peak load times.
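If the saturation values are exported from the dashboard, the 90% check could be expressed roughly as follows. This is a sketch, not part of the actual procedure: the function name, the input shape (instance name mapped to a list of saturation ratios between 0 and 1) and the sample values are all assumptions made for illustration.

```python
from typing import Dict, List

THRESHOLD = 0.90  # the 90% limit named in the post-change check


def saturated_instances(samples: Dict[str, List[float]],
                        threshold: float = THRESHOLD) -> List[str]:
    """Return the instances whose peak CPU saturation reaches the threshold."""
    return sorted(
        name for name, values in samples.items()
        if values and max(values) >= threshold
    )


if __name__ == "__main__":
    # Made-up sample data, not real measurements.
    example = {
        "srm-example-1": [0.41, 0.72, 0.88],
        "srm-example-2": [0.55, 0.93, 0.61],
    }
    print(saturated_instances(example))  # ['srm-example-2']
```

An empty result means every instance stayed below the threshold over the sampled window.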

Rollback

The change does not define a rollback per se, as it mostly consists of a Graceful Shutdown: a process in which the Runner has already been instructed to shut down, and we need to wait for it to finish so that running user jobs are not interrupted.

Resizing the instance itself does not introduce any negative effects.

If an incident is caused by this change while it is in progress, we should stop the procedure as soon as the resize of the instance currently being handled is done, and not resume it until the metrics are back to normal.

Rollback steps - steps to be taken in the event of a need to rollback this change

  • Ensure that the termination of the Runner has finished properly
  • Resize the GCE instance to e2-highcpu-16, as the VM is already shut down
  • Restore the resized instance to the pool
  • Do not proceed with the remaining instances until the metrics are back to normal.

Monitoring

Key metrics to observe

  • Metric: ci-runners Service Apdex
    • Location: https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?orgId=1&from=now-3h&to=now

    • What changes to this metric should prompt a rollback:

      If the apdex value drops below the defined SLO, we should finish the resize of the instance currently being handled (we have to wait for the Graceful Shutdown to exit the Runner's service properly anyway), but we should not start resizing another instance until the value is back above the SLO.
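The pause rule for this metric can be written down as a small decision function. This is a sketch under assumptions: the function name and return values are hypothetical, the SLO value has to be supplied by the caller, and treating an apdex exactly equal to the SLO as healthy is an interpretation of "drops below".

```python
def next_action(apdex: float, slo: float, resize_in_progress: bool) -> str:
    """Decide how to proceed based on the ci-runners apdex and its SLO."""
    if apdex >= slo:
        # Metric is healthy: the next instance may be started.
        return "continue"
    if resize_in_progress:
        # The Graceful Shutdown is already running and cannot be undone,
        # so finish this instance and then hold.
        return "finish-current-then-pause"
    # Nothing in flight: do not start another instance yet.
    return "pause"
```

For example, with a hypothetical SLO of 0.97, `next_action(0.95, 0.97, True)` returns `"finish-current-then-pause"`, and the procedure resumes only once the function returns `"continue"` again.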

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

All five srmX instances will be resized from custom machine types (currently 4 × 12 vCPU / 16 GB RAM and 1 × 10 vCPU / 16 GB RAM) to the e2-highcpu-16 machine type.
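To put the resize in numbers: e2-highcpu-16 provides 16 vCPUs and 16 GB of memory, so memory per instance stays the same while the vCPU count grows. A quick calculation of the fleet-wide change, based on the figures above:

```python
# Current fleet: four custom 12-vCPU instances and one custom 10-vCPU instance.
current_vcpus = 4 * 12 + 1 * 10
# Target fleet: five e2-highcpu-16 instances with 16 vCPUs each.
target_vcpus = 5 * 16

added_vcpus = target_vcpus - current_vcpus
print(current_vcpus, target_vcpus, added_vcpus)  # 58 80 22
```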

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Tomasz Maczukin