Corrective action: standardize/document the Runners scale-up process

Summary

During 2024-02-21: concurrent operational limits on Sa... (production#17631 - closed), one of the mitigation steps the team took was to scale up the small shard. That was not a straightforward task, although it [was easier compared to a year ago](find link).

The team referenced https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/ci-runners/linux/new-shards.md and figured out many of the steps as they scaled the fleet (see Slack thread), but they still failed to:

  1. Follow up on any errors the new runner-manager might have produced.
  2. Update the firewall rules as needed, which eventually caused 2024-02-21: us-east1-d.ci-gateway.int.gprd.gitl... (production#17636 - closed).

Related Incident(s)

Originating issue(s): production#17631 (closed), production#17636 (closed)

Desired Outcome/Acceptance Criteria

The steps for scaling up an existing fleet are documented and standardized, preferably in a CR template.

Associated Services

Service::CI Runners

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context for the problem this corrective action is trying to prevent from recurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
  • Assign a service label
Edited by Rehab