Corrective action: standardize/document the Runners scale up process
## Summary

<!--
Give context for what problem this issue is trying to prevent from happening again.
Provide a brief assessment of the risk (chance and impact) of the problem that this corrective action fixes, to assist with triage and prioritization.
-->

During https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17631+, one of the mitigation steps the team took was to scale up the `small` shard. That was not a straightforward task, although it [_was easier compared to a year ago_](find link).

The team referenced https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/ci-runners/linux/new-shards.md and figured out many of the steps as they scaled the fleet (see [slack thread](https://gitlab.slack.com/archives/C06LKDXT0U8/p1708517328584599)), but they still failed to:

1. Follow up on any errors the new runner-manager might have produced.
2. Update the firewall rules as needed, which eventually caused https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17636+.

## Related Incident(s)

<!--
Note the originating incident(s) and link known related incidents/other issues.
The relation will happen automatically if you are creating this issue from an incident, if this isn't done already please uncomment the following line:
/relate gitlab-com/gl-infra/production#ISSUE_ID
-->

Originating issue(s): production#17631, production#17636

## Desired Outcome/Acceptance Criteria

<!--
How will you know that this issue is complete?
If you have any initial thoughts on implementation details (e.g. what to do or not do, gotchas, edge cases etc.), please share them while they are fresh in your mind.
-->

The steps for scaling up an existing fleet are documented and standardized, preferably in a CR template.

## Associated Services

<!--
Apply the appropriate services associated with this corrective action if applicable.
/label production-engineering~13295602
-->

production-engineering~13295602

## Corrective Action Issue Checklist

* [x] Link the incident(s) this corrective action arose from
* [x] Give context for what problem this corrective action is trying to prevent re-occurring
* [x] Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] Assign a [priority](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#issue-priority) (this will default to 'Reliability::P4' but should match the severity of the related incident)
* [x] Assign a [service label](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#service-labels)