Corrective action: standardize/document the Runners scale up process (#3107) · Issues · GitLab.com / GitLab Infrastructure Team / Observability / Observability Issue Tracker

Corrective action: standardize/document the Runners scale up process

## Summary

During https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17631+, one of the mitigation steps the team took was to scale up the `small` shard. That was not a straightforward task, although it [_was easier compared to a year ago_](find link). The team referenced https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/ci-runners/linux/new-shards.md and figured out many of the steps as they scaled the fleet (see [slack thread](https://gitlab.slack.com/archives/C06LKDXT0U8/p1708517328584599)), but they still failed to:

1. Follow up on any errors the new runner-manager might have produced.
2. Update the firewall rules as needed, which eventually caused https://gitlab.com/gitlab-com/gl-infra/production/-/issues/17636+.

## Related Incident(s)

Originating issue(s): production#17631 production#17636

## Desired Outcome/Acceptance Criteria

The steps for scaling up an existing fleet are documented and standardized, preferably in a CR template.

## Associated Services

production-engineering~13295602

## Corrective Action Issue Checklist

* [x] Link the incident(s) this corrective action arose from
* [x] Give context for what problem this corrective action is trying to prevent from recurring
* [x] Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
* [x] Assign a [priority](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#issue-priority) (this will default to 'Reliability::P4' but should match the severity of the related incident)
* [x] Assign a [service label](https://about.gitlab.com/handbook/engineering/infrastructure/team/reliability/issues.html#service-labels)
