Corrective action: standardize/document the Runners scale up process
Summary
During 2024-02-21: concurrent operational limits on Sa... (production#17631 - closed), one of the mitigation steps the team took was to scale up the small shard. That was not a straightforward task, although it [was easier compared to a year ago](find link).
The team referenced https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/ci-runners/linux/new-shards.md and figured out many of the steps as they scaled the fleet (see Slack thread), but they still failed to:
- Follow up on any errors the new runner-manager might have produced.
- Update the firewall rules as needed, which eventually caused 2024-02-21: us-east1-d.ci-gateway.int.gprd.gitl... (production#17636 - closed).
Related Incident(s)
Originating issue(s): production#17631 (closed), production#17636 (closed)
Desired Outcome/Acceptance Criteria
The steps for scaling up an existing fleet are documented and standardized, preferably in a CR template.
Associated Services
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
- Assign a service label
Edited by Rehab