Investigate, track, and resolve the constraint of jobs timing out due to insufficient private runners
Context
There was an incident (production#8107 (closed)) during security release 15.4.6 where jobs timed out; it was resolved by adding more private runners. This issue is to gather more data on private runner resource usage around security release/tagging time, and to determine whether this was a one-off problem or whether we should preemptively ask to increase the number of private runners prior to the tagging process. If the conclusion of this investigation is that we are running out of private runners during the tagging process, the exit criterion is a change to our release checklist adding an item to coordinate with the DRI to temporarily increase the number of private runners.
Starting point
There's a dashboard for ci-runners scale metrics: https://dashboards.gitlab.net/d/ci-runners-incident-autoscaling/ci-runners-incident-support-autoscaling?orgId=1

Within that dashboard, there is a panel specifically for jobs started on runners (by shard). By selecting the `private` shard, we can check whether there was a spike in new jobs on those runners, as was done during the incident investigation.
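The same inspection can also be scripted against the Prometheus API that backs the dashboard, which makes it easier to compare several release windows. The sketch below is illustrative only: the Prometheus endpoint, the metric name (`gitlab_runner_jobs_total`), the `shard` label, and the time window are assumptions and should be confirmed against the panel's actual query before relying on the numbers.

```python
import requests

# Hypothetical sketch: pull the rate of jobs started on the "private" runner
# shard around tagging time from the Prometheus backing the dashboard.
# The endpoint, metric name, and label below are assumptions, not confirmed
# values; copy the real query from the "jobs started on runners (by shard)"
# panel in the dashboard linked above.
PROM_URL = "https://prometheus.example.gitlab.net/api/v1/query_range"  # placeholder endpoint
QUERY = 'sum(rate(gitlab_runner_jobs_total{shard="private"}[5m]))'  # assumed metric/label

resp = requests.get(
    PROM_URL,
    params={
        "query": QUERY,
        "start": "2022-11-30T00:00:00Z",  # example window; use the actual tagging window
        "end": "2022-11-30T12:00:00Z",
        "step": "300",  # 5-minute resolution, in seconds
    },
    timeout=30,
)
resp.raise_for_status()

# Print (timestamp, jobs-per-second) samples so spikes are easy to eyeball.
for series in resp.json()["data"]["result"]:
    for ts, value in series["values"]:
        print(ts, value)
```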