Shared Runner Managers are possibly underprovisioned

Shared runner managers need to be scaled up. On one host, for example, tasks spend 1.2s waiting for a CPU for every second that they run.

CPU is pinned near 100% for much of the day.

The 15-minute load average is around 2 per core.
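
For anyone wanting to check the per-core figure on a host directly, here is a minimal sketch, assuming a Unix host with Python available. The helper name load_per_core is hypothetical and not part of any runner tooling. A value well above 1 suggests more runnable tasks than cores.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


if __name__ == "__main__":
    try:
        # os.getloadavg() returns the 1-, 5-, and 15-minute load averages.
        _, _, load15 = os.getloadavg()
    except OSError:
        print("load average is not available on this platform")
    else:
        cores = os.cpu_count() or 1
        print(f"15-minute load per core: {load_per_core(load15, cores):.2f}")
```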

Details

On shared-runners-manager-3.gitlab.com, the rate of node_schedstat_waiting_seconds_total is up at 125%.
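
To make the percentage concrete: the metric is a counter of seconds that tasks spent waiting for a CPU, so its per-second rate times 100 reads as a percentage. The sketch below shows the basic arithmetic with hypothetical sample values. It is a simplification, since the real Prometheus rate() also handles counter resets and extrapolation.

```python
def waiting_percent(v_start: float, v_end: float, interval_seconds: float) -> float:
    """Approximate rate(counter) * 100 from two counter samples.

    v_start and v_end are counter readings in seconds of accumulated
    wait time; interval_seconds is the wall-clock time between them.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (v_end - v_start) / interval_seconds * 100


# Hypothetical samples: 4500s of wait accumulated over a 1h window.
print(waiting_percent(0.0, 4500.0, 3600.0))
```

A result above 100% is possible because several tasks can be waiting for a CPU at the same time.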

https://prometheus.gprd.gitlab.net/graph?g0.expr=max%20by%20(fqdn)%20(rate(node_schedstat_waiting_seconds_total%7Bfqdn%3D%22shared-runners-manager-3.gitlab.com%22%2C%20type%3D%22ci-runners%22%7D%5B1h%5D)*100)%0A%0A&g0.tab=0&g0.stacked=0&g0.range_input=2w
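
The query in that link is URL-encoded and hard to read. A small sketch using only the Python standard library recovers the PromQL expression; the helper name extract_expr is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

GRAPH_URL = (
    "https://prometheus.gprd.gitlab.net/graph?g0.expr=max%20by%20(fqdn)%20"
    "(rate(node_schedstat_waiting_seconds_total%7Bfqdn%3D%22"
    "shared-runners-manager-3.gitlab.com%22%2C%20type%3D%22ci-runners%22%7D"
    "%5B1h%5D)*100)%0A%0A&g0.tab=0&g0.stacked=0&g0.range_input=2w"
)


def extract_expr(url: str) -> str:
    """Return the decoded g0.expr query parameter from a Prometheus graph URL."""
    params = parse_qs(urlsplit(url).query)
    # parse_qs percent-decodes values; strip the trailing newlines (%0A).
    return params["g0.expr"][0].strip()


print(extract_expr(GRAPH_URL))
```

The decoded expression takes the 1h rate of the wait counter, multiplies it by 100, and keeps the maximum per fqdn, graphed over a two-week range.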

https://dashboards.gitlab.net/d/ci-runners-main/ci-runners-overview?orgId=1

cc @dawsmith @tmaczukin