CPU saturation on 1 of 4 shared-runners-manager hosts
One of our four shared-runners-manager (SRM) hosts spontaneously jumped to 100% CPU usage.
Affected host:

* shared-runners-manager-3.gitlab.com

Healthy peers:

* shared-runners-manager-4.gitlab.com
* shared-runners-manager-5.gitlab.com
* shared-runners-manager-6.gitlab.com
This may be because the host is repeatedly trying and failing to create new VMs on which to run CI jobs. Compared to its healthy peers, shared-runners-manager-3.gitlab.com has an abnormally large number of `docker-machine <subcommand>` processes, many of them running subcommands that create or stop VMs.
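One way to quantify "abnormally large" is to count `docker-machine` processes by subcommand on the affected host and compare against a healthy peer. A rough sketch, assuming standard procps `ps`/`pgrep` (exact invocations on the SRM hosts may differ):

```shell
# Total number of docker-machine processes on this host.
pgrep -cf docker-machine

# Break the processes down by subcommand (create, rm, stop, ...)
# to see which operations are piling up.
ps -eo args | awk '$1 ~ /docker-machine$/ { print $2 }' | sort | uniq -c | sort -rn
```

Running the same commands on shared-runners-manager-4/5/6 gives a baseline for comparison.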
For reference, here's the Slack thread where @nnelson, @ahanselka, and I have been jointly investigating this:
https://gitlab.slack.com/archives/CD6HFD1L0/p1580497681231000
The current working hypothesis is that something is wrong with the TLS certificate used by `docker-machine create`, causing `gitlab-runner` to frequently try and fail to create VMs to execute CI jobs.
The queue of pending CI jobs is not building up a backlog, perhaps because the other three shared-runners-manager hosts are still healthy.
From the Host Stats dashboard, here is the CPU usage graph: