CPU saturation on 1 of 4 shared-runners-manager hosts
One of our four shared-runners-manager (SRM) hosts spontaneously jumped to 100% CPU usage.
Affected host:

* shared-runners-manager-3.gitlab.com

Healthy peers:

* shared-runners-manager-4.gitlab.com
* shared-runners-manager-5.gitlab.com
* shared-runners-manager-6.gitlab.com
This may be because the host is repeatedly trying and failing to create new VMs on which to run CI jobs. Compared to its healthy peers, shared-runners-manager-3.gitlab.com has an abnormally large number of `docker-machine <subcommand>` processes, many of them running subcommands that create or stop VMs.
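One way to quantify "abnormally large" is to count `docker-machine` processes by subcommand on the affected host and compare against a healthy peer. A rough sketch, assuming standard procps `ps`/`pgrep` (exact invocations on the SRM hosts may differ):

```shell
# Total number of docker-machine processes on this host.
pgrep -cf docker-machine

# Break the processes down by subcommand (create, rm, stop, ...)
# to see which operations are piling up.
ps -eo args | awk '$1 ~ /docker-machine$/ { print $2 }' | sort | uniq -c | sort -rn
```

Running the same commands on shared-runners-manager-4/5/6 gives a baseline for comparison.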
For reference, here's the Slack thread where @nnelson, @ahanselka, and I have been jointly investigating this:
https://gitlab.slack.com/archives/CD6HFD1L0/p1580497681231000
The current working hypothesis is that something is wrong with the TLS certificate used by `docker-machine create`, causing `gitlab-runner` to frequently try and fail to create VMs to execute CI jobs.
The queue of pending CI jobs is not building up a backlog, perhaps because the other three shared-runners-manager hosts are still healthy.
From the Host Stats dashboard, here is the CPU usage graph: