Job pods sometimes take a while to start with Runner on OpenShift

Summary

The customer's GitLab jobs run in an OpenShift cluster. The job pods usually start quickly, but about 2% of the time they take between 60 seconds and more than 8 minutes to reach a running state. Although the logs mention load, this sometimes happens even when the node in question does not seem particularly busy. That node has 48 CPU threads, 384 GB of RAM, and a 2 TB SSD. All nodes in the cluster are on the same VLAN and the same IP subnet in the same datacenter.
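To put numbers on the slow starts, one option is to compare each pod's creation time with the time its containers actually started. The sketch below is only an illustration, not something the customer has run: it assumes the pod list was exported as JSON (for example with `oc get pods -o json`), and the function names `startup_seconds` and `slow_pods` and the 60-second default threshold are hypothetical, chosen to match the delay described above.

```python
from datetime import datetime, timezone


def parse_ts(ts):
    # Kubernetes serializes timestamps as RFC 3339 in UTC, e.g. "2022-09-12T10:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def startup_seconds(pod):
    """Seconds from pod creation until the last container started, or None."""
    created = parse_ts(pod["metadata"]["creationTimestamp"])
    starts = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state") or {}
        # A container that already finished keeps its start time under "terminated".
        for key in ("running", "terminated"):
            started = (state.get(key) or {}).get("startedAt")
            if started:
                starts.append(parse_ts(started))
    if not starts:
        return None  # no container has started yet
    return (max(starts) - created).total_seconds()


def slow_pods(pod_list, threshold=60.0):
    """Return (pod name, delay in seconds) for pods at or above the threshold."""
    result = []
    for pod in pod_list.get("items", []):
        delay = startup_seconds(pod)
        if delay is not None and delay >= threshold:
            result.append((pod["metadata"]["name"], delay))
    return result
```

Loading the exported file with `json.load` and passing the result to `slow_pods` gives a list of the affected pods, which can then be lined up against node load at those times. Completed job pods are usually cleaned up quickly, so the export would need to be taken while the pods still exist.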

Possible fixes

Red Hat recommended that the customer apply patch 4.8.47. After applying the patch, the customer saw no change in performance and has since rolled it back. The customer has opened a ticket with Red Hat (we do not have the ticket number yet) and a ticket with the pubsec side of GitLab support (Zendesk ticket 3578).

We are asking the GitLab prod team to reach out to Red Hat about fixing the problem affecting the GitLab Runner pods.

Reference

Red Hat case # is 03274839

cc: @DarrenEastman @skamani

Edited Sep 12, 2022 by Darren Eastman