Kubernetes runner - Pods stuck in Pending or ContainerCreating due to "Failed create pod sandbox"
## Summary
We're experiencing intermittent issues with `gitlab-runner` using the Kubernetes executor (deployed via the first-party Helm chart).
An estimated 5% of our runner Pods get stuck in a `Pending` or `ContainerCreating` state and never start.
(None of our other workloads are affected by this.)
## Expected behavior
The runner Pods should start within 60 seconds (depending on image size).
## Relevant logs and/or screenshots
```plaintext
130 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
131 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
132 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
133 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
134 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
135 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
136 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
```
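These "Waiting for pod" lines come from the runner's poll loop. With our settings (`KUBERNETES_POLL_INTERVAL=5`, `KUBERNETES_POLL_TIMEOUT=360`) the runner polls roughly every 5 seconds and gives up after 360 seconds, i.e. about 72 attempts. As a rough diagnostic sketch (the `stuck_pods` helper and its threshold are our own, not part of gitlab-runner), this scans a job log for pods repeatedly reported `Pending`:

```python
import re

POLL_TIMEOUT = 360   # our KUBERNETES_POLL_TIMEOUT, in seconds
POLL_INTERVAL = 5    # our KUBERNETES_POLL_INTERVAL, in seconds

# Upper bound on poll attempts before the runner gives up on the pod.
max_attempts = POLL_TIMEOUT // POLL_INTERVAL  # 72

LINE_RE = re.compile(r"Waiting for pod (\S+) to be running, status is (\w+)")

def stuck_pods(log_lines, threshold=10):
    """Return {pod: count} for pods seen Pending more than `threshold` times."""
    counts = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and m.group(2) == "Pending":
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return {pod: n for pod, n in counts.items() if n > threshold}
```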
```plaintext
$ kubectl describe pod runner-fppqzpdg-project-31-concurrent-097xdq -n gitlab
Events:
  Type     Reason                  Age                  From                                                      Message
  ----     ------                  ----                 ----                                                      -------
  Normal   Scheduled               10m                  default-scheduler                                         Successfully assigned gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to ip-10-200-22-69.ap-southeast-2.compute.internal
  Warning  FailedCreatePodSandBox  93s (x4 over 8m13s)  kubelet, ip-10-200-22-69.ap-southeast-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-q-r1em9v-project-31-concurrent-3hzrts": operation timeout: context deadline exceeded
```
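The `FailedCreatePodSandBox` warning means the kubelet's CRI request to set up the pod sandbox (including its network namespace, via the CNI plugin) timed out before any job container was started. To gauge how widespread this is, the warning events can be filtered out of `kubectl get events -n gitlab -o json`; `sandbox_failures` below is our own throwaway helper operating on the standard Event fields, not part of any tool:

```python
def sandbox_failures(events):
    """Given the dict parsed from `kubectl get events -n gitlab -o json`,
    return (pod name, message) pairs for sandbox-creation failures."""
    return [
        (e["involvedObject"]["name"], e["message"])
        for e in events.get("items", [])
        if e.get("reason") == "FailedCreatePodSandBox"
    ]
```

Grouping the results by the event's source node is a quick way to check whether the failures cluster on specific runner Nodes.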
## Environment description
- Kubernetes `1.15.10` on AWS EKS (with the latest/recommended CNI, CoreDNS and Kube Proxy versions from here)
- GitLab `12.9.3`
- GitLab Runner `12.9`
- We are setting `KUBERNETES_POLL_TIMEOUT` to `360` and `KUBERNETES_POLL_INTERVAL` to `5`
- We're mounting the Node's `/var/lib/docker` and `/var/run/docker.sock` into the runner Pods (by modifying the `entrypoint` key in the `gitlab-runner` ConfigMap):
  ```shell
  cat << EOF >> /home/gitlab-runner/.gitlab-runner/config.toml
    [[runners.kubernetes.volumes.host_path]]
      name = "docker"
      mount_path = "/var/run/docker.sock"
      read_only = false
      host_path = "/var/run/docker.sock"
    [[runners.kubernetes.volumes.host_path]]
      name = "dockerlib"
      mount_path = "/var/lib/docker"
      read_only = false
      host_path = "/var/lib/docker"
  EOF
  ```
- We have dedicated Nodes (`c5.2xlarge`) for the runner jobs (using `taints`, `tolerations` and a `nodeSelector`) and resource Requests and Limits set. Snippet from `kubectl describe pod <runner pod>`:
  ```plaintext
  Node-Selectors:  NodeGroup=gitlab-runner
  Containers:
    build:
      Limits:
        cpu:     4
        memory:  8Gi
      Requests:
        cpu:     1
        memory:  1Gi
    helper:
      Limits:
        cpu:     1
        memory:  2Gi
      Requests:
        cpu:     100m
        memory:  100Mi
  ```
- We have autoscaling configured for the gitlab-runner Nodes. This does work when Pods are `Pending` because their resource requests exceed the available capacity, but the autoscaler "knows" that these stuck Pods are `ContainerCreating` or `Pending` for a different reason, and so doesn't try to scale.
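The cluster-autoscaler only scales up for Pods whose `PodScheduled` condition is `False` with reason `Unschedulable`; our stuck Pods were already assigned to a Node, so they never trigger a scale-up. A minimal sketch of that distinction (`pending_reason` is our own helper, assuming pod JSON as returned by `kubectl get pod -o json`):

```python
def pending_reason(pod):
    """Classify a Pending pod from its status conditions:
    'unschedulable'        -> cluster-autoscaler will react,
    'scheduled-but-stuck'  -> already on a Node (e.g. sandbox creation
                              failing); the autoscaler ignores it."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled":
            if cond.get("status") == "False" and cond.get("reason") == "Unschedulable":
                return "unschedulable"
            if cond.get("status") == "True":
                return "scheduled-but-stuck"
    return "unknown"
```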
## Used GitLab Runner version
```plaintext
$ gitlab-runner --version
Version:      12.9.0
Git revision: 4c96e5ad
Git branch:   12-9-stable
GO version:   go1.13.8
Built:        2020-03-20T13:01:56+0000
OS/Arch:      linux/amd64
```
Edited by Matt Parkes