Kubernetes runner - Pods stuck in Pending or ContainerCreating due to "Failed create pod sandbox"
Summary
We're experiencing intermittent issues with gitlab-runner using the Kubernetes executor (deployed using the first-party Helm charts). An estimated 5% of our runner Pods get stuck in a `Pending` or `ContainerCreating` state and never start. (We don't have this issue with any of our other workloads.)
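For scale, this is roughly how we count the stuck Pods (note that `ContainerCreating` Pods still report phase `Pending`, so one phase filter catches both; the namespace is ours):

```shell
# ContainerCreating Pods are still phase=Pending, so this catches both of
# the stuck states described above; the STATUS column distinguishes them.
kubectl -n gitlab get pods --field-selector=status.phase=Pending
```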
Expected behavior
The runner pods should start within 60 seconds (depending on image size).
Relevant logs and/or screenshots
Job log:

```plaintext
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
```
`kubectl describe pod runner-fppqzpdg-project-31-concurrent-097xdq -n gitlab`:

```plaintext
Events:
  Type     Reason                  Age                  From                                                       Message
  ----     ------                  ----                 ----                                                       -------
  Normal   Scheduled               10m                  default-scheduler                                          Successfully assigned gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to ip-10-200-22-69.ap-southeast-2.compute.internal
  Warning  FailedCreatePodSandBox  93s (x4 over 8m13s)  kubelet, ip-10-200-22-69.ap-southeast-2.compute.internal   Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-q-r1em9v-project-31-concurrent-3hzrts": operation timeout: context deadline exceeded
```
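In case it helps triage, this is roughly how we dig into the sandbox failure. The node name comes from the event above; `aws-node-xxxxx` is a placeholder for the actual CNI Pod name:

```shell
# Sandbox creation goes through the container runtime and the CNI plugin,
# so the VPC CNI (aws-node) Pod on the affected node is the first stop.
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide \
  | grep ip-10-200-22-69.ap-southeast-2.compute.internal
kubectl -n kube-system logs aws-node-xxxxx --since=30m

# On the node itself (via SSH/SSM), kubelet and Docker logs usually show
# why sandbox creation hit "operation timeout: context deadline exceeded".
journalctl -u kubelet --since "30 min ago" | grep -i sandbox
journalctl -u docker --since "30 min ago"
```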
Environment description
- Kubernetes 1.15.10 on AWS EKS (with latest/recommended CNI, CoreDNS and Kube Proxy versions from here)
- GitLab 12.9.3
- GitLab Runner 12.9
- We are setting `KUBERNETES_POLL_TIMEOUT` to `360` and `KUBERNETES_POLL_INTERVAL` to `5`
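For reference, a sketch of how we understand these two variables to map onto the executor's settings; they're read as environment variables by the runner manager and correspond to `poll_timeout` / `poll_interval` under `[runners.kubernetes]` in `config.toml`:

```shell
# Environment variables read by the Kubernetes executor; equivalent to
# poll_timeout / poll_interval under [runners.kubernetes] in config.toml.
export KUBERNETES_POLL_TIMEOUT=360   # max seconds to wait for the Pod to be running
export KUBERNETES_POLL_INTERVAL=5    # seconds between Pod status checks
```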
- We're mounting the Node's `/var/lib/docker` and `/var/run/docker.sock` into the runner Pods (by modifying the `entrypoint` key in the `gitlab-runner` ConfigMap):
```shell
cat << EOF >> /home/gitlab-runner/.gitlab-runner/config.toml
[[runners.kubernetes.volumes.host_path]]
  name = "docker"
  mount_path = "/var/run/docker.sock"
  read_only = false
  host_path = "/var/run/docker.sock"
[[runners.kubernetes.volumes.host_path]]
  name = "dockerlib"
  mount_path = "/var/lib/docker"
  read_only = false
  host_path = "/var/lib/docker"
EOF
```
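A quick sanity check we can run from inside a job to confirm the mounted socket is usable (assuming a `docker` CLI is present in the build image):

```shell
# Talks to the host's Docker daemon through the mounted socket; prints the
# daemon version if the host_path mount works, errors out otherwise.
docker -H unix:///var/run/docker.sock info --format '{{.ServerVersion}}'
```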
- We have dedicated Nodes (`c5.2xlarge`) for the runner jobs (using `taints`, `tolerations` and a `nodeSelector`) and resource Requests and Limits set (see the config sketch after the snippet). Snippet from `kubectl describe pod <runner pod>`:
```plaintext
Node-Selectors:  NodeGroup=gitlab-runner
Containers:
  build:
    Limits:
      cpu:     4
      memory:  8Gi
    Requests:
      cpu:     1
      memory:  1Gi
  helper:
    Limits:
      cpu:     1
      memory:  2Gi
    Requests:
      cpu:     100m
      memory:  100Mi
```
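For completeness, a sketch of how those scheduling constraints can be expressed in `config.toml`, appended the same way as the volume mounts above. The taint key/value `dedicated=gitlab-runner` is illustrative, not our exact taint:

```shell
cat << EOF >> /home/gitlab-runner/.gitlab-runner/config.toml
[runners.kubernetes.node_selector]
  NodeGroup = "gitlab-runner"
# Tolerations are "key=value" = "Effect" pairs; this taint is a placeholder.
[runners.kubernetes.node_tolerations]
  "dedicated=gitlab-runner" = "NoSchedule"
EOF
```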
- We have autoscaling configured for the gitlab-runner Nodes. This does work when Pods are `Pending` because resource requests exceed the available capacity, but the autoscaler "knows" that these Pods are `ContainerCreating` or `Pending` for a different reason and so doesn't try to scale (see the sketch below).
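A sketch of how we confirm the autoscaler's view of these Pods (resource names depend on how cluster-autoscaler is deployed; `deploy/cluster-autoscaler` is a placeholder):

```shell
# cluster-autoscaler records unschedulable Pods and scale-up decisions in
# a status ConfigMap and in its logs.
kubectl -n kube-system describe configmap cluster-autoscaler-status
kubectl -n kube-system logs deploy/cluster-autoscaler --since=30m \
  | grep -iE 'unschedulable|scale[- ]?up'
```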
Used GitLab Runner version
`gitlab-runner --version`:

```plaintext
Version:      12.9.0
Git revision: 4c96e5ad
Git branch:   12-9-stable
GO version:   go1.13.8
Built:        2020-03-20T13:01:56+0000
OS/Arch:      linux/amd64
```