Kubernetes runner - Pods stuck in Pending or ContainerCreating due to "Failed create pod sandbox"
## Summary
We're experiencing intermittent issues with `gitlab-runner` using the Kubernetes executor (deployed via the first-party Helm chart).
An estimated 5% of our runner Pods get stuck in a `Pending` or `ContainerCreating` state and never start.
(None of our other workloads are affected by this.)
## Expected behavior
The runner Pods should start within 60 seconds (depending on image size).
## Relevant logs and/or screenshots
```plaintext
130 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
131 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
132 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
133 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
134 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
135 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
136 Waiting for pod gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to be running, status is Pending
```
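These "Waiting for pod" lines come from the runner's poll loop. With our settings (`KUBERNETES_POLL_INTERVAL=5`, `KUBERNETES_POLL_TIMEOUT=360`) the runner polls roughly every 5 seconds and gives up after 360 seconds, i.e. about 72 attempts. As a rough diagnostic sketch (the `stuck_pods` helper and its threshold are our own, not part of gitlab-runner), this scans a job log for pods repeatedly reported `Pending`:

```python
import re

POLL_TIMEOUT = 360   # our KUBERNETES_POLL_TIMEOUT, in seconds
POLL_INTERVAL = 5    # our KUBERNETES_POLL_INTERVAL, in seconds

# Upper bound on poll attempts before the runner gives up on the pod.
max_attempts = POLL_TIMEOUT // POLL_INTERVAL  # 72

LINE_RE = re.compile(r"Waiting for pod (\S+) to be running, status is (\w+)")

def stuck_pods(log_lines, threshold=10):
    """Return {pod: count} for pods seen Pending more than `threshold` times."""
    counts = {}
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and m.group(2) == "Pending":
            counts[m.group(1)] = counts.get(m.group(1), 0) + 1
    return {pod: n for pod, n in counts.items() if n > threshold}
```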
```plaintext
$ kubectl describe pod runner-fppqzpdg-project-31-concurrent-097xdq -n gitlab
Events:
  Type     Reason                  Age                  From                                                      Message
  ----     ------                  ----                 ----                                                      -------
  Normal   Scheduled               10m                  default-scheduler                                         Successfully assigned gitlab/runner-q-r1em9v-project-31-concurrent-3hzrts to ip-10-200-22-69.ap-southeast-2.compute.internal
  Warning  FailedCreatePodSandBox  93s (x4 over 8m13s)  kubelet, ip-10-200-22-69.ap-southeast-2.compute.internal  Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-q-r1em9v-project-31-concurrent-3hzrts": operation timeout: context deadline exceeded
```
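The `FailedCreatePodSandBox` warning means the kubelet's CRI request to set up the pod sandbox (including its network namespace, via the CNI plugin) timed out before any job container was started. To gauge how widespread this is, the warning events can be filtered out of `kubectl get events -n gitlab -o json`; `sandbox_failures` below is our own throwaway helper operating on the standard Event fields, not part of any tool:

```python
def sandbox_failures(events):
    """Given the dict parsed from `kubectl get events -n gitlab -o json`,
    return (pod name, message) pairs for sandbox-creation failures."""
    return [
        (e["involvedObject"]["name"], e["message"])
        for e in events.get("items", [])
        if e.get("reason") == "FailedCreatePodSandBox"
    ]
```

Grouping the results by the event's source node is a quick way to check whether the failures cluster on specific runner Nodes.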
## Environment description
- Kubernetes `1.15.10` on AWS EKS (with the latest/recommended CNI, CoreDNS and Kube Proxy versions from here)
- GitLab `12.9.3`
- GitLab Runner `12.9`
- We are setting `KUBERNETES_POLL_TIMEOUT` to `360` and `KUBERNETES_POLL_INTERVAL` to `5`
- We're mounting the Node's `/var/lib/docker` and `/var/run/docker.sock` into the runner Pods (by modifying the `entrypoint` key in the `gitlab-runner` ConfigMap):
  ```shell
  cat << EOF >> /home/gitlab-runner/.gitlab-runner/config.toml
    [[runners.kubernetes.volumes.host_path]]
      name = "docker"
      mount_path = "/var/run/docker.sock"
      read_only = false
      host_path = "/var/run/docker.sock"
    [[runners.kubernetes.volumes.host_path]]
      name = "dockerlib"
      mount_path = "/var/lib/docker"
      read_only = false
      host_path = "/var/lib/docker"
  EOF
  ```
- We have dedicated Nodes (`c5.2xlarge`) for the runner jobs (using `taints`, `tolerations` and a `nodeSelector`) and resource Requests and Limits set. Snippet from `kubectl describe pod <runner pod>`:
  ```plaintext
  Node-Selectors:  NodeGroup=gitlab-runner
  Containers:
    build:
      Limits:
        cpu:     4
        memory:  8Gi
      Requests:
        cpu:     1
        memory:  1Gi
    helper:
      Limits:
        cpu:     1
        memory:  2Gi
      Requests:
        cpu:     100m
        memory:  100Mi
  ```
- We have autoscaling configured for the gitlab-runner Nodes. This does work when Pods are `Pending` because their resource requests exceed the available capacity, but the autoscaler "knows" that these stuck Pods are `ContainerCreating` or `Pending` for a different reason, and so doesn't try to scale.
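The cluster-autoscaler only scales up for Pods whose `PodScheduled` condition is `False` with reason `Unschedulable`; our stuck Pods were already assigned to a Node, so they never trigger a scale-up. A minimal sketch of that distinction (`pending_reason` is our own helper, assuming pod JSON as returned by `kubectl get pod -o json`):

```python
def pending_reason(pod):
    """Classify a Pending pod from its status conditions:
    'unschedulable'        -> cluster-autoscaler will react,
    'scheduled-but-stuck'  -> already on a Node (e.g. sandbox creation
                              failing); the autoscaler ignores it."""
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("type") == "PodScheduled":
            if cond.get("status") == "False" and cond.get("reason") == "Unschedulable":
                return "unschedulable"
            if cond.get("status") == "True":
                return "scheduled-but-stuck"
    return "unknown"
```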
## Used GitLab Runner version
```plaintext
$ gitlab-runner --version
Version:      12.9.0
Git revision: 4c96e5ad
Git branch:   12-9-stable
GO version:   go1.13.8
Built:        2020-03-20T13:01:56+0000
OS/Arch:      linux/amd64
```
Edited by Matt Parkes