Investigate cause of Runner pods stuck in the Pending or ContainerCreating state with the "Failed create pod sandbox" error
Overview
There are various reasons why a Kubernetes pod can get stuck in the ContainerCreating state with the following error: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod runner-xxxxxx-project-xx-concurrent-xxxxxx
Some typical troubleshooting and resolution steps for this type of error are:
- Retrieve information about the pod.
kubectl describe pod example-runner-pod
- Check the logs on the failing pod.
kubectl logs example-runner-pod
- Check the node the pod is meant to be scheduled on.
kubectl describe node node-xxx
- Restart the failing node.
- Monitor node performance, specifically IOPS (throttled read/write operations).
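The first steps above can be sketched as a small shell helper that finds the affected pods and describes each one. This is only a sketch: the function names stuck_pods and describe_stuck_pods are hypothetical, and it assumes the default column layout of kubectl get pods (NAME READY STATUS RESTARTS AGE) for a single namespace.

```shell
#!/bin/sh
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# (assumed columns: NAME READY STATUS RESTARTS AGE) and prints the names of
# pods whose STATUS is Pending or ContainerCreating.
stuck_pods() {
  awk '$3 == "Pending" || $3 == "ContainerCreating" { print $1 }'
}

# Hypothetical helper: runs `kubectl describe pod` for every stuck pod in
# the namespace passed as the first argument. Requires kubectl and cluster access.
describe_stuck_pods() {
  namespace="$1"
  kubectl get pods -n "$namespace" --no-headers | stuck_pods |
    while read -r pod; do
      kubectl describe pod "$pod" -n "$namespace"
    done
}
```

For example, describe_stuck_pods gitlab-runner would describe every stuck pod in a namespace named gitlab-runner (the namespace name is an assumption). The Events section of each description is where the sandbox error normally appears.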
Solutions identified to date by GitLab Runner users:
- The GitLab Runner nodes were missing kubelet resource reservations (kube-reserved and system-reserved).
- Build jobs were using 100% of the CPU, which caused the node to go NotReady.
- High IOWAIT times occurred on Azure burstable-class VMs while cloning a 2 GB Git repository.
- On public clouds, check the read/write IOPS performance of the nodes. If you notice disk throttling on the nodes, change to a machine class with higher IOPS.
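One way to check for missing reservations is to compare a node's CPU capacity with its allocatable CPU: if the two are equal, no CPU has been set aside for the kubelet or system daemons. The sketch below is illustrative only. The function names are hypothetical, and the conversion handles only whole-core values such as 4 and millicore values such as 3920m, not decimals such as 0.5.

```shell
#!/bin/sh
# Convert a Kubernetes CPU quantity to millicores.
# Assumption: the value is either whole cores ("4") or millicores ("3920m").
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  echo $(( $1 * 1000 )) ;;
  esac
}

# Print capacity minus allocatable in millicores.
# A result of 0 suggests that no CPU reservation is configured.
reserved_cpu_millicores() {
  echo $(( $(to_millicores "$1") - $(to_millicores "$2") ))
}

# Hypothetical wrapper: reads both values from the node status.
# Requires kubectl and cluster access.
node_reserved_cpu() {
  node="$1"
  capacity=$(kubectl get node "$node" -o jsonpath='{.status.capacity.cpu}')
  allocatable=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.cpu}')
  reserved_cpu_millicores "$capacity" "$allocatable"
}
```

For example, node_reserved_cpu node-xxx prints the reserved CPU for the node in millicores. Memory can be compared the same way, but allocatable memory also excludes the eviction threshold, so a nonzero difference there does not by itself prove that a reservation is set.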
Tasks
- Set up a test k8s cluster on AWS EKS (machine spec TBD). Install GitLab Runner via Helm and try to reproduce the issue.
- Determine if there are changes we can make in the Runner code to better handle this error. If yes, document the proposed implementation steps.
Related issues
Edited by Darren Eastman