Investigate cause of Runner pods stuck in Pending or ContainerCreating state with "Failed create pod sandbox" error
Overview
There are various reasons why a Kubernetes pod can become stuck in the ContainerCreating state with the error Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod runner-xxxxxx-project-xx-concurrent-xxxxxx.
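To confirm the symptom, list the job pods and filter cluster events for sandbox failures. The gitlab-runner namespace below is an assumption; substitute the namespace the runner schedules job pods into.
# Look for job pods stuck in Pending or ContainerCreating
kubectl get pods -n gitlab-runner
# Recent pod-sandbox creation failures, newest last
kubectl get events -n gitlab-runner --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp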
Some typical troubleshooting & resolution steps for this type of error are:
- Retrieve information about the pod.
kubectl describe pod example-runner-pod
- Check the logs on the failing pod.
kubectl logs example-runner-pod
- Check the node the pod is meant to be scheduled on.
kubectl describe node node-xxx
- Restart the failing node.
- Monitor node performance, specifically IOPS (throttled read/write operations); see the example commands after this list.
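A sketch of the node-side checks, where node-xxx is a placeholder and iostat assumes SSH access to the node with the sysstat package installed.
# List node conditions (look for Ready=False/Unknown, DiskPressure, PIDPressure)
kubectl get node node-xxx -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
# On the node itself, watch extended I/O statistics; sustained high await/%util suggests disk throttling
iostat -x 5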
Solutions identified to date by GitLab Runner users:
- The GitLab Runner nodes were missing kube-reserved and system-reserved resource reservations.
- Build jobs using 100% CPU, resulting in the node going NotReady.
- High IOWAIT times on Azure burstable-class VMs while cloning a 2 GB Git repository.
- On public clouds, check the read/write IOPS performance of the nodes. If you notice disk throttling, change to a machine class with higher IOPS; see the verification commands after this list.
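One rough way to check whether a node is missing kube/system reservations: compare the node's capacity and allocatable resources, and inspect the running kubelet configuration. The node name is a placeholder, and the configz call assumes you have permission to proxy to the kubelet through the API server.
# If capacity and allocatable are identical, no kube-reserved/system-reserved is being applied
kubectl get node node-xxx -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
# Dump the running kubelet configuration and look for kubeReserved / systemReserved
kubectl get --raw "/api/v1/nodes/node-xxx/proxy/configz"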
Tasks
- Set up a test k8s cluster on AWS EKS (machine spec TBD). Install GitLab Runner via Helm and try to reproduce the issue; a possible starting point is sketched after this list.
- Determine if there are changes we can make in the Runner code to better handle this error. If yes, document the proposed implementation steps.
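A rough sketch of the reproduction setup, assuming eksctl and the gitlab/gitlab-runner Helm chart. The cluster name, GitLab URL, and registration token are placeholders, and the node type is left open because the machine spec is still TBD.
# Create a small EKS test cluster; node type is a placeholder until the machine spec is decided
eksctl create cluster --name runner-sandbox-test --nodes 2 --node-type <node-type-tbd>
# Install GitLab Runner from the official Helm chart
helm repo add gitlab https://charts.gitlab.io
helm repo update
helm install gitlab-runner gitlab/gitlab-runner \
  --namespace gitlab-runner --create-namespace \
  --set gitlabUrl=https://gitlab.example.com/ \
  --set runnerRegistrationToken=<registration-token>
# Run CPU/IO-heavy CI jobs against the cluster and watch for sandbox failures
kubectl get events -n gitlab-runner --field-selector reason=FailedCreatePodSandBox -w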