Investigate cause of Runner pods stuck in the Pending or ContainerCreating state with a "Failed create pod sandbox" error

Overview

A Kubernetes pod can get stuck in the ContainerCreating state for various reasons, reporting an error such as: Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod runner-xxxxxx-project-xx-concurrent-xxxxxx

Typical troubleshooting and resolution steps for this type of error are:

  1. Retrieve information about the pod, including its events: kubectl describe pod example-runner-pod
  2. Check the logs on the failing pod: kubectl logs example-runner-pod
  3. Check the node the pod is meant to be scheduled on: kubectl describe node node-xxx
  4. Restart the failing node.
  5. Monitor node performance, specifically disk IOPS, and look for throttled read/write operations.
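
When many runner pods are affected, reading kubectl describe output one pod at a time gets slow. The following is a minimal Python sketch, not part of GitLab Runner, that counts sandbox-creation failures per pod from the JSON printed by kubectl get events -o json. It assumes the kubelet reports these failures with the event reason FailedCreatePodSandBox; the function name and the script name used below are hypothetical.

```python
import json
import sys
from collections import Counter

# Event reason the kubelet is assumed to use for sandbox-creation failures.
SANDBOX_REASON = "FailedCreatePodSandBox"


def sandbox_failures(events: dict) -> Counter:
    """Count sandbox-creation failure events per pod.

    `events` is the parsed output of `kubectl get events -o json`.
    """
    counts = Counter()
    for item in events.get("items", []):
        if item.get("reason") != SANDBOX_REASON:
            continue
        obj = item.get("involvedObject", {})
        if obj.get("kind") != "Pod":
            continue
        # `count` can be absent or null; treat that as a single occurrence.
        counts[obj.get("name", "<unknown>")] += item.get("count") or 1
    return counts


if __name__ == "__main__":
    # Read the events JSON from stdin and print the worst-affected pods first.
    for pod, n in sandbox_failures(json.load(sys.stdin)).most_common():
        print(f"{n:5d}  {pod}")
```

Saved as sandbox_failures.py (a hypothetical name), it could be run as: kubectl get events -n <namespace> -o json | python sandbox_failures.py. Pods that cluster on one node point toward a node-level cause rather than a problem with an individual job.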

Causes and fixes identified to date by GitLab Runner users:

  1. The GitLab Runner nodes were missing kube-reserved and system-reserved resource reservations.
  2. Build jobs used 100% CPU, which caused the node to go NotReady.
  3. High iowait times on Azure burstable-class VMs while cloning a 2 GB Git repository.
  4. On public clouds, check the read/write IOPS performance of the nodes. If you notice disk throttling on the nodes, change to a machine class with higher IOPS.
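
The first item in this list can be checked from node status. Kubernetes computes a node's allocatable resources as capacity minus kube-reserved, system-reserved, and the eviction threshold, so a node whose allocatable CPU equals its CPU capacity most likely has no CPU reservation configured. The sketch below applies that heuristic to the output of kubectl get nodes -o json. The function names are hypothetical, and the check is only an indicator: it says nothing about memory or ephemeral-storage reservations.

```python
def cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("4", "0.5", "3920m") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def nodes_without_cpu_reservation(nodes: dict) -> list[str]:
    """Return names of nodes whose allocatable CPU equals their CPU capacity.

    `nodes` is the parsed output of `kubectl get nodes -o json`.
    """
    missing = []
    for node in nodes.get("items", []):
        status = node.get("status", {})
        capacity = status.get("capacity", {}).get("cpu")
        allocatable = status.get("allocatable", {}).get("cpu")
        if capacity is None or allocatable is None:
            continue  # incomplete status; nothing to compare
        # No gap between capacity and allocatable means nothing is held
        # back for the kubelet or system daemons.
        if cpu_millicores(allocatable) >= cpu_millicores(capacity):
            missing.append(node["metadata"]["name"])
    return missing
```

For example, a node reporting capacity cpu "4" and allocatable cpu "3920m" has 80 millicores reserved and is not flagged, while a node reporting "4" for both is.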

Tasks

  • Set up a test Kubernetes cluster on AWS EKS (machine spec TBD). Install GitLab Runner via Helm and try to reproduce the issue.
  • Determine if there are changes we can make in the Runner code to better handle this error. If yes, document the proposed implementation steps.

Edited by Darren Eastman