Investigate cause of Runner pods stuck in Pending or ContainerCreating state with "Failed create pod sandbox" error
Overview
There are various reasons why a Kubernetes pod can become stuck in the ContainerCreating state with the error Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod runner-xxxxxx-project-xx-concurrent-xxxxxx.
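To confirm the symptom, list the job pods and filter cluster events for sandbox failures. The gitlab-runner namespace below is an assumption; substitute the namespace the runner schedules job pods into.
# Look for job pods stuck in Pending or ContainerCreating
kubectl get pods -n gitlab-runner
# Recent pod-sandbox creation failures, newest last
kubectl get events -n gitlab-runner --field-selector reason=FailedCreatePodSandBox --sort-by=.lastTimestamp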
Some typical troubleshooting & resolution steps for this type of error are:
- Retrieve information about the pod.
kubectl describe pod example-runner-pod
- Check the logs on the failing pod.
kubectl logs example-runner-pod
- Check the node the pod is meant to be scheduled on.
kubectl describe node node-xxx
- Restart the failing node.
- Monitor node performance, specifically IOPS (throttled read/write operations); see the example commands after this list.
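A sketch of the node-side checks, where node-xxx is a placeholder and iostat assumes SSH access to the node with the sysstat package installed.
# List node conditions (look for Ready=False/Unknown, DiskPressure, PIDPressure)
kubectl get node node-xxx -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'
# On the node itself, watch extended I/O statistics; sustained high await/%util suggests disk throttling
iostat -x 5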
Solutions identified to date by GitLab Runner users:
- The GitLab Runner nodes were missing kube-reserved and system-reserved resource reservations.
- Build jobs using 100% CPU, resulting in the node going NotReady.
- High IOWAIT times on Azure burstable-class VMs while cloning a 2 GB Git repository.
- On public clouds, check the read/write IOPS performance of the nodes. If you notice disk throttling, change to a machine class with higher IOPS; see the verification commands after this list.
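One rough way to check whether a node is missing kube/system reservations: compare the node's capacity and allocatable resources, and inspect the running kubelet configuration. The node name is a placeholder, and the configz call assumes you have permission to proxy to the kubelet through the API server.
# If capacity and allocatable are identical, no kube-reserved/system-reserved is being applied
kubectl get node node-xxx -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'
# Dump the running kubelet configuration and look for kubeReserved / systemReserved
kubectl get --raw "/api/v1/nodes/node-xxx/proxy/configz"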
Tasks
- Set up a test k8s cluster on AWS EKS (machine spec TBD). Install GitLab Runner via Helm and try to reproduce the issue; a possible starting point is sketched after this list.
- Determine if there are changes we can make in the Runner code to better handle this error. If yes, document the proposed implementation steps.
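A rough sketch of the reproduction setup, assuming eksctl and the gitlab/gitlab-runner Helm chart. The cluster name, GitLab URL, and registration token are placeholders, and the node type is left open because the machine spec is still TBD.
# Create a small EKS test cluster; node type is a placeholder until the machine spec is decided
eksctl create cluster --name runner-sandbox-test --nodes 2 --node-type <node-type-tbd>
# Install GitLab Runner from the official Helm chart
helm repo add gitlab https://charts.gitlab.io
helm repo update
helm install gitlab-runner gitlab/gitlab-runner \
  --namespace gitlab-runner --create-namespace \
  --set gitlabUrl=https://gitlab.example.com/ \
  --set runnerRegistrationToken=<registration-token>
# Run CPU/IO-heavy CI jobs against the cluster and watch for sandbox failures
kubectl get events -n gitlab-runner --field-selector reason=FailedCreatePodSandBox -w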