GitLab Runner job pods resulting in zombie containerd-shim-runc-v2 processes
Issue type: Bug (the bug template is not showing up in the drop-down)
Runner version: 15.7.3
GitLab version: 15.8.2
Context
We use the GitLab Runner Kubernetes executor to schedule job pods on dedicated runner nodes, which use containerd and runc (no Docker) to create container processes. This is an EKS cluster running the standard AL2 AMIs with only a few customizations. This error did not appear before we migrated to EKS.
Problem
After running for a few days, we begin to see errors such as the following in various pipeline stages:
fatal: unable to create thread: Resource temporarily unavailable
Inspecting the runner nodes reveals a very large number of containerd -> containerd-shim-runc-v2 processes running:
ps aux | grep containerd-shim-runc-v2 | wc -l
2584
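One way to confirm the shims are orphaned is to compare the shim process count against the number of containers containerd actually knows about. The sketch below is a diagnostic we could run on a node; it assumes ctr is installed there and that containers live in the k8s.io namespace (the default for Kubernetes nodes). If ctr is unavailable the container count simply reads as 0.

```shell
#!/bin/sh
# Count containerd-shim processes by scanning /proc directly.
# Note: /proc/PID/comm is truncated to 15 characters, so
# "containerd-shim-runc-v2" appears as "containerd-shim".
shims=0
for d in /proc/[0-9]*; do
  read -r comm 2>/dev/null < "$d/comm" || continue  # process may have exited
  case "$comm" in
    containerd-shim*) shims=$((shims + 1)) ;;
  esac
done

# Ask containerd how many containers it is tracking (k8s.io namespace).
containers=$(ctr -n k8s.io containers list -q 2>/dev/null | wc -l | tr -d ' ')

echo "shim processes:   $shims"
echo "containers known: $containers"
```

A shim count far above the container count would support the orphaned-shim theory; on a healthy node the two numbers should stay close to each other.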
These processes are not associated with any running containers, and therefore appear to be zombie processes. When this carries on for long enough, we see complete exhaustion of all available PIDs on the nodes until they are manually restarted or containerd itself is restarted. We currently have a /proc/sys/kernel/pid_max value of ~32000, which we can increase as a stopgap. However, this seems like errant behavior. It may in fact be a containerd or containerd-shim-runc-v2 issue, but I would like to pick someone's brain for tips on how to ensure graceful shutdown of job pods and their containers.
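For reference, the pid_max increase we have in mind would look something like the sketch below. The value 4194304 is the kernel's documented maximum on 64-bit, not a figure tuned for this workload, so treat it as an assumption:

```shell
# Raise the PID ceiling immediately (takes effect at once, lost on reboot):
sudo sysctl -w kernel.pid_max=4194304

# Persist the setting across reboots via a sysctl drop-in:
echo 'kernel.pid_max = 4194304' | sudo tee /etc/sysctl.d/90-pid-max.conf
sudo sysctl --system
```

To be clear, this only delays exhaustion; the orphaned shims would still accumulate underneath it.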