
Slow-starting pods not cleared when using the k8s runner

I've deployed GitLab Runner on Kubernetes using the official Helm chart. To use cluster resources sparingly, the cluster autoscales, which can sometimes lead to long startup times for new pods. When a pod's startup exceeds poll_timeout, the build job fails, but the pod is left behind and never cleaned up, slowly eating all of the cluster's resources. The same behaviour can be observed when something else goes wrong in the cluster and build jobs fail: in that case, too, pods are left behind.
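
For reference, poll_timeout lives in the [runners.kubernetes] section of the runner's config.toml. A minimal excerpt follows; the value 600 is only illustrative, and raising it merely papers over slow scale-ups rather than fixing the leaked pods:

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Seconds to wait for the build pod to become ready before the
    # job is failed. The default is 180.
    poll_timeout = 600
    # Seconds between status checks while waiting.
    poll_interval = 3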

This is an example of the cluster state, with no builds running, after a slow scaling operation:

❯ kubectl get pods
NAME                                             READY     STATUS    RESTARTS   AGE
gitlab-runner-gitlab-runner-7bfb84dddd-6k5hf     1/1       Running   0          2h
minio-7947576b9-k6p9c                            1/1       Running   0          2h
runner-dfad5df5-project-3046-concurrent-0v9nvx   2/2       Running   0          2h
runner-dfad5df5-project-3046-concurrent-1whw8b   2/2       Running   0          2h
runner-dfad5df5-project-3046-concurrent-29g6w7   2/2       Running   0          2h
runner-dfad5df5-project-3046-concurrent-4gwm6x   2/2       Running   0          2h
runner-dfad5df5-project-3046-concurrent-5c9kb4   2/2       Running   0          2h
runner-dfad5df5-project-3046-concurrent-6b4b9z   2/2       Running   0          2h
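
As a stopgap I'm cleaning these up by hand. A sketch, assuming the build pods keep the runner- name prefix shown in the listing above and that no builds are currently running (this would kill active jobs otherwise):

❯ kubectl get pods --no-headers | awk '/^runner-/ {print $1}' | xargs -r kubectl delete pod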

It seems that GitLab needs to provide a better cleanup mechanism for failed/stalled/slow pods on k8s?
