16.3.0 Kubernetes runner pods not cleaned up

Summary

After upgrading our runner fleet to 16.3.0, some job pods are not cleaned up. At the moment we have pods that have been running for 13h or more:

$ kubectl get pods
NAME                                                       READY   STATUS             RESTARTS   AGE
k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc             1/1     Running            0          19h
runner-4bxsfxmcp-project-21912118-concurrent-0-wox1i11j    3/3     Running            0          99s
runner-4bxsfxmcp-project-21912118-concurrent-1-cee7m4bb    3/3     Running            0          98s
runner-4bxsfxmcp-project-21912118-concurrent-4-amau86na    3/3     Running            0          94s
runner-4bxsfxmcp-project-25032624-concurrent-10-tc15tf5w   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-15-op3fl9ab   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-18-94v22iwo   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-23-6km853mv   2/2     Running            0          13h
runner-4bxsfxmcp-project-25032624-concurrent-26-gmyh1de8   2/2     Running            0          13h
runner-4bxsfxmcp-project-25032624-concurrent-3-562pnu1n    2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-3-qtg6pei9    2/2     Running            0          13h

There are new errors:

$ kubectl logs k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc --timestamps | grep ERROR
2023-08-22T18:42:34.213602384Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924245918 project=25032624 runner=4bxsfxmcp
2023-08-22T18:42:39.263158588Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924246043 project=25032624 runner=4bxsfxmcp
2023-08-22T18:43:01.835877717Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924245994 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:18.283183523Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924273186 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:19.130244105Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924273187 project=25032624 runner=4bxsfxmcp
2023-08-22T18:48:46.253168351Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276632 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:03.378934359Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276609 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:06.252476067Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276625 project=25032624 runner=4bxsfxmcp
2023-08-22T18:52:44.943489138Z ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information  duration_s=364.631672351 job=4924275120 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:36.357452223Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462250 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:37.821586016Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462257 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:09.100581331Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462219 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:11.946244711Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462224 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:23.426345253Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462275 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:24.445577435Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462282 project=25032624 runner=4bxsfxmcp

Steps to reproduce

Run jobs on a Kubernetes runner at version 16.3.0.

Possible fixes

We will be rolling back to 16.2.1 to confirm that this is a new issue.

From what I can see, the "client rate limiter" error comes from the Kubernetes client library (client-go): https://github.com/kubernetes/client-go/blob/master/rest/request.go#L616 . As the error suggests, I'm assuming something has changed in 16.3.0 that causes requests from the runner to be rate limited.

Edited by Christopher van de Sande