16.3.0 Kubernetes runner pods not cleaned up
Summary
After upgrading our runner fleet to 16.3.0, some job pods are no longer cleaned up. At the moment we have pods that are 13h+ old:
$ kubectl get pods
NAME                                                      READY   STATUS    RESTARTS   AGE
k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc            1/1     Running   0          19h
runner-4bxsfxmcp-project-21912118-concurrent-0-wox1i11j   3/3     Running   0          99s
runner-4bxsfxmcp-project-21912118-concurrent-1-cee7m4bb   3/3     Running   0          98s
runner-4bxsfxmcp-project-21912118-concurrent-4-amau86na   3/3     Running   0          94s
runner-4bxsfxmcp-project-25032624-concurrent-10-tc15tf5w  2/2     Running   0          14h
runner-4bxsfxmcp-project-25032624-concurrent-15-op3fl9ab  2/2     Running   0          14h
runner-4bxsfxmcp-project-25032624-concurrent-18-94v22iwo  2/2     Running   0          14h
runner-4bxsfxmcp-project-25032624-concurrent-23-6km853mv  2/2     Running   0          13h
runner-4bxsfxmcp-project-25032624-concurrent-26-gmyh1de8  2/2     Running   0          13h
runner-4bxsfxmcp-project-25032624-concurrent-3-562pnu1n   2/2     Running   0          14h
runner-4bxsfxmcp-project-25032624-concurrent-3-qtg6pei9   2/2     Running   0          13h
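In case it helps others check whether they are affected, here is a rough sketch of how the leftover pods could be picked out of `kubectl get pods` output. The function names, the `runner-` name prefix, and the one-hour threshold are assumptions made for illustration, not anything the runner provides.

```python
import re

# Seconds per unit suffix used in the AGE column of `kubectl get pods`.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 365 * 86400}


def age_to_seconds(age: str) -> int:
    """Convert an AGE value such as '99s', '14h' or '2d3h' to seconds."""
    if not re.fullmatch(r"(?:\d+[smhdy])+", age):
        raise ValueError(f"unrecognised age: {age!r}")
    return sum(int(n) * UNIT_SECONDS[u] for n, u in re.findall(r"(\d+)([smhdy])", age))


def stale_job_pods(kubectl_output: str, max_age_s: int = 3600, prefix: str = "runner-") -> list[str]:
    """Return names of job pods (assumed to start with `prefix`) older than max_age_s."""
    stale = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Skip the header, blank lines, and pods that are not job pods
        # (for example the runner manager pod itself).
        if len(fields) < 5 or not fields[0].startswith(prefix):
            continue
        # AGE is the last column, even when RESTARTS contains extra words.
        if age_to_seconds(fields[-1]) > max_age_s:
            stale.append(fields[0])
    return stale
```

Feeding it the output above with the default threshold would return the seven `project-25032624` pods and skip the three pods that are only about 90 seconds old.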
There are new errors:
$ kubectl logs k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc --timestamps | grep ERROR
2023-08-22T18:42:34.213602384Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924245918 project=25032624 runner=4bxsfxmcp
2023-08-22T18:42:39.263158588Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924246043 project=25032624 runner=4bxsfxmcp
2023-08-22T18:43:01.835877717Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924245994 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:18.283183523Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924273186 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:19.130244105Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924273187 project=25032624 runner=4bxsfxmcp
2023-08-22T18:48:46.253168351Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924276632 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:03.378934359Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924276609 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:06.252476067Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924276625 project=25032624 runner=4bxsfxmcp
2023-08-22T18:52:44.943489138Z ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information duration_s=364.631672351 job=4924275120 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:36.357452223Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462250 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:37.821586016Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462257 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:09.100581331Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462219 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:11.946244711Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462224 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:23.426345253Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462275 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:24.445577435Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled job=4924462282 project=25032624 runner=4bxsfxmcp
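To match these errors against the leftover pods, the job and project IDs can be pulled out of the log lines. This is only a sketch: the regular expression is based on the lines pasted above and the helper names are made up for this example.

```python
import re
from collections import Counter

# Matches the cleanup failures shown above and captures job, project and runner.
CLEANUP_ERROR = re.compile(
    r"ERROR: Error cleaning up pod: .*?job=(\d+) project=(\d+) runner=(\S+)"
)


def failed_cleanups(log_text: str) -> list[tuple[str, str, str]]:
    """Return (job, project, runner) for every cleanup error in the log text."""
    return [m.groups() for m in CLEANUP_ERROR.finditer(log_text)]


def failures_per_project(log_text: str) -> Counter:
    """Count cleanup errors per project ID."""
    return Counter(project for _job, project, _runner in failed_cleanups(log_text))
```

Other errors, such as the "Job failed (system failure)" line, do not match the pattern and are ignored.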
Steps to reproduce
Run GitLab Runner 16.3.0 with the Kubernetes executor.
Possible fixes
We will roll back to 16.2.1 to confirm that this is a new issue in 16.3.0.
From what I can see, the "client rate limiter" error comes from the Kubernetes client library (client-go): https://github.com/kubernetes/client-go/blob/master/rest/request.go#L616. As the error suggests, I'm assuming something changed in 16.3.0 that causes requests from the runner to Kubernetes to be rate limited.
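One other possible reading of the message, offered as a guess and not a diagnosis: a limiter's wait step can fail with "context canceled" simply because the context it was given is already canceled, even when plenty of request budget is left. The sketch below is a simplified single-threaded model of that behaviour, not client-go's code. `TokenBucket`, `ContextCanceled` and `delete_pod` are hypothetical names, and a `threading.Event` stands in for a Go context.

```python
import threading
import time


class ContextCanceled(Exception):
    """Stand-in for Go's context.Canceled error."""


class TokenBucket:
    """Simplified, non-thread-safe token bucket used only for illustration."""

    def __init__(self, qps: float, burst: int):
        self.qps = qps
        self.burst = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def wait(self, canceled: threading.Event) -> None:
        # Cancellation is checked before looking at the budget, so an
        # already-canceled caller fails even if tokens are available.
        if canceled.is_set():
            raise ContextCanceled("context canceled")
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.qps)
        self.last = now
        if self.tokens < 1:
            delay = (1 - self.tokens) / self.qps
            # Event.wait returns True if the caller was canceled during the delay.
            if canceled.wait(delay):
                raise ContextCanceled("context canceled")
            self.tokens = 1.0
            self.last = time.monotonic()
        self.tokens -= 1


def delete_pod(limiter: TokenBucket, canceled: threading.Event, name: str) -> str:
    """Pretend to delete a pod; the request is only 'sent' if wait() succeeds."""
    try:
        limiter.wait(canceled)
    except ContextCanceled as err:
        return f"Error cleaning up pod: client rate limiter Wait returned an error: {err}"
    return f"deleted {name}"
```

In this model a cleanup call made with a canceled event never reaches the "delete" step, which would leave the pod behind. Whether 16.3.0 actually does this is something the rollback and a look at the cleanup code path would have to confirm.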
Edited by Christopher van de Sande