16.3.0 Kubernetes runner pods not cleaned up

Summary

After upgrading our runner fleet to 16.3.0, some pods are not cleaned up. At the moment we have pods that are 13h+ old:

$ kubectl get pods
NAME                                                       READY   STATUS             RESTARTS   AGE
k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc             1/1     Running            0          19h
runner-4bxsfxmcp-project-21912118-concurrent-0-wox1i11j    3/3     Running            0          99s
runner-4bxsfxmcp-project-21912118-concurrent-1-cee7m4bb    3/3     Running            0          98s
runner-4bxsfxmcp-project-21912118-concurrent-4-amau86na    3/3     Running            0          94s
runner-4bxsfxmcp-project-25032624-concurrent-10-tc15tf5w   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-15-op3fl9ab   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-18-94v22iwo   2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-23-6km853mv   2/2     Running            0          13h
runner-4bxsfxmcp-project-25032624-concurrent-26-gmyh1de8   2/2     Running            0          13h
runner-4bxsfxmcp-project-25032624-concurrent-3-562pnu1n    2/2     Running            0          14h
runner-4bxsfxmcp-project-25032624-concurrent-3-qtg6pei9    2/2     Running            0          13h

There are new errors:

$ kubectl logs k8s-small-amd64-gitlab-runner-8544c6cbd7-wtbsc --timestamps | grep ERROR
2023-08-22T18:42:34.213602384Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924245918 project=25032624 runner=4bxsfxmcp
2023-08-22T18:42:39.263158588Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924246043 project=25032624 runner=4bxsfxmcp
2023-08-22T18:43:01.835877717Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924245994 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:18.283183523Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924273186 project=25032624 runner=4bxsfxmcp
2023-08-22T18:46:19.130244105Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924273187 project=25032624 runner=4bxsfxmcp
2023-08-22T18:48:46.253168351Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276632 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:03.378934359Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276609 project=25032624 runner=4bxsfxmcp
2023-08-22T18:49:06.252476067Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924276625 project=25032624 runner=4bxsfxmcp
2023-08-22T18:52:44.943489138Z ERROR: Job failed (system failure): prepare environment: waiting for pod running: timed out waiting for pod to start. Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information  duration_s=364.631672351 job=4924275120 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:36.357452223Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462250 project=25032624 runner=4bxsfxmcp
2023-08-22T19:24:37.821586016Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462257 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:09.100581331Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462219 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:11.946244711Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462224 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:23.426345253Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462275 project=25032624 runner=4bxsfxmcp
2023-08-22T19:25:24.445577435Z ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled  job=4924462282 project=25032624 runner=4bxsfxmcp

Steps to reproduce

Run GitLab Runner 16.3.0 with the Kubernetes executor.

Possible fixes

We will roll back to 16.2.1 to confirm that this is a new issue.

From what I can see, the "client rate limiter" error comes from the Kubernetes client library (client-go): https://github.com/kubernetes/client-go/blob/master/rest/request.go#L616. As the error suggests, I'm assuming something has changed in 16.3.0 that's causing the runner's Kubernetes requests to be rate limited.

Edited by Christopher van de Sande