Kubernetes executor does not clean up related resources after pod creation failure (step 1)
Proposal
- This is step 1 to resolve the pod cleanup issues. Refer to the analysis here for more context.
- Use the garbage collection functionality of Kubernetes: set owner references that link every resource created for a job to a parent resource (most likely the pod).
- With that link in place, whenever the parent resource is deleted, Kubernetes deletes all of its dependent resources as well.
- This reduces the room for one-off errors, leaves a lot less work for the Runner, and makes the second step of my proposal easier to achieve reliably.
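To illustrate the linking described above, here is a minimal Python sketch that adds an `ownerReferences` entry to a dependent object's metadata. It only manipulates manifest dictionaries and is not the Runner's implementation. The helper name `with_owner_reference`, the object names, and the UID are hypothetical. Note that an owner reference needs the parent's UID, which the API server assigns only once the parent exists, and that a namespaced owner must be in the same namespace as its dependents.

```python
import copy
import json


def with_owner_reference(dependent, owner):
    """Return a copy of `dependent` whose metadata lists `owner` as an owner.

    Both arguments are plain manifest dictionaries. `owner` must already
    carry the UID assigned by the API server.
    """
    linked = copy.deepcopy(dependent)
    reference = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
    }
    refs = linked.setdefault("metadata", {}).setdefault("ownerReferences", [])
    # Avoid adding the same owner twice if the helper is called repeatedly.
    if not any(ref.get("uid") == reference["uid"] for ref in refs):
        refs.append(reference)
    return linked


# Hypothetical objects: the names and the UID are placeholders for illustration.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "runner-example-pod",
        "namespace": "runner-example-namespace",
        "uid": "00000000-0000-0000-0000-000000000000",
    },
}
secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {
        "name": "runner-example-secret",
        "namespace": "runner-example-namespace",
    },
}

linked_secret = with_owner_reference(secret, pod)

if __name__ == "__main__":
    print(json.dumps(linked_secret, indent=2))
```

Once the secret carries this reference, deleting the pod lets the Kubernetes garbage collector remove the secret without an extra delete call from the Runner.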
Summary
After the k8s executor fails to create a pod for a job, it does not appear to clean up the secrets it had already created for that job.
First we see something like this:
Running with gitlab-runner 11.9.2 (fa86510e)
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s
Using Kubernetes executor with image $CI_REGISTRY/path/to/image:latest ...
ERROR: Job failed (system failure): pods "runner-z36gywsv-project-7-concurrent-37498b9" is forbidden: exceeded quota: all-resources, requested: requests.memory=1536Mi, used: requests.memory=57232Mi, limited: requests.memory=56Gi
Later, presumably once the leftover secrets have used up the secrets quota, we see this error:
Running with gitlab-runner 11.9.2 (fa86510e)
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s
Using Kubernetes executor with image $CI_REGISTRY/path/to/image:latest ...
ERROR: Job failed (system failure): secrets "runner-z36gywsv-project-7-concurrent-4gzq9t" is forbidden: exceeded quota: all-resources, requested: secrets=1, used: secrets=100, limited: secrets=100
Steps to reproduce
It may be hard to reproduce this issue. One way may be to set a deliberately low resource quota on the namespace so that pod creation fails.
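As a rough sketch of that reproduction idea, the Python snippet below builds a `ResourceQuota` manifest that caps the number of pods in a namespace. The function name `low_quota_manifest`, the quota name `repro-low-quota`, the namespace `runner-quota-test`, and the limit values are all assumptions chosen for illustration; they are not taken from the affected cluster.

```python
import json


def low_quota_manifest(namespace, max_pods=0, max_secrets=100):
    """Build a ResourceQuota manifest that limits pods and secrets.

    With max_pods=0, any new pod in the namespace is rejected by the
    quota, which should make the executor's pod creation fail.
    """
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "repro-low-quota", "namespace": namespace},
        "spec": {
            "hard": {
                "pods": str(max_pods),
                "secrets": str(max_secrets),
            }
        },
    }


# Hypothetical test namespace; use a throwaway namespace, not a production one.
manifest = low_quota_manifest("runner-quota-test")

if __name__ == "__main__":
    # The JSON output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

After applying the quota and running a few jobs against that namespace, comparing the number of secrets before and after the failed jobs should show whether any were left behind.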
Environment description
k8s executor
Used GitLab Runner version
Running with gitlab-runner 11.9.2
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s
Edited by Darren Eastman