Kubernetes executor does not clean up related resources after pod creation failure (step 1)

Proposal

  • This is step 1 toward resolving the pod cleanup issues. Refer to the analysis here for more context.
  • Use the garbage collection functionality of Kubernetes: link every resource created for a job to a parent resource (most likely the Pod) through owner references.
  • With that link in place, deleting the parent resource causes Kubernetes to delete all dependent resources as well.
  • This reduces the room for one-off errors, leaves much less cleanup work for the Runner, and makes the second step of my proposal easier to achieve reliably.
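The linking described in the proposal can be sketched as follows. This is a minimal illustration, not the Runner's implementation: it only builds the manifest that would carry the link, using the standard `metadata.ownerReferences` field. The helper name `with_owner_reference`, the resource names, and the placeholder UID are assumptions made for the example.

```python
import copy


def with_owner_reference(manifest, owner_name, owner_uid):
    """Return a copy of `manifest` whose metadata names a Pod as its owner."""
    linked = copy.deepcopy(manifest)
    owner_ref = {
        "apiVersion": "v1",
        "kind": "Pod",
        "name": owner_name,
        "uid": owner_uid,  # assigned by the API server when the Pod is created
    }
    metadata = linked.setdefault("metadata", {})
    metadata.setdefault("ownerReferences", []).append(owner_ref)
    return linked


# Hypothetical values for illustration only; the UID is a placeholder.
secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {
        "name": "example-job-credentials",
        "namespace": "awe-gitlab-runner-k8s",
    },
    "type": "Opaque",
}
linked_secret = with_owner_reference(
    secret, "example-job-pod", "00000000-0000-0000-0000-000000000000"
)
```

One design point worth noting: the owner's UID exists only after the Pod has been created, so a secret created before the Pod would need the reference added afterwards. If Pod creation itself fails, as in this issue, there is no owner to link to, so that path would presumably still need explicit cleanup.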

Summary

After the Kubernetes executor fails to create a pod for a job, it does not appear to clean up the secrets it had already created for that job.

First we see something like this:

Running with gitlab-runner 11.9.2 (fa86510e)
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s
Using Kubernetes executor with image $CI_REGISTRY/path/to/image:latest ...
ERROR: Job failed (system failure): pods "runner-z36gywsv-project-7-concurrent-37498b9" is forbidden: exceeded quota: all-resources, requested: requests.memory=1536Mi, used: requests.memory=57232Mi, limited: requests.memory=56Gi

Then we later see this error, apparently once the leftover secrets have used up the secrets quota:

Running with gitlab-runner 11.9.2 (fa86510e)
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s
Using Kubernetes executor with image $CI_REGISTRY/path/to/image:latest ...
ERROR: Job failed (system failure): secrets "runner-z36gywsv-project-7-concurrent-4gzq9t" is forbidden: exceeded quota: all-resources, requested: secrets=1, used: secrets=100, limited: secrets=100
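To make the numbers in the two errors easier to read, here is a small sketch that parses the "exceeded quota" part of each message and computes by how much the request would overshoot the limit. The function names are made up for this example, and the quantity parser is a simplification that only handles plain integers and the `Ki`, `Mi`, and `Gi` suffixes seen here.

```python
import re

# Simplified: only the binary suffixes that appear in the messages above.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}

_PATTERN = re.compile(
    r"requested: (?P<resource>[\w.]+)=(?P<requested>\w+), "
    r"used: [\w.]+=(?P<used>\w+), "
    r"limited: [\w.]+=(?P<limit>\w+)"
)


def parse_quantity(text):
    """Convert a quantity such as '1536Mi' or '100' to an integer."""
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * factor
    return int(text)


def quota_overshoot(message):
    """Return (resource, used + requested - limit) for a quota error."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not an 'exceeded quota' message")
    requested = parse_quantity(match.group("requested"))
    used = parse_quantity(match.group("used"))
    limit = parse_quantity(match.group("limit"))
    return match.group("resource"), used + requested - limit


pod_error = (
    "exceeded quota: all-resources, requested: requests.memory=1536Mi, "
    "used: requests.memory=57232Mi, limited: requests.memory=56Gi"
)
secret_error = (
    "exceeded quota: all-resources, requested: secrets=1, "
    "used: secrets=100, limited: secrets=100"
)

print(quota_overshoot(pod_error))     # memory overshoot in bytes (1424Mi)
print(quota_overshoot(secret_error))  # one secret over the limit
```

In the first error, 57232Mi in use plus the 1536Mi requested comes to 58768Mi against a limit of 56Gi (57344Mi). In the second, the namespace already holds 100 secrets, so a single additional secret is rejected.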

Steps to reproduce

This issue may be hard to reproduce. You may be able to force it by setting a resource quota low enough that pod creation fails.
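One possible way to set such a quota is sketched below: the script builds a `ResourceQuota` manifest with a deliberately low memory limit and prints it as JSON. The quota name `repro-low-quota` and the limit values are assumptions chosen for illustration; the namespace is the one shown in the logs above.

```python
import json


def low_quota_manifest(namespace, hard_limits):
    """Build a ResourceQuota manifest with the given hard limits."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "repro-low-quota", "namespace": namespace},
        "spec": {"hard": dict(hard_limits)},
    }


# Assumed values: a memory limit low enough that any job pod is rejected,
# while secrets can still be created.
manifest = low_quota_manifest(
    "awe-gitlab-runner-k8s",
    {"requests.memory": "1Mi", "secrets": "100"},
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied to a test cluster with `kubectl apply -f`. Running a job afterwards should make pod creation fail, and listing the secrets in the namespace would then show whether any were left behind.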

Environment description

k8s executor

Used GitLab Runner version

Running with gitlab-runner 11.9.2
on awe-gitlab-runner-k8s z36gYwsv
Using Kubernetes namespace: awe-gitlab-runner-k8s 
Edited by Darren Eastman