
Do not propagate Build context to k8s executor cleanup method

Romuald Atchadé requested to merge k8s-context-propagation-issue into main

What does this MR do?

In !4125 (merged), we started propagating the build context throughout the whole k8s executor. This has the side effect that, when the job is cancelled or times out before the k8s resources are cleaned up, those resources stay on the cluster and the cleanup fails with the error below:

ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled

To prevent this from happening, a configurable timeout context (defaulting to 5 minutes) is used for the resource cleanup. This implementation is inspired by what is already done for the docker executor.
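
The core of the fix is to stop reusing the (possibly already cancelled) build context for cleanup and instead derive a fresh, time-bounded context. Below is a minimal Go sketch of that pattern using client-go; the package, function, and parameter names are illustrative assumptions and not the exact code of this MR:

package k8scleanup

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupPod deletes a build pod using a context that is independent of the
// build context, so a cancelled or timed-out job can no longer abort cleanup.
// The 5-minute default mirrors the MR description; everything else here is a
// simplified illustration.
func cleanupPod(client kubernetes.Interface, namespace, podName string, timeout time.Duration) error {
	if timeout <= 0 {
		timeout = 5 * time.Minute // default cleanup timeout from the MR description
	}

	// Derive a fresh context from Background rather than from the build
	// context, bounded only by the configurable cleanup timeout.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := client.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("error cleaning up pod: %w", err)
	}
	return nil
}

The same detached context can then be used for deleting the other per-build resources (services, secrets, config maps), so a cancelled or timed-out job can no longer interrupt their cleanup either.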

Why was this MR needed?

To make sure resources are actually cleaned up at the end of a job, whether it succeeds or fails.

What's the best way to test this MR?

config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0

listen_address = ':9252'

[session_server]
  session_timeout = 1800

[[runners]]
  name = ""
  url = "https://gitlab.com/"
  id = 0
  token = "__REDACTED__"
  token_obtained_at = "0001-01-01T00:00:00Z"
  token_expires_at = "0001-01-01T00:00:00Z"
  executor = "kubernetes"
  shell = "bash"
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "alpine"
    namespace = ""
    namespace_overwrite_allowed = ""
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    node_selector_overwrite_allowed = ".*"

    [runners.kubernetes.volumes]
    [[runners.kubernetes.services]]
      name = "alpine:latest"
      alias = "alpine-service"
      command = ["sleep 900s"]
      entrypoint = ["/bin/sh", "-c"]
      port = 8080
gitlab-ci.yml
job:
  timeout: 2m
  image: alpine
  script:
    - sleep 180

Run a job using the config.toml and the gitlab-ci.yml provided above.

On the main branch, after the job times out, the pod won't be cleaned up. On the MR branch, the pod is removed as expected.

In my test on the main branch, the pod name is runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o. As the output below shows, it outlives the 2-minute job timeout and is still running.

❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m17s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m20s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m23s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m24s

What are the relevant issue numbers?

Fixes #36803 (closed)

