
Do not propagate Build context to k8s executor cleanup method

Romuald Atchadé requested to merge k8s-context-propagation-issue into main

What does this MR do?

In !4125 (merged), we started propagating the build context throughout the whole k8s executor. This has the side effect that, when the job is cancelled or times out before the k8s resources are cleaned up, those resources stay on the cluster and the cleanup fails with the error below:

ERROR: Error cleaning up pod: client rate limiter Wait returned an error: context canceled

To prevent this from happening, a configurable timeout context (defaulting to 5 minutes) is used for the resource cleanup. This implementation is inspired by what is already done for the docker executor.
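
The core of the fix is to stop reusing the (possibly already cancelled) build context for cleanup and instead derive a fresh, time-bounded context. Below is a minimal Go sketch of that pattern using client-go; the package, function, and parameter names are illustrative assumptions and not the exact code of this MR:

package k8scleanup

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupPod deletes a build pod using a context that is independent of the
// build context, so a cancelled or timed-out job can no longer abort cleanup.
// The 5-minute default mirrors the MR description; everything else here is a
// simplified illustration.
func cleanupPod(client kubernetes.Interface, namespace, podName string, timeout time.Duration) error {
	if timeout <= 0 {
		timeout = 5 * time.Minute // default cleanup timeout from the MR description
	}

	// Derive a fresh context from Background rather than from the build
	// context, bounded only by the configurable cleanup timeout.
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	if err := client.CoreV1().Pods(namespace).Delete(ctx, podName, metav1.DeleteOptions{}); err != nil {
		return fmt.Errorf("error cleaning up pod: %w", err)
	}
	return nil
}

The same detached context can then be used for deleting the other per-build resources (services, secrets, config maps), so a cancelled or timed-out job can no longer interrupt their cleanup either.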

Why was this MR needed?

To make sure resources are actually cleaned up at the end of a job, whether it succeeds or fails.

What's the best way to test this MR?

config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0

listen_address = ':9252'

[session_server]
  session_timeout = 1800

[[runners]]
  name = ""
  url = "https://gitlab.com/"
  id = 0
  token = "__REDACTED__"
  token_obtained_at = "0001-01-01T00:00:00Z"
  token_expires_at = "0001-01-01T00:00:00Z"
  executor = "kubernetes"
  shell = "bash"
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "alpine"
    namespace = ""
    namespace_overwrite_allowed = ""
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    node_selector_overwrite_allowed = ".*"

    [runners.kubernetes.volumes]
    [[runners.kubernetes.services]]
      name = "alpine:latest"
      alias = "alpine-service"
      command = ["sleep 900s"]
      entrypoint = ["/bin/sh", "-c"]
      port = 8080
gitlab-ci.yml
job:
  timeout: 2m
  image: alpine
  script:
    - sleep 180

Run a job using the config.toml and the gitlab-ci.yml provided above.

On the main branch, after the job times out, the pod won't be cleaned up. On the MR branch, the pod is removed as expected.

In my test on the main branch, the pod name is runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o. As the output below shows, it outlives the 2-minute job timeout and is still running.

❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m17s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m20s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m23s
❯ kubectl get pod runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o
NAME                                                     READY   STATUS    RESTARTS   AGE
runner-dzfsjrxx-project-25452826-concurrent-0-dlbxp15o   3/3     Running   0          2m24s

What are the relevant issue numbers?

Fixes #36803 (closed)

