Add warning events on failure with k8s executor

Romuald Atchadé requested to merge k8s-pod-event-on-failure into main

What does this MR do?

When GitLab Runner uses the Kubernetes executor with CPU, RAM, or ephemeral storage requests and limits, Kubernetes terminates the associated Pod if a job exceeds any of the defined limits. However, the job log does not surface the reason for the failure, so the user is left unaware that the job failed because it exceeded these limits.

To address this issue, the warning events related to the failed pod are retrieved and logged as warnings.

Why was this MR needed?

To improve the troubleshooting experience, we can surface the latest event with the highest severity related to the failed pod. To do so, we can retrieve the event after the pod failure and present it as a warning log.

What's the best way to test this MR?

.gitlab-ci.yml
job:
  image: alpine333
  script:
    - sleep 120

config.toml
concurrent = 1
check_interval = 1
shutdown_timeout = 0

listen_address = ':9252'

[session_server]
  session_timeout = 1800

[[runners]]
  name = ""
  url = "https://gitlab.com/"
  id = 0
  token = "__REDACTED__"
  token_obtained_at = 0001-01-01T00:00:00Z
  token_expires_at = 0001-01-01T00:00:00Z
  executor = "kubernetes"
  shell = "bash"

  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "alpine"
    namespace = ""
    namespace_overwrite_allowed = ""
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""

    cpu_limit = "50m"
    memory_limit = "64Mi"
    service_cpu_limit = "50m"
    service_memory_limit = "64Mi"
    helper_cpu_limit = "50m"
    helper_memory_limit = "64Mi"

Try to run the job. It fails, and the following cluster events are logged:

WARNING: Event retrieved from the cluster: Failed to pull image "alpine333": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/alpine333:latest": failed to resolve reference "docker.io/library/alpine333:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
WARNING: Event retrieved from the cluster: Error: ErrImagePull
WARNING: Event retrieved from the cluster: Error: ImagePullBackOff

What are the relevant issue numbers?

Closes #31052 (closed)

Edited by Romuald Atchadé