Add warning events on failure with k8s executor
What does this MR do?
When GitLab Runner uses the Kubernetes executor with CPU, memory, or ephemeral storage requests and limits set, Kubernetes terminates the associated Pod if a job exceeds any of the defined limits. The job log does not surface the reason for the failure, so the user has no indication that the job failed because a limit was exceeded.
To address this issue, the warning events related to the failed pod are retrieved and logged as warnings.
Why was this MR needed?
To improve the troubleshooting experience, we can surface the latest event with the highest severity related to the failed pod. To do so, we retrieve the event after the pod failure and present it as a warning log.
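The idea can be illustrated with a minimal Go sketch. It is not the code from this MR: the `Event` struct is a hypothetical, trimmed-down stand-in for a Kubernetes event, `WarningLines` is an illustrative helper name, and the timestamps are made up. It assumes events carry a type of "Normal" or "Warning", keeps only the warnings, orders them oldest first, and formats them like the log lines shown in the test output in this description.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Event is a hypothetical, simplified stand-in for a Kubernetes event.
// Only the fields needed for this sketch are modeled.
type Event struct {
	Type          string // "Normal" or "Warning"
	Message       string
	LastTimestamp time.Time
}

// WarningLines keeps only Warning events, orders them oldest first,
// and returns one job-log line per event.
func WarningLines(events []Event) []string {
	warnings := make([]Event, 0, len(events))
	for _, e := range events {
		if e.Type == "Warning" {
			warnings = append(warnings, e)
		}
	}

	// A stable sort keeps the original order for events with equal timestamps.
	sort.SliceStable(warnings, func(i, j int) bool {
		return warnings[i].LastTimestamp.Before(warnings[j].LastTimestamp)
	})

	lines := make([]string, 0, len(warnings))
	for _, e := range warnings {
		lines = append(lines, fmt.Sprintf("WARNING: Event retrieved from the cluster: %s", e.Message))
	}
	return lines
}

func main() {
	// Made-up timestamps, for illustration only.
	base := time.Date(2023, 1, 1, 12, 0, 0, 0, time.UTC)
	events := []Event{
		{Type: "Normal", Message: "Pulling image \"alpine333\"", LastTimestamp: base},
		{Type: "Warning", Message: "Error: ImagePullBackOff", LastTimestamp: base.Add(2 * time.Second)},
		{Type: "Warning", Message: "Error: ErrImagePull", LastTimestamp: base.Add(time.Second)},
	}
	for _, line := range WarningLines(events) {
		fmt.Println(line)
	}
}
```

In a real runner the events would come from the cluster API, filtered to the failed pod, rather than from an in-memory slice. The filtering and formatting step would look much the same.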
What's the best way to test this MR?
gitlab-ci
job:
  image: alpine333
  script:
    - sleep 120
config.toml
concurrent = 1
check_interval = 1
shutdown_timeout = 0
listen_address = ':9252'
[session_server]
session_timeout = 1800
[[runners]]
name = ""
url = "https://gitlab.com/"
id = 0
token = "__REDACTED__"
token_obtained_at = 0001-01-01T00:00:00Z
token_expires_at = 0001-01-01T00:00:00Z
executor = "kubernetes"
shell = "bash"
[runners.kubernetes]
host = ""
bearer_token_overwrite_allowed = false
image = "alpine"
namespace = ""
namespace_overwrite_allowed = ""
pod_labels_overwrite_allowed = ""
service_account_overwrite_allowed = ""
pod_annotations_overwrite_allowed = ""
cpu_limit = "50m"
memory_limit = "64Mi"
service_cpu_limit = "50m"
service_memory_limit = "64Mi"
helper_cpu_limit = "50m"
helper_memory_limit = "64Mi"
Run the job. It fails because the image "alpine333" cannot be pulled, and the following cluster events are logged:
WARNING: Event retrieved from the cluster: Failed to pull image "alpine333": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/alpine333:latest": failed to resolve reference "docker.io/library/alpine333:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
WARNING: Event retrieved from the cluster: Error: ErrImagePull
WARNING: Event retrieved from the cluster: Error: ImagePullBackOff
What are the relevant issue numbers?
Closes #31052 (closed)