Add Kubernetes Pod Failure Event Description to Runner
Description
When GitLab Runner uses the Kubernetes executor with CPU, RAM, or ephemeral storage requests and limits set, Kubernetes terminates the Pod of any job that exceeds one of those limits. The job log does not surface the reason for the failure, so the user has no indication that the job failed because it exceeded a limit.
To address this, the job log output should include clear, informative messages that explain why the job failed.
Proposal
To improve the troubleshooting experience, we can surface the latest event with the highest severity related to the failed pod. To do so, we can retrieve the event after the pod failure and present it as an error log. This would provide valuable information about the cause of the failure.
For the first iteration, this feature can be delivered behind a feature flag.
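The selection rule proposed above (the latest event among those with the highest severity) could be sketched roughly as follows. This is only an illustration and not the Runner's implementation: `PodEvent`, `SEVERITY`, `most_relevant_event`, and `format_for_job_log` are hypothetical names, the event is reduced to three fields, and the timestamps are invented. The sketch assumes that Kubernetes event types are `Normal` and `Warning`, and it borrows the log prefix and messages from the example output in this issue.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Assumed ranking: Kubernetes events have type "Normal" or "Warning".
# Unknown types are treated like "Normal".
SEVERITY = {"Normal": 0, "Warning": 1}


@dataclass(frozen=True)
class PodEvent:
    """Hypothetical, simplified stand-in for a Kubernetes Event."""
    type: str
    message: str
    last_timestamp: datetime


def most_relevant_event(events: Iterable[PodEvent]) -> Optional[PodEvent]:
    """Return the most recent event among those with the highest severity."""
    return max(
        events,
        key=lambda e: (SEVERITY.get(e.type, 0), e.last_timestamp),
        default=None,  # no events left on the cluster
    )


def format_for_job_log(event: PodEvent) -> str:
    """Render the event using the prefix shown in the example log."""
    return f"WARNING: Event retrieved from the cluster: {event.message}"


# Illustrative data only: timestamps are made up for this sketch.
t0 = datetime(2023, 1, 1, 12, 0, 0)
events = [
    PodEvent("Warning", "Error: ErrImagePull", t0 + timedelta(seconds=10)),
    PodEvent("Warning", "Error: ImagePullBackOff", t0 + timedelta(seconds=20)),
    PodEvent("Normal", "Pulling image", t0 + timedelta(seconds=30)),
]

chosen = most_relevant_event(events)
if chosen is not None:
    print(format_for_job_log(chosen))
```

Ranking by severity first and timestamp second means a later `Normal` event never hides an earlier `Warning`. Returning `None` for an empty list covers the case where the events have already expired from the cluster.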
Notes
When jobs run on EKS, Kubernetes events are not kept on the cluster for long.
Example log output with new warning events logging
In the example below, lines 20 to 23 (the "WARNING: Event retrieved from the cluster" entries) are logged only if warning events are found on the cluster for the worker Pod.
Running with gitlab-runner development version (HEAD)
on DzfSJrxx, system ID: s_b1aacad1f7fa
feature flags: FF_KUBERNETES_HONOR_ENTRYPOINT:true, FF_USE_ADVANCED_POD_SPEC_CONFIGURATION:true
Preparing the "kubernetes" executor
00:00
WARNING: Namespace is empty, therefore assuming 'default'.
Using Kubernetes namespace: default
Using Kubernetes executor with image alpine333 ...
Using attach strategy to execute scripts...
Preparing environment
00:07
WARNING: Advanced Pod Spec configuration enabled, merging the provided PodSpec to the generated one. This is an alpha feature and is subject to change. Feedback is collected in this issue: https://gitlab.com/gitlab-org/gitlab-runner/-/issues/29659 ...
Waiting for pod default/runner-dzfsjrxx-project-25452826-concurrent-0-w2vtpqe5 to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper svc-0]"
ContainersNotReady: "containers with unready status: [build helper svc-0]"
Waiting for pod default/runner-dzfsjrxx-project-25452826-concurrent-0-w2vtpqe5 to be running, status is Pending
ContainersNotInitialized: "containers with incomplete status: [init-permissions]"
ContainersNotReady: "containers with unready status: [build helper svc-0]"
ContainersNotReady: "containers with unready status: [build helper svc-0]"
WARNING: Event retrieved from the cluster: Failed to pull image "alpine333": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/alpine333:latest": failed to resolve reference "docker.io/library/alpine333:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
WARNING: Event retrieved from the cluster: Error: ErrImagePull
WARNING: Event retrieved from the cluster: Error: ImagePullBackOff
WARNING: Failed to pull image with policy "": image pull failed: Back-off pulling image "alpine333"
ERROR: Job failed: prepare environment: waiting for pod running: pulling image "alpine333": image pull failed: Back-off pulling image "alpine333". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information