Interpret failed pods as system failures rather than script failures for Kubernetes executor (!4698) · Merge requests · GitLab.org / gitlab-runner · GitLab

Daniel Barnes requested to merge dbarnes3/gitlab-runner:kube-failure-system-failure into main Mar 27, 2024

What does this MR do?

This merge request will ensure pods that are marked as "Failed" are reported as system failures rather than script failures.

Why was this MR needed?

We are encountering issues where pods are pre-empted or their nodes are shut down, and their failures are being interpreted as script failures.

As such, we are having trouble putting in a retry policy that does not also retry jobs that failed for genuine reasons.

What's the best way to test this MR?

I am planning on running this on our infrastructure, and evicting the node that the pod is running on to force a "pod failed" status.

What are the relevant issue numbers?

I did not see any issue numbers that were relevant.