Draft: Improve Kubernetes executor's pod ready detection

Arran Walker requested to merge ajwalker/fix-kubernetes-terminated-container into main

What does this MR do?

Improves pod ready detection by handling cases where the Pod is reported as "ready" but actually has unready or terminated containers.

The reason the pod failed is now reported, rather than being silently ignored.
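The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the types are simplified stand-ins for the container status fields exposed by the Kubernetes API, and the function name `podReady`, the error format, and the example reason/message strings are assumptions made for the example.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the Kubernetes API's container
// status types. The real API nests these under a ContainerState struct.
type ContainerStateTerminated struct {
	ExitCode int32
	Reason   string
	Message  string
}

type ContainerStatus struct {
	Name       string
	Ready      bool
	Terminated *ContainerStateTerminated // nil unless the container has terminated
}

// podReady reports whether the pod can be used for a job.
//
// A terminated container is treated as a terminal failure and its reason is
// returned as an error instead of being dropped. A container that is merely
// not ready yet yields (false, nil) so the caller can keep polling.
func podReady(podReportedReady bool, statuses []ContainerStatus) (bool, error) {
	allReady := true
	for _, s := range statuses {
		if s.Terminated != nil {
			return false, fmt.Errorf(
				"container %q terminated (exit code %d): %s: %s",
				s.Name, s.Terminated.ExitCode, s.Terminated.Reason, s.Terminated.Message,
			)
		}
		if !s.Ready {
			allReady = false
		}
	}
	// Only trust the pod-level signal when every container agrees with it.
	return podReportedReady && allReady, nil
}
```

Returning an error only for terminated containers keeps the "still starting" case retryable, while a container that can never start fails the job immediately with the underlying reason attached.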

Why was this MR needed?

Fixes an issue where a pod is advertised as ready even though the build container failed to start or was terminated. I think there are a few cases where this can happen, but it occurs easily on Windows if you specify "pwsh" as the shell and use a job image that doesn't contain it.

What's the best way to test this MR?

On a Kubernetes cluster with Windows nodes, specify pwsh as the shell, but use a nanoserver image for the job (which doesn't include pwsh).

Before this MR, the error response is: ERROR: Job failed (system failure): prepare environment: unable to upgrade connection: 404 request not found.

After this MR, the error response is still rather cryptic, but it is now the response containerd/docker returns when you try to start a container with an entrypoint that doesn't exist.

What are the relevant issue numbers?

Closes #29103
