
Report deleted pods as a system failure with attach strategy

Georgi N. Georgiev requested to merge report_deleted_pod_as_system_failure into master

What does this MR do?

With the attach strategy, the runner monitors the pod's status for the whole duration of the job, so a pod that is deleted mid-job can be correctly reported as a system failure. This makes it possible to retry the job, for example when running on spot instances.

For context: #26856 (comment 410583524)

Why was this MR needed?

Without this change, a deleted pod was reported as a script failure, which is incorrect.

What's the best way to test this MR?

Automated

Run the integration test:

go test -v -run 'TestDeletedPodSystemFailureDuringExecution' ./executors/kubernetes

Or manually

Start a long-running job, e.g.:

sleep:
    script:
      - sleep 5000
    tags:
      - k8s

Get the pod name from the job log and delete the pod:

kubectl delete pod runner-l8gav8fn-project-15339497-concurrent-0jp6jq

The job should then fail and report a system failure:

ERROR: Job failed (system failure): pods "runner-l8gav8fn-project-15339497-concurrent-0jp6jq" not found

What are the relevant issue numbers?

Closes #26856 (closed)

Edited by Georgi N. Georgiev
