Report deleted pods as a system failure with attach strategy
What does this MR do?
With the attach strategy we monitor the status of the pods at all times, so when a pod is deleted we can properly mark the job as a system failure. This allows the job to be retried, e.g. when using spot instances.
For context: #26856 (comment 410583524)
Why was this MR needed?
Otherwise a deleted pod is reported as a script failure, which is incorrect.
What's the best way to test this MR?
Automated
Run the integration tests:
go test -v -run 'TestDeletedPodSystemFailureDuringExecution' ./executors/kubernetes
Or manually
Start a long-running job, e.g.:
sleep:
  script:
    - sleep 5000
  tags:
    - k8s
Get the pod name from the job logs and delete the pod:
kubectl delete pod runner-l8gav8fn-project-15339497-concurrent-0jp6jq
The job should report it as a system failure:
ERROR: Job failed (system failure): pods "runner-l8gav8fn-project-15339497-concurrent-0jp6jq" not found
What are the relevant issue numbers?
Closes #26856 (closed)
Edited by Georgi N. Georgiev