Report deleted pods as a system failure with attach strategy

What does this MR do?

With the attach strategy we monitor the status of the pod at all times, so when the pod is deleted mid-job we can properly mark the job as a system failure instead of a script failure. This allows the job to be retried, e.g. when running on spot instances.

For context: #26856 (comment 410583524)

Why was this MR needed?

Without this change, a deleted pod was reported as a script failure, which is incorrect: the job's script did not fail, the infrastructure did.

What's the best way to test this MR?

Automated

Run the integration tests:

go test -v -run 'TestDeletedPodSystemFailureDuringExecution' ./executors/kubernetes

Or manually

Start a long-running job, e.g.:

sleep:
    script:
      - sleep 5000
    tags:
      - k8s

Get the pod from the job logs and delete it:

kubectl delete pod runner-l8gav8fn-project-15339497-concurrent-0jp6jq

The job should report it as a system failure:

ERROR: Job failed (system failure): pods "runner-l8gav8fn-project-15339497-concurrent-0jp6jq" not found

What are the relevant issue numbers?

Closes #26856 (closed)

Edited by Georgi N. Georgiev | GitLab
