Make it configurable when to retry a job (e.g. only on system failure)
Description
Sometimes our builds fail because of a system runner failure. Even though the Docker connection is much more stable now (#2408 (closed)), the setup still occasionally fails and we get a "system runner failure".
It would be nice to have an automatic way to retry such failures, but only those failures.
I know I can set up a general retry, but I explicitly only want system failures to be retried.
The benefit of retrying only system failures is this: we have a policy that specs should always work reliably. If a spec fails only some of the time, then it is a failure, and automatically retrying it until it passes is no solution.
But if it is a system runner failure, 99.9% of the time I hit the "retry" button and it works, because the cause is something temporary such as:
Job failed (system failure): Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2
I want to reduce the noise for our developers, so that when they see a "job failed" message they can (mostly) be sure it is because of a spec, not because of a system failure.
Proposal
Make it possible to configure retry. If there are only two different failure types (system and script), one could add an optional when key to retry:
retry:
  count: 2
  when:
    - system_failure
    - script_failure
Passing only a number to retry should still be allowed, so the old behaviour stays (always retry). This means the above is similar to:
retry: 2
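For the use case described in this issue (retry only when the runner itself fails), the configuration might look like the sketch below. It assumes the when key and the system_failure value are adopted exactly as proposed above; neither is part of the current syntax. The job name and script are hypothetical placeholders.

```yaml
# Hypothetical job; the name and the script line are placeholders.
rspec:
  script:
    - bundle exec rspec
  retry:
    count: 2              # proposed: maximum number of automatic retries
    when:                 # proposed key, not part of the current syntax
      - system_failure    # retry only runner/infrastructure failures
```

With this, a failing spec (a script failure) would still fail the job on the first attempt, while a temporary runner problem would be retried up to two times.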