Allow to configure when to retry builds
What does this MR do?
Sometimes our build fails because of a system runner failure. Even though the Docker connection is much more stable now (#2408), the setup still sometimes fails and we get a "system runner failure".
It would be nice to have an automatic way to retry such failures, but only those failures.
The benefit of retrying only system failures: we have a policy that specs should always work reliably. If a spec fails only some of the time, that is a failure, and automatically retrying it until it passes is no solution.
But if it is a system runner failure, 99.9 % of the time I hit the "retry" button and it works, because it is something temporary like ...
Job failed (system failure): Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2
I want to remove this noise for our developers, so that when they see a "job failed" message they can (mostly) be sure it is because of a spec, not because of a system failure.
retry can now be defined as a hash, where max is the new way to define the maximum number of retries:

    retry: 2
    # becomes
    retry:
      max: 2

retry: 2 still works to stay backwards compatible.
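One way to picture the backwards compatibility is a small normalization step that maps both forms onto the hash form. The Ruby sketch below is illustrative only: normalize_retry is a hypothetical helper, not the code in this MR, and it assumes symbol keys and a default of 0 retries when max is omitted.

```ruby
# Hypothetical helper (not taken from the MR): maps both accepted shapes of
# `retry` onto the hash form so later code only has to deal with one shape.
def normalize_retry(value)
  case value
  when Integer
    # Legacy form, e.g. `retry: 2`.
    { max: value }
  when Hash
    # New form, e.g. `retry: { max: 2, when: ... }`.
    # Assumption: a missing `max` means 0 retries.
    { max: 0 }.merge(value)
  else
    raise ArgumentError, "retry must be an integer or a hash, got #{value.inspect}"
  end
end
```

With this sketch, normalize_retry(2) and normalize_retry({ max: 2 }) produce the same hash, which is the sense in which the integer form keeps working.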
A new key, when, can define when to retry:

    retry:
      max: 2
      when: runner_system_failure

    retry:
      max: 2
      when:
        - runner_system_failure
        - api_failure

- always: same as before and the default; retry in every case
- an array of failure_reasons: retry only if the job failed for one of those failure reasons
failure_reasons are the keys of CI::Build.failure_reasons, currently:
Retry always, maximum 5 times

    retry: 5
    # or
    retry:
      max: 5
    # or
    retry:
      max: 5
      when: always

Retry once in case of api failure or system runner failure, but not on any other failure

    retry:
      max: 1
      when:
        - api_failure
        - runner_system_failure

Only retry in case of runner system failure

    retry:
      max: 1
      when:
        - runner_system_failure
    # or
    retry:
      max: 1
      when: runner_system_failure

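The three examples can be traced with a small decision function. This is a rough Ruby sketch under stated assumptions, not the implementation in this MR: should_retry? and its parameters are hypothetical names, the config is assumed to already be a hash with symbol keys, a missing when is treated as always, and "some_other_reason" is a made-up placeholder for any failure reason not listed in when.

```ruby
# Hypothetical sketch (names are illustrative, not from the MR).
# config:          hash such as { max: 1, when: ["api_failure"] }
# failure_reason:  string or symbol naming why the job failed
# attempts_so_far: number of retries already performed for this job
def should_retry?(config, failure_reason, attempts_so_far)
  max = config.fetch(:max, 0)

  # `when` may be a single value or an array; Array() handles both.
  # Assumption: a missing `when` behaves like "always".
  reasons = Array(config.fetch(:when, "always")).map(&:to_s)

  # Never exceed the configured maximum number of retries.
  return false if attempts_so_far >= max

  reasons.include?("always") || reasons.include?(failure_reason.to_s)
end

# Usage, mirroring the second example above:
config = { max: 1, when: ["api_failure", "runner_system_failure"] }
should_retry?(config, "api_failure", 0)       # retried
should_retry?(config, "some_other_reason", 0) # not retried
```

Keeping the max check separate from the reason check mirrors the configuration: max limits how often a job is retried, while when limits which failures qualify at all.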
Current State & Questions (last updated 12.10.2018 16:45 CEST / 14:45 GMT)
Tested in production on my own GitLab 11.3.4 instance; the feature seems to work as it should.
- Is it OK to expose the internal failure_reason values in the retry config options to the end user?
- How to write the documentation? Integer style deprecated? See https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_102741035
- Written "side by side" currently (see https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_103141963)
- More tests needed?
- Squash/cleanup commits?