Allow to configure when to retry builds
What does this MR do?
Sometimes our build fails because of a system runner failure. Even though the Docker connection is much more stable now (#2408 (closed)), the setup sometimes still fails and we get a "system runner failure".
It would be nice to have an automatic way to retry such failures, but only those failures.
The benefit of retrying only system failures is this: we have a policy that specs should always work reliably. If a spec fails only some of the time, it is a real failure, and automatically retrying it until it passes is no solution.
But if it is a system runner failure, 99.9% of the time I hit the "retry" button and it works, because the cause is something temporary like ...
Job failed (system failure): Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2
I want to reduce the noise for our developers, so that when they see a "job failed" message they can (mostly) be sure it is because of a spec, not because of a system failure.
It is now possible to define retry as a hash, in which max is the new way to define the maximum number of retries:
retry: 2
# becomes
retry:
  max: 2
retry: 2 still works, to stay backwards compatible.
A new key, when, defines when to retry:
retry:
  max: 2
  when: runner_system_failure

retry:
  max: 2
  when:
    - runner_system_failure
    - api_failure
Possible values for when:
- always: same as before and the default; retry in every case
- an array of failure_reasons: retry only if the failure matches one of the listed reasons
The failure_reasons are the keys of CI::Build.failure_reasons, currently:
- unknown_failure
- script_failure
- api_failure
- stuck_or_timeout_failure
- runner_system_failure
- missing_dependency_failure
- runner_unsupported
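For context, here is a minimal sketch of where the hash form might sit inside a complete job definition in .gitlab-ci.yml. The job name, stage, and script are hypothetical placeholders; only the retry block reflects the syntax proposed in this MR.

```yaml
# Hypothetical job: the name "rspec", the stage, and the script are
# placeholders chosen for illustration only.
rspec:
  stage: test
  script:
    - bundle exec rspec
  # Proposed hash form: retry at most twice, and only for the listed
  # failure reasons. A plain script failure is not retried.
  retry:
    max: 2
    when:
      - runner_system_failure
      - api_failure
```

With a configuration like this, a failing spec would still show up as a failed job on the first attempt, while a temporary runner problem would be retried automatically.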
Examples
Retry always, maximum 5 times
retry: 5
# or
retry:
  max: 5
# or
retry:
  max: 5
  when: always
Retry once in case of api failure or system runner failure, but not on any other failure
retry:
  max: 1
  when:
    - api_failure
    - runner_system_failure
Only retry in case of runner system failure
retry:
  max: 1
  when:
    - runner_system_failure
# or
retry:
  max: 1
  when: runner_system_failure
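If the same policy should apply to several jobs, it could probably be shared with a YAML anchor rather than repeated. The following is only a sketch: the hidden key .retry_on_system_failure, the anchor name, and both job definitions are hypothetical, and it assumes the hash form is accepted wherever the integer form is accepted today.

```yaml
# Hypothetical hidden key holding the shared retry policy. Keys that
# start with a dot are not run as jobs by GitLab CI.
.retry_on_system_failure: &retry_on_system_failure
  max: 1
  when: runner_system_failure

# Hypothetical jobs; names and scripts are placeholders.
rspec:
  script:
    - bundle exec rspec
  retry: *retry_on_system_failure

lint:
  script:
    - bundle exec rubocop
  retry: *retry_on_system_failure
```

This keeps the "retry only on system failures" rule in one place, so changing the policy later touches a single block.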
Current State & Questions (last updated 12.10.2018 16:45 CEST / 14:45 GMT)
State
Tested in production on my own GitLab 11.3.4 instance; the feature seems to work as it should.
Questions
- Is it OK to expose the internal failure_reason values in the retry config options to the end user?
- How to write the documentation? Integer style deprecated? See https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_102741035
  - Written "side by side" currently (see https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_103141963)
- More tests needed?
- Squash/cleanup commits?
- Refactoring?
What are the relevant issue numbers?
- closes gitlab-runner#3515 (closed)
- gitlab-org/gitlab-ce#49634
- gitlab-com/support-forum#3710
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- Documentation created/updated
- Tests added for this feature/bug
- Conforms to the code review guidelines
- Conforms to the merge request performance guidelines