Allow to configure when to retry builds (!21758)

What does this MR do?

Sometimes our build fails because of a system runner failure. Even though the Docker connection is much more stable now (#2408 (closed)), the setup still fails occasionally and we get a "system runner failure".

It would be nice to have an automatic way to retry such failures, but only those failures.

The reason to retry only system failures: we have a policy that specs must always work reliably. If a spec fails only some of the time, it is a real failure, and automatically retrying it until it passes is no solution.

But when it is a system runner failure, 99.9% of the time I hit the "retry" button and it works, because the cause is something temporary, such as:

 Job failed (system failure): Error response from daemon: error creating overlay mount to /var/lib/docker/overlay2

I want to reduce this noise for our developers, so that when they see a "job failed" message they can (mostly) be sure it is caused by a spec, not by a system failure.

retry can now be defined as a hash. In that form, max is the new way to define the maximum number of retries:

retry: 2

# becomes

retry:
  max: 2

The integer form retry: 2 still works, for backward compatibility.

A new key, when, defines which failures trigger a retry. It accepts a single failure reason or an array of them:

retry:
  max: 2
  when: runner_system_failure

# or

retry:
  max: 2
  when:
    - runner_system_failure
    - api_failure
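
The accepted shapes of retry (integer or hash) and of when (single value or array) can be collapsed into one normalized form. The following is a minimal Python sketch of that idea, not the code in this MR: the function name normalize_retry and the returned dictionary layout are assumptions made for illustration, and the sketch requires max in the hash form because this description does not say what happens when it is omitted.

```python
def normalize_retry(value):
    """Return {'max': int, 'when': [str, ...]} for an accepted retry value.

    Hypothetical helper for illustration only; not GitLab's implementation.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("retry must be an integer or a hash")

    if isinstance(value, int):
        # Legacy integer form, e.g. `retry: 2`: retry on every failure.
        return {"max": value, "when": ["always"]}

    if isinstance(value, dict):
        # Hash form, e.g. `retry: {max: 2, when: runner_system_failure}`.
        # Assumption of this sketch: `max` must be present (KeyError otherwise).
        when = value.get("when", "always")  # `always` is the documented default
        if isinstance(when, str):
            when = [when]  # a single reason is treated as a one-element array
        return {"max": value["max"], "when": list(when)}

    raise TypeError("retry must be an integer or a hash")
```

With this sketch, `normalize_retry(2)` and `normalize_retry({"max": 2})` yield the same result, which mirrors the backward compatibility described above.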

Possible values for when:

always: the default, same behavior as before; retry on every failure
an array of failure_reasons (or a single one): retry only if the job failed for one of those reasons

failure_reasons are the keys of CI::Build.failure_reasons, currently:

unknown_failure
script_failure
api_failure
stuck_or_timeout_failure
runner_system_failure
missing_dependency_failure
runner_unsupported
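
The retry decision described here can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the Ruby code of this MR: the function should_retry, its parameters, and the FAILURE_REASONS constant are hypothetical names, and the set simply copies the list above.

```python
# Failure reasons as listed in this MR description (copied, not queried from GitLab).
FAILURE_REASONS = {
    "unknown_failure",
    "script_failure",
    "api_failure",
    "stuck_or_timeout_failure",
    "runner_system_failure",
    "missing_dependency_failure",
    "runner_unsupported",
}


def should_retry(when, failure_reason, retries_so_far, max_retries):
    """Decide whether a failed job should be retried.

    `when` is a list of failure reasons, or ["always"].
    Hypothetical helper for illustration only.
    """
    # Reject values that are neither `always` nor a known failure reason.
    unknown = set(when) - FAILURE_REASONS - {"always"}
    if unknown:
        raise ValueError(f"unknown retry conditions: {sorted(unknown)}")

    # Never exceed the configured maximum number of retries.
    if retries_so_far >= max_retries:
        return False

    # `always` matches any failure; otherwise the reason must be listed.
    return "always" in when or failure_reason in when
```

For example, with `when: [runner_system_failure]` and `max: 1`, a first failure caused by the runner system would be retried, while a script_failure would not.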

Examples

Retry always, maximum 5 times

retry: 5

# or

retry:
  max: 5

# or

retry:
  max: 5
  when: always

Retry once in case of api failure or system runner failure, but not on any other failure

retry:
  max: 1
  when:
    - api_failure
    - runner_system_failure

Only retry in case of runner system failure

retry:
  max: 1
  when:
    - runner_system_failure

# or

retry:
  max: 1
  when: runner_system_failure

Current State & Questions (last updated 12.10.2018 16:45 CEST / 14:45 GMT)

State

Tested in production on my own GitLab 11.3.4 instance; the feature seems to work as it should.

Questions

Is it OK to expose the internal failure_reason values in the retry config options to the end user?
- OK as per comment https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_101767080
How should the documentation be written? Should the integer style be marked as deprecated? See https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_102741035
- Written "side by side" currently (see https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/21758#note_103141963)
More tests needed?
Squash/cleanup commits?
Refactoring?

What are the relevant issue numbers?

closes gitlab-runner#3515 (closed)
gitlab-org/gitlab-ce#49634
gitlab-com/support-forum#3710

Does this MR meet the acceptance criteria?

Changelog entry added, if necessary
Documentation created/updated
Tests added for this feature/bug
Conforms to the code review guidelines
Conforms to the merge request performance guidelines

Edited Nov 07, 2018 by Grzegorz Bizon
