Feature Request: extend "retry pipeline" functionality to control how retry is performed
Description
The pipeline retry functionality added in 9.5 is a great thing to have. It would be even better if there were some control over the conditions under which a retry occurs. The script executed for a particular job can often tell what the nature of the failure was, and therefore whether it's worth retrying at all. This would help avoid wasting runner resources when a retry won't help.
Proposal
Overview
I propose adding some new configuration options alongside `retry`. I would be open to other ways of implementing this idea, but the most straightforward way of communicating the nature of the failure would be in the return value of the job's script. Right now, I believe this is only used to determine success (if zero is returned) or failure (if nonzero is returned). Something like:
- retry: 2
- retry_if: $retcode == 2 || $retcode > 10
- retry_delay: 50
In this case, `$retcode` would refer to the return value of the job's script. I went with a shell-like syntax here to be amenable to further extension in the future if other types of conditions are desired, but it's just a proposal. `retry_delay` would specify a delay (notionally in seconds) before the job is actually retried.
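To make the intended semantics concrete, here is a minimal Python sketch of how the three options might interact. It is only an illustration of the proposal, not how GitLab implements retries: `should_retry` and `run_with_retry` are hypothetical names, and the `retry_if` expression from the example is hard-coded as a Python condition rather than parsed.

```python
import time


def should_retry(retcode, retries_used, max_retries=2):
    """Decide whether a failed job is retried.

    Hard-codes the example condition `$retcode == 2 || $retcode > 10`
    (retry_if) and the example limit of 2 retries (retry).
    """
    if retcode == 0 or retries_used >= max_retries:
        return False
    return retcode == 2 or retcode > 10


def run_with_retry(run_job, max_retries=2, retry_delay=50, sleep=time.sleep):
    """Run `run_job` (a callable returning an exit code) under the proposed rules.

    `retry_delay` is in seconds; `sleep` is injectable so the loop can be
    exercised without actually waiting.
    """
    retries_used = 0
    while True:
        retcode = run_job()
        if not should_retry(retcode, retries_used, max_retries):
            return retcode
        sleep(retry_delay)  # wait before the next attempt
        retries_used += 1
```

With these rules, a job that exits with 1 fails immediately without consuming any retries, while a job that exits with 2 or with anything above 10 is retried up to two more times, with a 50 second pause before each retry.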
Use cases
I use Google Compute Engine virtual machines to perform builds using GitLab CI's autoscale feature. It would be a big cost-saver to be able to use preemptible instances for those builds, as they are much cheaper per hour. However, they come with an undesirable feature in that they can be terminated at any time, resulting in a failed build. The retry mechanism in 9.5 can be used to help combat this, but in the event of an actual build failure, the job will be retried some number of times, diminishing the cost savings, as you might end up keeping VMs alive for longer than needed.
VM preemption is detectable from within the virtual machine, so it should be possible to write the CI script in such a way that it will exit with an appropriate error code if preemption is imminent. This would signal to GitLab that the build failed, but failed in a way such that retrying the job is acceptable. Likewise, one might want to use a `retry_delay` in this case, since current cloud utilization conditions probably make further preemptions likely in the near future. So, you might want to wait 30 minutes before retrying the build, for example.
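As a rough sketch of what such a CI script wrapper could look like, the Python below runs the build command and, if it failed while the VM was being preempted, exits with a dedicated code. It assumes the Compute Engine metadata server's `instance/preempted` entry, which reports `TRUE` once the instance has been preempted. The exit code 2 is chosen only to match the `retry_if` example above, and the function names are hypothetical.

```python
import subprocess
import urllib.request

# Metadata server entry that reads "TRUE" once the instance is preempted.
PREEMPTED_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/preempted"
)
RETRYABLE_EXIT_CODE = 2  # assumption: matches `$retcode == 2` in retry_if


def fetch_preempted_flag(timeout=2.0):
    """Return the raw text of the metadata server's preempted flag."""
    request = urllib.request.Request(
        PREEMPTED_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode().strip()


def is_preempted(fetch=fetch_preempted_flag):
    """True if the VM reports preemption; False if the flag is unset or unreachable."""
    try:
        return fetch().upper() == "TRUE"
    except OSError:  # covers URLError and socket timeouts
        return False


def exit_code_for(build_rc, preempted):
    """Map a failed build on a preempted VM to the retryable exit code."""
    if build_rc != 0 and preempted:
        return RETRYABLE_EXIT_CODE
    return build_rc


def main(argv):
    """Run the build command given in argv and return the exit code to report."""
    build_rc = subprocess.call(argv)
    return exit_code_for(build_rc, is_preempted())
```

A job's script would call `sys.exit(main(sys.argv[1:]))` with the real build command as arguments. One caveat with this approach is that the build itself must never exit with the retryable code for an ordinary failure, otherwise genuine failures would be retried too.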
Feature checklist
Make sure these are completed before closing the issue, with a link to the relevant commit.
- Feature assurance
- Documentation
- Added to features.yml