Consider that occasional job failures can be easily bypassed

Problem to solve

When using GitLab CI/CD, a job sometimes fails unexpectedly because of external factors, even though an investigation shows that the pipeline could safely proceed. Some of these failures cannot be resolved with a Retry, and in those cases the ability to force the pipeline to continue would be useful.

Intended users

Personas are described at https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/

Further details

Example use case: a pipeline performs a dry run across a fleet of servers, then a deploy across the same fleet, then a set of notification or cleanup jobs. Each fleet is its own job inside a stage.
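A minimal sketch of how such a pipeline might be laid out in `.gitlab-ci.yml`. The stage names, job names, fleet names, and scripts (`deploy.sh`, `notify.sh`) are all hypothetical placeholders, not taken from a real project:

```yaml
# Hypothetical layout: one job per fleet in each stage.
stages:
  - dry-run
  - deploy
  - notify

dry-run-fleet-a:
  stage: dry-run
  script:
    - ./deploy.sh --dry-run fleet-a   # placeholder command

dry-run-fleet-b:
  stage: dry-run
  script:
    - ./deploy.sh --dry-run fleet-b   # placeholder command

deploy-fleet-a:
  stage: deploy
  script:
    - ./deploy.sh fleet-a             # placeholder command

deploy-fleet-b:
  stage: deploy
  script:
    - ./deploy.sh fleet-b             # placeholder command

notify:
  stage: notify
  script:
    - ./notify.sh                     # placeholder command
```

With this layout, a failure in either dry-run job stops the pipeline before the deploy stage starts.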

If a dry run fails because of an external factor on a node, for example, the dry-run job fails and the deployment stage that follows cannot start. Retrying that specific job may not pass either until the node is out of maintenance. Even after an investigation determines that it is safe to let the pipeline continue, there is currently no way to force it to do so. Instead, someone would have to add allow_failure: true to the job, or modify the underlying deployment mechanism, after the investigation, and then revert that change as soon as the node is out of maintenance.
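For illustration, the temporary workaround described above might look like the following fragment. The job name and script are hypothetical; only the `allow_failure` keyword comes from the issue text:

```yaml
# Hypothetical dry-run job, temporarily edited after the investigation.
stages:
  - dry-run

dry-run-fleet-a:
  stage: dry-run
  script:
    - ./deploy.sh --dry-run fleet-a   # placeholder command
  # Added only while the node is in maintenance.
  # Must be reverted once the node is back in service.
  allow_failure: true
```

This requires a commit to add the keyword and a second commit to remove it, which is the overhead this issue asks to avoid.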

Note that in this mock scenario the Play All button next to the Deploy stage doesn't perform any actions at all.

We would not want allow_failure: true on the job permanently, because this job should not normally fail. Again, this use case shows that a job can fail in circumstances where it is still acceptable for the pipeline to continue and succeed.

Proposal

Provide a mechanism that sits between permanently setting allow_failure: true on a job and having no override at all, so that a pipeline can be allowed to continue after a job fails.
