Backend: Jobs that run on_failure are sometimes unexpectedly skipped when they also have optional needs

Summary

When a job has when: on_failure, it should run when at least one other job in the same pipeline fails. When the job that has when: on_failure also has needs, the job is unexpectedly skipped when other jobs in the same pipeline fail.

If the needs are removed, the when: on_failure job works properly: it runs when other jobs in the same pipeline fail.

Steps to reproduce

  1. Use a .gitlab-ci.yml file like the one shown below
  2. Observe that the build job fails
  3. Observe that the rollback job is skipped (The rollback job should run because build failed.)

.gitlab-ci.yml used to reproduce:

build_job:
  stage: build
  script:
    - exit 1

test_job:
  stage: test
  script:
    - date

rollback_job:
  stage: deploy
  needs:
    - job: test_job
      optional: true
    - job: build_job
      optional: true
  script:
    - date
  when: on_failure

Proposal

The job is skipped because a DAG job (a job with needs) is marked as skipped whenever any of its needs is skipped or ignored. The condition below should be modified to accommodate this scenario.

https://gitlab.com/gitlab-org/gitlab/-/blob/13a24803c5569aea2e62b439991b7a99f4334e50/lib/gitlab/ci/status/composite.rb#L35-37

            if @dag && any_skipped_or_ignored?
              # The DAG job is skipped if one of the needs does not run at all.
              'skipped'
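
To make the reasoning above concrete, here is a simplified Ruby model of the decision. It is a sketch under stated assumptions, not GitLab's implementation: the method names `composite_status` and `on_failure_job_runs?` are hypothetical, the status rules are reduced to three outcomes, and it assumes test_job ends up skipped because build_job failed in an earlier stage.

```ruby
# Toy model of the composite status of a job's prerequisites.
# Hypothetical and heavily simplified; this is not the code in composite.rb.
SKIPPED_OR_IGNORED = %w[skipped ignored].freeze

def composite_status(statuses, dag:)
  return nil if statuses.empty?

  if dag && statuses.any? { |status| SKIPPED_OR_IGNORED.include?(status) }
    # Mirrors the quoted condition: one skipped need makes the whole set 'skipped'.
    'skipped'
  elsif statuses.include?('failed')
    'failed'
  else
    'success'
  end
end

# Assumption: a when: on_failure job starts only if the composite status is 'failed'.
def on_failure_job_runs?(statuses, dag:)
  composite_status(statuses, dag: dag) == 'failed'
end

# build_job failed; test_job was skipped (assumed, see the lead-in).
statuses = %w[failed skipped]

puts on_failure_job_runs?(statuses, dag: true)   # with needs: prints false
puts on_failure_job_runs?(statuses, dag: false)  # without needs: prints true
```

In this model the rollback job is skipped only in the DAG case, which is consistent with the observed behavior and with the job running once the needs are removed.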

Example Project

This unexpected behavior can be observed in the public gitlab-gold/briecarranza/issues/when-on-failure project.

What is the current bug behavior?

A job with when: on_failure is skipped when it contains needs and at least one job in the pipeline has failed.

Screenshot_2023-01-23_at_7.36.46_PM

What is the expected correct behavior?

A job with when: on_failure and needs should run when at least one other job in the pipeline has failed.

Screenshot_2023-01-23_at_7.40.59_PM

The screenshot above shows the expected result. Removing the needs altogether produces this result.

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes

Possible Workarounds

  • Remove the needs from the rollback job completely
    • This may not be feasible for some environments.
  • Possibly: move the jobs that may fail to another stage in the pipeline
    • I wrote "sometimes" in the issue title because I have so far identified one specific set of circumstances where when: on_failure combined with optional needs works as expected. See this example pipeline.

A few more thoughts on this:

Observe that the optional needs job fails.

The documentation on needs:optional notes:

To need a job that sometimes does not exist in the pipeline, add optional: true to the needs configuration.

That sounds like it's about the absence or presence of the needed job and not about the success or failure of the job.

  • Is the thought above right?

It is not possible to use allow_failure to work around this because we also note in the docs:

  • If allow_failure: true is set, the job is always considered successful, and later jobs with when: on_failure don’t start if this job fails.