Retry pipeline service does not pick up a retryable bridge/trigger job
Problem
When retrying an entire pipeline that contains a bridge/trigger job, that job is not picked up by the pipeline retry service.
Steps to reproduce
- Create a CI config with a bridge job and other jobs (example config below)
- Create a new pipeline
- Cancel the pipeline
- Retry the pipeline
The pipeline goes from canceled => running => canceled. Without any manual intervention, the pipeline status changes from running back to canceled. Canceled is in fact the true state of the pipeline, because the canceled trigger job was never retried. This can be checked from the Rails console:
pipeline = ::Ci::Pipeline.find(<ID>)
# this will not contain the bridge job
pipeline.retryable_builds
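The status flip described above can be reproduced with a small standalone model. This is only a sketch: `Job`, `pipeline_status`, and `retry_builds_only` are hypothetical names, and the status roll-up is a deliberately simplified stand-in for GitLab's real composite status logic.

```ruby
# Hypothetical, simplified model for illustration; not GitLab's actual classes.
Job = Struct.new(:name, :kind, :status)

# Toy status roll-up (an assumption, far simpler than the real logic):
# any active job makes the pipeline "running"; otherwise a canceled
# job makes it "canceled".
def pipeline_status(jobs)
  statuses = jobs.map(&:status)
  return "running"  if statuses.any? { |s| %w[pending running].include?(s) }
  return "canceled" if statuses.include?("canceled")
  return "failed"   if statuses.include?("failed")

  "success"
end

# Mirrors the reported behavior: only regular builds are reset on retry,
# so the bridge job keeps its canceled status.
def retry_builds_only(jobs)
  jobs.each do |job|
    job.status = "pending" if job.kind == :build && job.status == "canceled"
  end
end

jobs = [
  Job.new("wait_job", :build, "canceled"),
  Job.new("trigger_job", :bridge, "canceled")
]

history = [pipeline_status(jobs)]   # state after the pipeline is canceled
retry_builds_only(jobs)
history << pipeline_status(jobs)    # state right after the retry

# Let the retried builds finish successfully.
jobs.each { |job| job.status = "success" if job.status == "pending" }
history << pipeline_status(jobs)    # state once the builds are done

puts history.join(" => ")
```

Under these assumptions the pipeline ends up canceled again as soon as the retried builds finish, because the bridge job was never reset.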
Example CI config
workflow:
  name: 'Ruby 3.0 master branch pipeline'

stages:
  - build
  - test
  - deploy

wait_job:
  stage: build
  script:
    - sleep 10

build_job:
  stage: build
  needs: ["wait_job"]
  script:
    - echo "finished"
  only:
    - master

mr_job:
  stage: test
  script:
    - echo "Merge request job"
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"

test_job_colorized:
  stage: test
  script:
    - echo -e "\033[32mINF\033[0m \033[1mIgnoring old update for Nodegroup\033[0m \033[36mcreated=\033[0m2025-05-27T22:18:02Z"
    - echo -e "\033[31mERROR\033[0m Something failed here"
    - echo -e "\033[33mWARN\033[0m This is a warning message"
    - echo -e "\033[1;31mFATAL\033[0m Critical error occurred"

test_manual_job:
  stage: test
  script:
    - sleep 10
    - echo $TEST_VAR
    - echo $TEST_VAR_TWO
  when: manual
  only:
    - master

allow_failure_test_job:
  stage: test
  script: exit 1
  allow_failure: true

test_job_one:
  stage: test
  script:
    - echo $TEST
  only:
    - master

test_job_two:
  stage: test
  script:
    - echo "testing..."
  only:
    - master

artifact_job:
  stage: deploy
  script: echo "some file content" > my_artifact.txt
  artifacts:
    paths:
      - my_artifact.txt
    expire_in: 1 week
  environment:
    name: production
    url: https://www.google.com/
  only:
    - master

coverage_job:
  stage: test
  script: echo "82.71"
  coverage: '/\d+\.\d+/'

tag_job:
  stage: test
  script: echo "tag job"
  only:
    - tags

trigger_job:
  stage: deploy
  trigger: root/downstream-project
Technical Proposal
There are a few options here, which might depend on customer needs.
- Update pipeline.retryable_builds to include triggers. Retrying the cancelled trigger job then retries the downstream pipeline.
  - Pros:
    - It matches user expectations when they click "Retry all failed or cancelled jobs"
    - The upstream change that necessitated the retry likely affects downstream work
  - Cons:
    - A large behavioral change which we haven't seen users requesting in the issues
- Have the trigger job not affect the pipeline status when the trigger is cancelled.
  - Cons:
    - Today, if only the downstream pipeline is cancelled, the main pipeline is considered cancelled via the mirrored status, which is intentional. This might not be clean to resolve.
    - Might not match user expectations for pipeline status.
  - Pros:
    - Low possibility of introducing unexpected behavior
- Hybrid: Allow users to configure the behavior of the trigger job on retry, to keep it backward compatible and to accommodate cases where it's not safe to retry the downstream work (for example, when it is not idempotent).
I vote we go with option 1 (the first option above), for a simple default behavior that should meet the needs of most cases.
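To make the first option concrete, here is a minimal standalone sketch of the selection change. `SketchJob`, `PipelineSketch`, and `retryable_jobs` are hypothetical names introduced for illustration; only `retryable_builds` comes from the description above, and its body here is an assumption about the reported behavior rather than the actual implementation.

```ruby
# Hypothetical stand-ins for illustration only; not GitLab's real models.
SketchJob = Struct.new(:name, :kind, :status) do
  # Assumption: a job can be retried once it has failed or been canceled.
  def retryable?
    %w[failed canceled].include?(status)
  end
end

class PipelineSketch
  def initialize(jobs)
    @jobs = jobs
  end

  # Behavior as reported in this issue: bridge jobs are never selected.
  def retryable_builds
    @jobs.select { |job| job.kind == :build && job.retryable? }
  end

  # First option: select every retryable job, bridges included.
  def retryable_jobs
    @jobs.select(&:retryable?)
  end
end

pipeline = PipelineSketch.new([
  SketchJob.new("wait_job", :build, "canceled"),
  SketchJob.new("coverage_job", :build, "success"),
  SketchJob.new("trigger_job", :bridge, "canceled")
])

current  = pipeline.retryable_builds.map(&:name)
proposed = pipeline.retryable_jobs.map(&:name)

puts "current:  #{current.inspect}"
puts "proposed: #{proposed.inspect}"
```

In this sketch the only difference between the two selections is `trigger_job`. A real change would also need to decide how retrying the bridge interacts with the downstream pipeline, which this model does not cover.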
Edited by Allison Browne