
Retry pipeline service does not pick up a retryable bridge/trigger job

Problem

When retrying an entire pipeline that contains a bridge/trigger job, that job is not picked up by the pipeline retry service.

Steps to reproduce

  1. Create a CI config with a bridge job and other jobs (example config below)
  2. Create a new pipeline
  3. Cancel the pipeline
  4. Retry the pipeline

The pipeline goes from canceled => running => canceled. Without any manual intervention, the pipeline changes state from running back to canceled. Canceled is the true state of the pipeline, because the canceled trigger job was never retried.

pipeline = ::Ci::Pipeline.find(<ID>)

# this will not contain the bridge job
pipeline.retryable_builds
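
The observed canceled => running => canceled sequence can be illustrated with a small, self-contained toy model. This is only a sketch: `ToyJob`, `ToyPipeline`, and the simplified status roll-up are hypothetical and are not GitLab's implementation. It also assumes the pipeline falls back to canceled once the retried builds finish, which the report above does not state explicitly.

```ruby
# Toy model only: ToyJob and ToyPipeline are hypothetical names,
# not GitLab classes. They mimic the behavior described in this issue.
ToyJob = Struct.new(:name, :kind, :status) do
  def retryable?
    %w[failed canceled].include?(status)
  end
end

class ToyPipeline
  attr_reader :jobs

  def initialize(jobs)
    @jobs = jobs
  end

  # Mirrors the reported behavior: bridge jobs are filtered out.
  def retryable_builds
    jobs.select { |job| job.kind == :build && job.retryable? }
  end

  def retry!
    retryable_builds.each { |job| job.status = "pending" }
  end

  # Simplified status roll-up (an assumption made for illustration).
  def status
    statuses = jobs.map(&:status)
    return "running" if statuses.any? { |s| %w[pending running].include?(s) }
    return "canceled" if statuses.include?("canceled")

    "success"
  end
end

toy = ToyPipeline.new([
  ToyJob.new("wait_job", :build, "canceled"),
  ToyJob.new("trigger_job", :bridge, "canceled")
])

puts toy.status # canceled
toy.retry!
puts toy.status # running
toy.jobs.each { |job| job.status = "success" if job.status == "pending" }
puts toy.status # canceled, because trigger_job was never retried
```

In this model the bridge job stays canceled throughout, so the roll-up returns to canceled as soon as no retried build is still pending or running.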

Example CI config

workflow:
  name: 'Ruby 3.0 master branch pipeline'

stages:
    - build
    - test
    - deploy

wait_job:
    stage: build
    script: 
        - sleep 10

build_job:
    stage: build
    needs: ["wait_job"]
    script:
        - echo "finished"
    only: 
        - master

mr_job:
    stage: test
    script:
        - echo "Merge request job"
    rules:
        - if: $CI_PIPELINE_SOURCE == "merge_request_event"

test_job_colorized:
  stage: test
  script:
    - echo -e "\033[32mINF\033[0m \033[1mIgnoring old update for Nodegroup\033[0m \033[36mcreated=\033[0m2025-05-27T22:18:02Z"
    - echo -e "\033[31mERROR\033[0m Something failed here"
    - echo -e "\033[33mWARN\033[0m This is a warning message"
    - echo -e "\033[1;31mFATAL\033[0m Critical error occurred"

test_manual_job:
    stage: test
    script: 
        - sleep 10
        - echo $TEST_VAR
        - echo $TEST_VAR_TWO
    when: manual
    only: 
     - master
     
allow_failure_test_job:
    stage: test
    script: exit 1
    allow_failure: true

test_job_one:
    stage: test
    script: 
        - echo $TEST
    only: 
        - master

test_job_two:
    stage: test
    script: 
        - echo "testing..."
    only: 
        - master

artifact_job:
    stage: deploy
    script: echo "some file content" > my_artifact.txt
    artifacts: 
        paths:
            - my_artifact.txt
        expire_in: 1 week
    environment: 
        name: production
        url: https://www.google.com/
    only: 
        - master

coverage_job:
    stage: test
    script: echo "82.71"
    coverage: '/\d+\.\d+/'

tag_job:
    stage: test
    script: echo "tag job"
    only:
        - tags

trigger_job:
    stage: deploy
    trigger: root/downstream-project

Technical Proposal

There are a few options here, and the right one might depend on customer needs.

  1. Update pipeline.retryable_builds to include triggers. Retrying the canceled trigger job then retries the downstream pipeline.
  • Pros:
    • It matches user expectations when they click "Retry all failed or cancelled jobs"
    • The upstream change that necessitated the retry likely affects downstream work
  • Con: A large behavioral change that we haven't seen users requesting in the issues
  2. Have the trigger job not affect the pipeline status when the trigger is canceled.
  • Cons:
    • Today, if only the downstream pipeline is canceled, the main pipeline is considered canceled via the mirrored status, which is intentional. This might not be clean to resolve.
    • Might not match user expectations for pipeline status.
  • Pro: Low possibility of introducing unexpected behavior
  3. Hybrid: Allow users to configure the behavior of the trigger job on retry, to stay backward compatible and to accommodate cases where it's not safe to retry the downstream work (not idempotent).
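
As a rough illustration of how the first option and the hybrid option differ, here is a minimal, self-contained sketch. Everything in it is hypothetical: `SketchJob`, the `RetrySketch` module, and the `retry_downstream` flag are invented for this example and do not correspond to existing GitLab code or CI keywords.

```ruby
# Hypothetical sketch: none of these names exist in GitLab.
# retry_downstream stands in for a user-configurable setting;
# nil means "not configured".
SketchJob = Struct.new(:name, :kind, :status, :retry_downstream) do
  def retryable?
    %w[failed canceled].include?(status)
  end
end

module RetrySketch
  # First option: bridges are retried alongside regular builds.
  def self.include_bridges(jobs)
    jobs.select(&:retryable?)
  end

  # Hybrid option: a bridge is retried unless it explicitly opts out,
  # for example because the downstream work is not idempotent.
  def self.hybrid(jobs)
    jobs.select do |job|
      next false unless job.retryable?

      job.kind == :build || job.retry_downstream != false
    end
  end
end

jobs = [
  SketchJob.new("wait_job", :build, "canceled", nil),
  SketchJob.new("trigger_job", :bridge, "canceled", nil),
  SketchJob.new("unsafe_trigger_job", :bridge, "canceled", false)
]

p RetrySketch.include_bridges(jobs).map(&:name)
p RetrySketch.hybrid(jobs).map(&:name)
```

With the sample data, the first selector returns all three jobs, while the hybrid selector skips `unsafe_trigger_job` because it opted out. Defaulting an unconfigured bridge to "retry" is one possible design choice; the opposite default would preserve today's behavior instead.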

I vote we go with option 1, for a simple default behavior that should meet the needs of most cases.

Edited by Allison Browne