Add feature flag to fail jobs with expired JWT job token

What does this MR do and why?

The runner can receive a 403 when it uses an expired CI job token JWT, which can leave the job stuck in the running state. This MR adds a pair of feature flags that, when both are enabled, should help fix this:

  1. ci_job_token_decode_ignore_expiration decodes the JWT payload so the job can be identified even when the token has expired. The expiration is then checked manually and reported via a new exception object. This is a separate feature flag to de-risk the token handling change, and it needs to be rolled out first.
  2. Building on the above, we introduce a new failure reason and, behind the fail_job_on_expired_token flag, mark the associated job as failed when the new expiration exception is caught.
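
To illustrate the two steps above, here is a minimal, self-contained sketch of the idea. It is not the code in this MR: it uses a hand-rolled HS256 token built with the Ruby standard library instead of GitLab's JWT classes, and the names ExpiredTokenError, decode_token, authenticate_job, the job_id claim, and the :job_token_expired failure reason are all hypothetical. The fail_on_expired keyword stands in for the fail_job_on_expired_token flag.

```ruby
require "json"
require "openssl"

# Hypothetical exception that carries the payload of an expired token,
# so the caller can still identify the job.
class ExpiredTokenError < StandardError
  attr_reader :payload

  def initialize(payload)
    @payload = payload
    super("job token expired at #{Time.at(payload.fetch('exp')).utc}")
  end
end

def b64url_encode(bytes)
  [bytes].pack("m0").tr("+/", "-_").delete("=")
end

def b64url_decode(text)
  padded = text + "=" * ((4 - text.length % 4) % 4)
  padded.tr("-_", "+/").unpack1("m0")
end

# Helper for this sketch only: builds a compact HS256 token.
def encode_token(payload, secret)
  header = b64url_encode(JSON.generate({ "alg" => "HS256", "typ" => "JWT" }))
  body = b64url_encode(JSON.generate(payload))
  signature = OpenSSL::HMAC.digest("SHA256", secret, "#{header}.#{body}")
  [header, body, b64url_encode(signature)].join(".")
end

# Verify the signature first, then check the expiration manually.
# An expired but authentic token raises ExpiredTokenError with its payload.
def decode_token(token, secret, now: Time.now)
  header, body, signature = token.split(".")
  raise ArgumentError, "malformed token" unless header && body && signature

  expected = OpenSSL::HMAC.digest("SHA256", secret, "#{header}.#{body}")
  unless OpenSSL.secure_compare(b64url_decode(signature), expected)
    raise ArgumentError, "invalid signature"
  end

  payload = JSON.parse(b64url_decode(body))
  raise ExpiredTokenError.new(payload) if payload.fetch("exp") <= now.to_i

  payload
end

# Hypothetical stand-in for a CI job record.
Job = Struct.new(:id, :status, :failure_reason)

# Returns the job for a valid token. For an expired token it returns nil and,
# only when fail_on_expired is true, marks the identified job as failed.
def authenticate_job(token, secret, jobs, fail_on_expired:, now: Time.now)
  payload = decode_token(token, secret, now: now)
  jobs.fetch(payload.fetch("job_id"))
rescue ExpiredTokenError => e
  job = jobs[e.payload["job_id"]]
  if fail_on_expired && job
    job.status = :failed
    job.failure_reason = :job_token_expired # hypothetical reason name
  end
  nil
end
```

In this sketch the signature is still verified before the payload is trusted; only the expiration check is moved out of the decode step, which is what lets the caller tell "expired but authentic" apart from "invalid".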

References

See gitlab-runner#38356

Rollout issues:

Screenshots or screen recordings

Before: Screenshot_2026-01-13_at_1.07.17_PM
After: Screenshot_2026-01-13_at_1.13.52_PM

How to set up and validate locally

1. Reproduce the bug

  1. Feature.enable(:ci_job_token_jwt)
  2. To speed things up, apply the following patch:
    diff --git a/lib/ci/job_token/jwt.rb b/lib/ci/job_token/jwt.rb
    index a71a7d926573..0f3bb364736c 100644
    --- a/lib/ci/job_token/jwt.rb
    +++ b/lib/ci/job_token/jwt.rb
    @@ -90,8 +90,7 @@ def subject_type
             end
     
             def expire_time(job)
    -          ttl = [::JSONWebToken::Token::DEFAULT_EXPIRE_TIME, job.timeout_value.to_i].max
    -          Time.current + ttl + LEEWAY
    +          Time.current + [5.seconds, job.timeout_value.to_i].max
             end
     
             def key
  3. Restart Rails (gdk restart rails) to make sure the patch has been picked up
  4. Create a project with the following .gitlab-ci.yml:
    test:
      timeout: 15s
      script: 
        - i=1; while [ $i -le 60 ]; do echo $i; sleep 10; i=$((i + 1)); done
        - exit 0

The job should print a few lines, but eventually stop printing and get stuck in running indefinitely.

2. Smoke test for ci_job_token_decode_ignore_expiration

  1. Feature.enable(:ci_job_token_decode_ignore_expiration)
  2. Verify the bug is still present
  3. Verify that a minimal pipeline works:
    test:
      script: exit 0

3. Test the full fix

  1. Feature.enable(:fail_job_on_expired_token)
  2. Run the timeout pipeline from step 1 and verify that the job now fails with the new timeout error instead of staying stuck in running

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Hordur Freyr Yngvason
