Add feature flag to fail jobs with expired JWT job token
What does this MR do and why?
The runner can receive a 403 when it uses an expired CI job token JWT, which can leave the job stuck in the running state. This MR adds a pair of feature flags that, when both are enabled, should help fix this:
- `ci_job_token_decode_ignore_expiration` to decode the JWT payload to identify the job even when the token is expired. The expiration is then checked manually and passed via a new exception object. This is a separate feature flag to derisk the token handling change. It needs to be rolled out first.
- Building on the above, we introduce a failure reason and, behind the `fail_job_on_expired_token` flag, mark the associated job as failed when we catch the new expiration exception.
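The decode-then-check idea behind the first flag can be sketched in plain Ruby. This is a minimal illustration, not the code in this MR: `SketchToken` and `SketchExpiredTokenError` are hypothetical names, and HS256 with a shared secret is assumed only so the example stays self-contained with the standard library.

```ruby
require 'base64'
require 'json'
require 'openssl'

# Hypothetical exception: carries the decoded payload so the caller can still
# identify the job after the token has expired.
class SketchExpiredTokenError < StandardError
  attr_reader :payload

  def initialize(payload)
    @payload = payload
    super('token has expired')
  end
end

module SketchToken
  module_function

  def b64(data)
    Base64.urlsafe_encode64(data, padding: false)
  end

  def signature(signing_input, secret)
    b64(OpenSSL::HMAC.digest('SHA256', secret, signing_input))
  end

  # Builds an HS256-signed token; only used here to produce sample input.
  def encode(payload, secret)
    header = b64(JSON.generate({ alg: 'HS256', typ: 'JWT' }))
    signing_input = "#{header}.#{b64(JSON.generate(payload))}"
    "#{signing_input}.#{signature(signing_input, secret)}"
  end

  # Verifies the signature, decodes the payload regardless of `exp`, then
  # checks expiration manually and raises with the payload attached.
  def decode(token, secret, now: Time.now)
    header, body, sig = token.split('.')
    raise ArgumentError, 'malformed token' unless header && body && sig

    expected = signature("#{header}.#{body}", secret)
    raise ArgumentError, 'invalid signature' unless OpenSSL.secure_compare(expected, sig)

    payload = JSON.parse(Base64.urlsafe_decode64(body))
    raise SketchExpiredTokenError.new(payload) if payload['exp'] && payload['exp'] <= now.to_i

    payload
  end
end

secret = 'not-a-real-secret'
expired = SketchToken.encode({ 'job_id' => 42, 'exp' => Time.now.to_i - 60 }, secret)
begin
  SketchToken.decode(expired, secret)
rescue SketchExpiredTokenError => e
  puts "expired token belongs to job #{e.payload['job_id']}"
end
```

The signature is still verified before the payload is trusted. Only the expiration check moves out of the decoder, so the caller learns which job the expired token belonged to.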
References
Rollout issues:
How to set up and validate locally
1. Reproduce the bug
- Feature.enable(:ci_job_token_jwt)
- To speed things up, apply the following patch:

      diff --git a/lib/ci/job_token/jwt.rb b/lib/ci/job_token/jwt.rb
      index a71a7d926573..0f3bb364736c 100644
      --- a/lib/ci/job_token/jwt.rb
      +++ b/lib/ci/job_token/jwt.rb
      @@ -90,8 +90,7 @@ def subject_type
               end

               def expire_time(job)
      -          ttl = [::JSONWebToken::Token::DEFAULT_EXPIRE_TIME, job.timeout_value.to_i].max
      -          Time.current + ttl + LEEWAY
      +          Time.current + [5.seconds, job.timeout_value.to_i].max
               end

               def key

- Restart rails (gdk restart rails) to make sure those changes have been picked up
- Create a project with the following .gitlab-ci.yml:

      test:
        timeout: 15s
        script:
          - i=1; while [ $i -le 60 ]; do echo $i; sleep 10; i=$((i + 1)); done
          - exit 0

The job should print a few lines, but eventually stop printing and get stuck in running indefinitely.
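The effect of the patch on token lifetime can be sketched with plain integers (seconds). This is an illustration only: the real values of `DEFAULT_EXPIRE_TIME` and `LEEWAY` are not stated in this description, so the numbers passed for them below are hypothetical placeholders.

```ruby
# Plain-Ruby stand-ins for the two versions of expire_time in the patch,
# returning the token lifetime in seconds instead of an absolute time.
def original_ttl(job_timeout, default_expire:, leeway:)
  [default_expire, job_timeout].max + leeway
end

def patched_ttl(job_timeout)
  [5, job_timeout].max
end

job_timeout = 15 # matches `timeout: 15s` in the .gitlab-ci.yml

puts patched_ttl(job_timeout) # => 15

# default_expire and leeway are hypothetical placeholder values.
puts original_ttl(job_timeout, default_expire: 60, leeway: 300) # => 360
```

With the patch applied, the token lifetime is just the 15 second job timeout, while the script loops for up to 60 iterations of 10 seconds each, so the token expires while the job is still running.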
2. Smoke test for ci_job_token_decode_ignore_expiration
- Feature.enable(:ci_job_token_decode_ignore_expiration)
- Verify the bug is still present
- Verify that a minimal pipeline works:

      test:
        script: exit 0

3. Test the full fix
- Feature.enable(:fail_job_on_expired_token)
- Run the timeout pipeline, verify that it now fails with the new timeout error
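The behavior this step verifies can be sketched roughly as follows. Everything here is a hypothetical stand-in rather than code from this MR: `FLAGS` replaces the real feature flag check, `Job` is a bare struct, the failure reason name is made up, and returning `:forbidden` in both cases is an assumption about what the runner sees.

```ruby
# Hypothetical exception raised when an expired token has been decoded far
# enough to identify its job.
class ExpiredJobTokenError < StandardError
  attr_reader :job_id

  def initialize(job_id)
    @job_id = job_id
    super("token for job #{job_id} has expired")
  end
end

Job = Struct.new(:id, :status, :failure_reason)

# Simplified flag store standing in for a real feature flag lookup.
FLAGS = { fail_job_on_expired_token: false }

def authenticate(job, token_expired)
  raise ExpiredJobTokenError.new(job.id) if token_expired

  :ok
end

def handle_runner_request(job, token_expired:)
  authenticate(job, token_expired)
rescue ExpiredJobTokenError
  if FLAGS[:fail_job_on_expired_token]
    # Flag on: the job leaves the running state instead of hanging.
    job.status = :failed
    job.failure_reason = :job_token_expired # hypothetical reason name
  end
  :forbidden # assumed: the request is rejected whether or not the flag is on
end

job = Job.new(1, :running, nil)
FLAGS[:fail_job_on_expired_token] = true
handle_runner_request(job, token_expired: true)
puts job.status # => failed
```

With the flag off, the same request leaves the job in `:running`, which corresponds to the stuck state reproduced in step 1.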
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Edited by Hordur Freyr Yngvason

