CI_JOB_TOKEN and ID_TOKENS invalidated on cancelled jobs
Summary
When CI_JOB_TOKEN or ID_TOKENS is used in the after_script section of a CI/CD job, the token currently behaves differently in the two scenarios listed in our docs.
after_script commands also run when:
- The job is cancelled while the before_script or script sections are still running.
- The job fails with failure type of script_failure, but not other failure types.
- In the first case, if the token is used for Git operations (e.g. git clone), it raises a permission error in the after_script section.
- In the second case, if the token is used for Git operations (e.g. git clone), it still works fine in the after_script section.
It's also worth noting that when a job is cancelled, a brand new container is created, whereas a failed job reuses the existing container. This might warrant a separate issue, but it's likely related to how we handle the validity of the tokens.
# example error
git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<group>/<project>.git
Cloning into '<project>'...
remote: HTTP Basic: Access denied. The provided password or token is incorrect or your account has 2FA enabled and you must use a personal access token instead of a password. See https://gitlab.com/help/topics/git/troubleshooting_git#error-on-git-fetch-http-basic-access-denied
fatal: Authentication failed for 'https://gitlab.com/<group>/<project>.git/'
Affected tokens
Steps to reproduce
- Create a Dummy project A.
- Create a project B.
- Add project B to project A's allowlist.
- Create the following .gitlab-ci.yml:
stages: # List of stages for jobs, and their order of execution
  - build

non-cancelled-job: # This job runs in the build stage, which runs first.
  stage: build
  variables:
    CI_DEBUG_TRACE: "true"
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running non-cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - exit 1 # force the job to fail so that after_script runs
    - sleep 300
  after_script:
    # Single quotes keep the shell from treating `script` as a command substitution.
    - echo 'Execute this command after the `script` section completes.'
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git

cancelled-job: # This job also runs in the build stage; cancel it manually during the sleep.
  stage: build
  variables:
    CI_DEBUG_TRACE: "true"
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - sleep 300
  after_script:
    - echo 'Execute this command after the `script` section completes.'
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
- Let the non-cancelled-job fail on its own.
- After the cancelled-job starts the sleep command, cancel it.
- Observe the permission error in the after_script section of the cancelled-job.
Example Project
https://gitlab.com/kballon-bug-report/zd549518_ci_job_token_after_script/-/pipelines/1380128080
What is the current bug behavior?
The tokens encounter a permission error in the after_script section of the cancelled job.
What is the expected correct behavior?
The tokens should not encounter a permission error in the after_script section of the cancelled job.
Relevant logs and/or screenshots
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)
Proposal
We should be able to fix this in Ci::AuthJobFinder, which uses validate_running_job!, by changing that code to accept a job that is either canceling or running. We can use the EXECUTING_STATUSES constant to check that the job is still executing? instead of running?. The executing? method would need to be defined.
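The proposed check could look roughly like the standalone sketch below. It is not GitLab's actual code: the Job struct, the NotExecutingJobError class, and the validate_executing_job! name are hypothetical stand-ins, and the contents of EXECUTING_STATUSES are assumed from the two statuses named in the proposal (running and canceling).

```ruby
# Standalone sketch of the proposal; not taken from the GitLab codebase.
# Assumption: EXECUTING_STATUSES holds the two statuses named in the proposal.
EXECUTING_STATUSES = %w[running canceling].freeze

# Hypothetical stand-in for a CI job record.
Job = Struct.new(:status) do
  # Running-only predicate, which rejects a job that is being canceled.
  def running?
    status == 'running'
  end

  # The predicate the proposal says would need to be defined.
  def executing?
    EXECUTING_STATUSES.include?(status)
  end
end

# Hypothetical error class used only in this sketch.
class NotExecutingJobError < StandardError; end

# Hypothetical replacement for a running-only validation:
# accepts jobs that are running or canceling, rejects everything else.
def validate_executing_job!(job)
  raise NotExecutingJobError, "job is #{job.status}" unless job.executing?

  job
end
```

With a check like this, a token belonging to a job in the canceling state would still authenticate while after_script runs, whereas a job that has already finished would still be rejected.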