CI_JOB_TOKEN and ID_TOKENS invalidated on cancelled jobs
Summary
When CI_JOB_TOKEN or ID_TOKENS is used in the after_script section of a CI/CD job, the token currently behaves differently across the following scenarios listed in our docs.
after_script commands also run when:
- The job is cancelled while the before_script or script sections are still running.
- The job fails with failure type of script_failure, but not other failure types.
- In the first case, if the token is used for Git operations (e.g. git clone), it raises a permission error in the after_script section.
- In the second case, if the token is used for Git operations (e.g. git clone), it still works fine in the after_script section.
It's also worth noting that a cancelled job creates a brand new container, whereas a failed job reuses the existing container. This might warrant a separate issue, but it's likely related to how we handle the validity of the tokens.
# example error
git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<group>/<project>.git
Cloning into '<project>'...
remote: HTTP Basic: Access denied. The provided password or token is incorrect or your account has 2FA enabled and you must use a personal access token instead of a password. See https://gitlab.com/help/topics/git/troubleshooting_git#error-on-git-fetch-http-basic-access-denied
fatal: Authentication failed for 'https://gitlab.com/<group>/<project>.git/'
Affected tokens
Steps to reproduce
- Create a dummy project A.
- Create a project B.
- Add project B to project A's allowlist.
- Create the following .gitlab-ci.yml.
stages: # List of stages for jobs, and their order of execution
  - build

non-cancelled-job: # This job runs in the build stage, which runs first.
  stage: build
  variables:
    CI_DEBUG_TRACE: "true"
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running non-cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - exit 1 # force job to fail to trigger after_script
    - sleep 300 # never reached because of the exit above
  after_script:
    # Single quotes keep the shell from treating `script` as a command substitution.
    - echo 'Execute this command after the `script` section completes.'
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git

cancelled-job: # This job also runs in the build stage; cancel it during the sleep.
  stage: build
  variables:
    CI_DEBUG_TRACE: "true"
    TEST_VAR: '${CI_JOB_TOKEN}'
    GIT_STRATEGY: clone
  script:
    - echo "Running cancelled-job..."
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
    - ls -la <project>
    - sleep 300
  after_script:
    - echo 'Execute this command after the `script` section completes.'
    - ls -la <project>
    - rm -rf <project>
    - git clone https://gitlab-ci-token:${CI_JOB_TOKEN}@gitlab.com/<groupA>/<project>.git
- Let the non-cancelled-job fail on its own.
- After the cancelled-job starts the sleep command, cancel it.
- Observe the permission error in the after_script.
Example Project
https://gitlab.com/kballon-bug-report/zd549518_ci_job_token_after_script/-/pipelines/1380128080
What is the current bug behavior?
The tokens encounter a permission error in the after_script of the cancelled job.
What is the expected correct behavior?
Tokens should not encounter a permission error on the cancelled job.
Relevant logs and/or screenshots
Output of checks
This bug happens on GitLab.com
Proposal
We should be able to fix this in Ci::AuthJobFinder, which uses validate_running_job!, by changing that code to check that a job is either canceling or running. We can use the EXECUTING_STATUSES constant to check that the job is still executing? instead of running. The executing? method would need to be defined.
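The proposed check can be sketched in standalone Ruby. This is only an illustration under stated assumptions, not the actual GitLab code: the Job struct, the NotExecutingJobError class, and validate_executing_job! are hypothetical stand-ins, and the contents of EXECUTING_STATUSES are assumed to be the two states named above (running and canceling).

```ruby
# Standalone sketch; none of these definitions are GitLab's real implementation.
# Assumption: EXECUTING_STATUSES lists the states in which a job's scripts
# (including after_script) may still be running.
EXECUTING_STATUSES = %w[running canceling].freeze

# Hypothetical stand-in for the error raised when a job token is rejected.
class NotExecutingJobError < StandardError; end

# Minimal stand-in for a CI job record that only tracks its status.
Job = Struct.new(:status) do
  # Hypothetical predicate proposed in this issue.
  def executing?
    EXECUTING_STATUSES.include?(status)
  end
end

# Hypothetical replacement for validate_running_job!: accepts the job
# while it is still executing and raises otherwise.
def validate_executing_job!(job)
  raise NotExecutingJobError, "job is #{job.status}" unless job.executing?

  job
end

# Show which statuses would pass the check under these assumptions.
%w[running canceling canceled failed].each do |status|
  begin
    validate_executing_job!(Job.new(status))
    puts "#{status}: token accepted"
  rescue NotExecutingJobError
    puts "#{status}: token rejected"
  end
end
```

Under these assumptions, a job in the canceling state would still authenticate while its after_script runs, and a job that has reached canceled would not. The sketch does not try to model the script_failure case.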
We should check that the runner side is also equipped to authenticate the job in after_script, since it runs in a separate shell. I think it should be, given Kent's description that it works on script failure. Note: currently we only allow running, so this quote from the issue is surprising:
The job fails with failure type of script_failure, but not other failure types.