Introduce graceful-cancellation state for CI Jobs
Problem to solve
Currently, cancelling a job tells the runner to immediately drop everything and stop executing the current job. As a result, after_script does not run for the job, and there is no opportunity to clean up resources that were created during execution.
Intended users
This problem might apply to anyone cancelling a pipeline: manually via the web UI, through the API, or implicitly by kicking off a pipeline while a running pipeline on the same ref has interruptible: true jobs.
Further details
This is one specific change that we can make to support usage of job:after_script #15603 (closed), which has broad support in the community. It will not completely address that issue, because the runner also needs changes to halt execution of script and continue on to executing after_script.
Once this is implemented, we can implement the same action on the pipeline level, which would trigger a graceful cancellation of all associated jobs.
Gracefully cancel entire pipeline: #35358 (closed)
Proposal
Backend
We should add a graceful-cancellation state for running CI jobs. When the runner sees a job in this state, it would stop executing the before_script and script sections of the job and move on to executing the after_script portion. This would be considered an active, running state for Ci::Build. When after_script has finished executing, the runner would still have time to upload artifacts, and the job would then enter the existing "canceled" state as it does today.
If we want to transition to some other state based on the success or failure of after_script during a graceful shutdown, that is also possible, but it is not currently part of this proposal.
If a job is in either the created or pending state, it should transition directly to canceled, as no execution has happened yet.
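The transitions described above can be sketched as a small state machine. This is only an illustration in Python, not the Ci::Build implementation: the state name `canceling`, the helper functions `request_cancel` and `after_script_finished`, and the treatment of a repeated cancel request as a no-op are all assumptions made for the example.

```python
from enum import Enum


class JobStatus(Enum):
    CREATED = "created"
    PENDING = "pending"
    RUNNING = "running"
    CANCELING = "canceling"  # hypothetical name for the graceful-cancellation state
    CANCELED = "canceled"


# The proposal counts graceful cancellation as an active, running state.
ACTIVE_STATUSES = {JobStatus.RUNNING, JobStatus.CANCELING}


def request_cancel(status: JobStatus) -> JobStatus:
    """Return the status a job moves to when cancellation is requested."""
    if status in (JobStatus.CREATED, JobStatus.PENDING):
        # Nothing has executed yet, so there is nothing to clean up.
        return JobStatus.CANCELED
    if status is JobStatus.RUNNING:
        # The runner is expected to stop script and continue with after_script.
        return JobStatus.CANCELING
    # Assumption: a repeated request on a canceling or canceled job changes nothing.
    return status


def after_script_finished(status: JobStatus) -> JobStatus:
    """Return the status once the runner reports that after_script is done."""
    if status is JobStatus.CANCELING:
        return JobStatus.CANCELED
    return status
```

Under these assumptions, a running job passes through the intermediate state (`request_cancel(JobStatus.RUNNING)` gives `JobStatus.CANCELING`, and `after_script_finished` then gives `JobStatus.CANCELED`), while a pending job goes straight to `JobStatus.CANCELED`.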
Concern: the configuration documentation specifies that after_script is run in a separate shell context. There's emphasis in the original issue that after_script execution needs to be tightly coupled with the job script specifically to have access to data (e.g. docker container IDs) from the script being cancelled. Is this going to be a problem? If so, is there any way to change that?
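To make the concern concrete, here is a rough Python sketch in which two separate child processes stand in for the script and after_script shell contexts. It does not model the runner itself. The variable name `CONTAINER_ID`, the value `abc123`, and the file `container_id.txt` are made up for the example, and passing data through a file in the working directory is only one possible workaround, not something this proposal specifies.

```python
import os
import subprocess
import sys
import tempfile
from typing import Tuple


def run_step(code: str, cwd: str) -> str:
    """Run one job step in its own child process and return its stdout."""
    # Start every step from the same clean environment, without CONTAINER_ID.
    env = {k: v for k, v in os.environ.items() if k != "CONTAINER_ID"}
    result = subprocess.run(
        [sys.executable, "-c", code],
        cwd=cwd,
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def demo() -> Tuple[str, str]:
    """Return what a later step sees via the environment and via a file."""
    with tempfile.TemporaryDirectory() as workdir:
        # "script" step: records a container ID in its environment and in a file.
        run_step(
            "import os\n"
            "os.environ['CONTAINER_ID'] = 'abc123'\n"
            "with open('container_id.txt', 'w') as f:\n"
            "    f.write('abc123')\n",
            cwd=workdir,
        )
        # "after_script" steps: fresh processes sharing only the working directory.
        from_env = run_step(
            "import os; print(os.environ.get('CONTAINER_ID', ''))", cwd=workdir
        )
        from_file = run_step(
            "print(open('container_id.txt').read())", cwd=workdir
        )
        return from_env, from_file
```

In this sketch the environment variable set by the first step is not visible to the later one, while the file is. Whether a cancelled job's working directory is still intact when after_script runs depends on the runner changes discussed above.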
Frontend
The main Cancel
Permissions and Security
Documentation
Testing
- Unit testing for the state transitions
- Integration testing? Most of the intended effects will be executed by the runner.
What does success look like, and how can we measure that?
Folks can automatically tear down Docker resources when they cancel a job.