Introduce graceful-cancellation state for CI Jobs

Problem to solve

Currently, cancelling a job tells the runner to immediately drop everything and stop executing the current job. This means that after_script does not run for the job, and there is no opportunity to clean up resources that were created during execution.

Intended users

This problem might apply to anyone cancelling a pipeline: manually via the web UI, through the API, or implicitly by kicking off a pipeline while a running pipeline on the same ref has interruptible: true jobs.

Further details

This is a specific change that we can make to support usage of job:after_script #15603 (closed), which has broad support in the community. It will not completely address that issue, because changes are also needed in the runner to halt execution of script and continue on to executing after_script.

gitlab-runner#4843 (closed)

Once this is implemented, we can implement the same action on the pipeline level, which would trigger a graceful cancellation of all associated jobs.

Gracefully cancel entire pipeline: #35358 (closed)

Proposal

Backend

We should add a graceful-cancellation state for running CI jobs. When the runner sees a job in this state, it would stop executing the before_script and script sections of the job, but move on to executing the after_script portion. This would be considered an active, running state for Ci::Build. When after_script is finished executing, the runner will still have time to upload artifacts and the job will enter the existing "canceled" state as it does today.
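The runner-side flow described above can be sketched roughly as follows. This is only an illustration, not the runner's actual implementation: the function name run_job, the (section, command) step representation, and the use of a threading.Event as the cancellation signal are all assumptions made for the example. It also only checks for cancellation between steps, whereas a real runner would need to interrupt a command that is already running.

```python
import threading
from typing import Callable, List, Tuple

# Hypothetical representation: each step is (section name, command to run).
Step = Tuple[str, Callable[[], None]]


def run_job(steps: List[Step], cancel_requested: threading.Event) -> Tuple[List[str], str]:
    """Run steps in order, skipping before_script/script once cancellation is requested.

    Returns the sections that actually ran and the final job status.
    """
    executed: List[str] = []
    for section, command in steps:
        # After a graceful cancel, only after_script steps are still allowed to run.
        if cancel_requested.is_set() and section != "after_script":
            continue
        command()
        executed.append(section)
    # Simplification: command failures are ignored; only cancellation is modelled.
    final_status = "canceled" if cancel_requested.is_set() else "success"
    return executed, final_status


if __name__ == "__main__":
    cancel = threading.Event()
    job_steps: List[Step] = [
        ("before_script", lambda: None),
        ("script", cancel.set),        # simulates a cancel arriving during script
        ("script", lambda: None),      # skipped because cancellation was requested
        ("after_script", lambda: None),
    ]
    print(run_job(job_steps, cancel))
```

In this sketch the second script step is skipped, after_script still runs, and the job ends in the existing canceled state.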

If we want to transition to some other state based on the success or failure of after_script during a graceful shutdown, that is also possible, but it is not currently part of this proposal.

If a job is in either the created or pending state, it should transition directly to canceled, as no execution has happened yet.
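The transitions proposed in this section could be modelled along these lines. This is a sketch only, not Ci::Build's state machine: the state name "canceling" and the function names request_cancel and finish_after_script are placeholders invented for the example, since the proposal does not name the new state.

```python
from enum import Enum


class Status(Enum):
    CREATED = "created"
    PENDING = "pending"
    RUNNING = "running"
    CANCELING = "canceling"  # hypothetical name for the graceful-cancellation state
    CANCELED = "canceled"


# The new state counts as active/running, as described in the proposal.
ACTIVE_STATUSES = {Status.RUNNING, Status.CANCELING}


def request_cancel(status: Status) -> Status:
    """Apply a cancel request to a job in the given status."""
    if status in (Status.CREATED, Status.PENDING):
        # Nothing has executed yet, so there is nothing to clean up.
        return Status.CANCELED
    if status is Status.RUNNING:
        # Let the runner move on to after_script before the job is canceled.
        return Status.CANCELING
    # Already canceling or canceled: the request changes nothing.
    return status


def finish_after_script(status: Status) -> Status:
    """Called when after_script (and artifact upload) has finished during a graceful cancel."""
    if status is not Status.CANCELING:
        raise ValueError(f"cannot finish graceful cancel from {status.value}")
    return Status.CANCELED
```

Keeping the new state in the active set means anything that treats running jobs specially would also cover a job that is still executing after_script.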

Concern: the configuration documentation specifies that after_script is run in a separate shell context. There's emphasis in the original issue that after_script execution needs to be tightly coupled with the job script specifically to have access to data (e.g. docker container IDs) from the script being cancelled. Is this going to be a problem? If so, is there any way to change that?
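To make the concern concrete, the sketch below imitates two separate shell contexts with two independent child processes. It rests on an assumption that should be verified against the runner: that script and after_script share the same build directory, so a file written by one is readable by the other. The variable name GRACEFUL_DEMO_CONTAINER_ID, the file name container_ids.txt, and the ID abc123 are made up for the example.

```python
import subprocess
import sys
import tempfile
from typing import List


def run_in_fresh_process(code: str, cwd: str) -> str:
    """Run Python code in a new child process, standing in for a separate shell context."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in for the job script: records a container ID two different ways.
SCRIPT = """
import os
os.environ["GRACEFUL_DEMO_CONTAINER_ID"] = "abc123"  # lost when this process exits
with open("container_ids.txt", "w") as f:            # persists in the build directory
    f.write("abc123")
"""

# Stand-in for after_script: tries to read the ID back both ways.
AFTER_SCRIPT = """
import os
print(os.environ.get("GRACEFUL_DEMO_CONTAINER_ID", "<unset>"))
print(open("container_ids.txt").read())
"""


def demo() -> List[str]:
    """Return what the after_script stand-in sees: [value from env, value from file]."""
    with tempfile.TemporaryDirectory() as build_dir:
        run_in_fresh_process(SCRIPT, build_dir)
        return run_in_fresh_process(AFTER_SCRIPT, build_dir).splitlines()


if __name__ == "__main__":
    print(demo())
```

If the shared-directory assumption holds, writing IDs to a file from script is one possible workaround, although it does not help when the job is cancelled before the file is written.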

Frontend

The main Cancel button on the Pipelines page should be a button for a graceful shutdown. Based on community feedback, this seems to be the best choice for most Pipeline cancellations. We can also build a hard-cancel/abort option, which would execute exactly what a cancellation does today - immediately stop processing the current job script.

Permissions and Security

Documentation

Testing

  1. Unit testing for the state transitions
  2. Integration testing? Most of the intended effects will be executed by the runner.

What does success look like, and how can we measure that?

Folks can automatically tear down Docker resources when they cancel a job.

What is the type of buyer?

Links / references

Edited by Steve Xuereb