Backend: Job ignores previous stage if cancelled and retried
Summary
A pipeline that uses stages to ensure deployment cannot happen without passing tests can be tricked. As a consequence, deployments/builds are possible even if tests are failing.
In a stage-based pipeline where `deploy-job` is supposed to depend on `test-job` via the implicit stage dependencies, it is still possible to retry `deploy-job` if you cancel it while `test-job` is running. Retrying `deploy-job` in this case simply inserts it into the running queue, disregarding the status of `test-job`.
Steps to reproduce
- Have a pipeline similar to this:

```yaml
test:
  stage: test
  script:
    - # Tests are running here

deploy:
  stage: deploy
  script:
    - # Building and deploying to production
```
- While the test job runs, cancel the deploy job.
- Now click "Retry" on the deploy job.
- The job is directly picked by a runner.
- If `dependencies` were set, the job does fail due to missing dependencies. This is ugly, but not a dramatic failure.
- If no `dependencies` were set, the job just runs through. This is dramatic.

This can happen not only if you are a rogue user but also inadvertently, e.g. you accidentally cancelled a job and then want to have it back in the pipeline.
Example Project
This CI configuration https://gitlab.com/PhilLab/testing-private/-/blob/a6514903e34322dc92319fec4fad33e66b3c2259/.gitlab-ci.yml leads to a successful deployment even though the tests failed.
What is the current bug behavior?
Stages that are meant to block subsequent jobs can be skipped. When a job is retried, it does not respect the sequence of stages: it executes without waiting for the jobs in the preceding stages to complete successfully.
What is the expected correct behavior?
Retrying the job should not start it directly but should insert it back into the usual processing queue.
Expected behavior as noted in the docs:
- If all jobs in a stage succeed, the pipeline moves on to the next stage.
- If any job in a stage fails, the next stage is not (usually) executed and the pipeline ends early.
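The documented behavior above can be sketched in plain Ruby (a toy model, not GitLab code): a retried job should go back to `created`, and a processing pass should only move it to `pending` once every job in the earlier stages has succeeded.

```ruby
# Toy model of the documented stage semantics (not GitLab code).
Job = Struct.new(:name, :stage_idx, :status)

# One processing pass: a created job becomes pending only when all
# jobs in earlier stages have succeeded.
def process!(jobs)
  jobs.each do |job|
    next unless job.status == :created

    earlier = jobs.select { |j| j.stage_idx < job.stage_idx }
    job.status = :pending if earlier.all? { |j| j.status == :success }
  end
end

# Expected retry behavior: reset the job and let processing decide,
# instead of handing it straight to a runner.
def retry_job(job, jobs)
  job.status = :created
  process!(jobs)
end

test   = Job.new("test",   0, :running)
deploy = Job.new("deploy", 1, :canceled)
jobs   = [test, deploy]

retry_job(deploy, jobs)
puts deploy.status  # created: still waiting on the test stage

test.status = :success
process!(jobs)
puts deploy.status  # pending: now eligible to be picked by a runner
```

The bug is that the current retry path skips the `process!` step and enqueues the job immediately.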
Proposal
Solve the problem in the backend by not enqueueing the job directly, and instead use `Ci::ProcessPipelineService`, which looks at the whole pipeline. If the `deploy` job meets its dependencies it will run; otherwise it will remain in the `created` state until the `test` job completes. https://gitlab.com/gitlab-org/gitlab/-/issues/352858
In !83730 (diffs, comment 915332037) we discussed that we should use `Ci::PipelineCreation::StartPipelineService` (to be renamed to `Ci::Pipelines::StartService`), since this service creates a persistent ref (a pipeline run prerequisite) and runs `Ci::ProcessPipelineService`.
In fact, we should use `StartPipelineService` not only in `PlayBuildService` but also in `RetryJobService` and `RetryPipelineService`; basically, any time we resume the pipeline execution.
Once we do that, we should also remove the creation of `persistent_ref` in the `Ci::Build` state transition, since the `persistent_ref` will be consistently created when we start/resume the pipeline execution and deleted when the pipeline completes.
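The proposed flow could look roughly like the following. This is a toy sketch, not GitLab source: `StartService`, `ProcessService`, and `RetryJobService` are stand-ins for the real `Ci::*` services named in this issue, jobs are simplified to one per stage (ordered by stage), and the persistent ref is reduced to a single attribute.

```ruby
# Toy sketch (NOT GitLab source): every path that resumes pipeline
# execution (play, retry job, retry pipeline) goes through a single
# start service, which ensures the persistent ref exists and then runs
# pipeline processing, instead of enqueueing the retried job directly.

class Pipeline
  attr_accessor :persistent_ref
  attr_reader :jobs

  def initialize(jobs)
    @jobs = jobs            # e.g. [{ name: "test", status: :running }, ...]
    @persistent_ref = nil
  end
end

# Stand-in for Ci::PipelineCreation::StartPipelineService.
class StartService
  def execute(pipeline)
    pipeline.persistent_ref ||= "refs/pipelines/1" # created once per run
    ProcessService.new.execute(pipeline)           # runs pipeline processing
  end
end

# Stand-in for Ci::ProcessPipelineService: only enqueue a created job
# when every job in the preceding stages has succeeded.
class ProcessService
  def execute(pipeline)
    pipeline.jobs.each_with_index do |job, idx|
      next unless job[:status] == :created

      prior_ok = pipeline.jobs[0...idx].all? { |j| j[:status] == :success }
      job[:status] = :pending if prior_ok
    end
  end
end

# Stand-in for the retry service: the retried job is reset to :created
# and the pipeline is resumed via StartService, rather than the job
# being handed straight to a runner.
class RetryJobService
  def execute(pipeline, job)
    job[:status] = :created
    StartService.new.execute(pipeline)
  end
end
```

With this shape, the retried `deploy` job stays in `created` while `test` is still running, and the persistent ref is owned by the start service instead of the `Ci::Build` state transition.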
Can the same bug also exist with scheduled jobs (`when: delayed`)? If so, we would need to run `StartPipelineService` in `Ci::RunScheduledBuildService` as well.
Output of checks
This bug happens on GitLab.com
Possible related issues
I can't link to fixes, but the fix might be in the same realm as some of the other retry bugs.
Testing Consideration
In terms of testing, let's include a test at the integration level with the FE to cover this exact scenario:
- Create a pipeline with 2 stages.
- Run the pipeline.
- Cancel the first stage.
- Retry the second stage.
- Validate that retrying the second stage fails and returns an error.