Backend: Job ignores previous stage if cancelled and retried
Summary
A pipeline that uses stages to ensure deployment cannot happen without passing tests can be tricked. As a consequence, deployments/builds are possible even if tests are failing.
In a stage-based pipeline where `deploy-job` is supposed to depend on `test-job` via the implicit stage dependencies, it is still possible to retry `deploy-job` if you cancel it while `test-job` is running. Retrying `deploy-job` in this case simply inserts it into the running queue, disregarding the status of `test-job`.
Steps to reproduce
- Have a pipeline similar to this:

```yaml
test:
  stage: test
  script:
    - # Tests are running here

deploy:
  stage: deploy
  script:
    - # Building and deploying to production
```
- While the test job runs, cancel the deploy job.
- Now click "Retry" on the deploy job.
- The job is directly picked by a runner.
- If `dependencies` were set, the job does fail due to missing dependencies. This is ugly, but not a dramatic failure.
- If no `dependencies` were set, the job just runs through. This is dramatic.

This can happen not only if you are a rogue user but also inadvertently, e.g. you accidentally cancelled a job and then want to have it back in the pipeline.
Example Project
This CI configuration https://gitlab.com/PhilLab/testing-private/-/blob/a6514903e34322dc92319fec4fad33e66b3c2259/.gitlab-ci.yml leads to a successful deployment even though the tests failed.
What is the current bug behavior?
Stages that are meant to block subsequent jobs can be skipped. When a job is retried, it does not respect the sequence of stages: it executes without waiting for the jobs in the preceding stages to complete successfully.
What is the expected correct behavior?
Retrying the job should not start it directly but should insert it back into the usual processing queue.
Expected behavior as noted in the docs:
- If all jobs in a stage succeed, the pipeline moves on to the next stage.
- If any job in a stage fails, the next stage is not (usually) executed and the pipeline ends early.
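The documented behavior above can be sketched in plain Ruby (a toy model, not GitLab code): a retried job should go back to `created`, and a processing pass should only move it to `pending` once every job in the earlier stages has succeeded.

```ruby
# Toy model of the documented stage semantics (not GitLab code).
Job = Struct.new(:name, :stage_idx, :status)

# One processing pass: a created job becomes pending only when all
# jobs in earlier stages have succeeded.
def process!(jobs)
  jobs.each do |job|
    next unless job.status == :created

    earlier = jobs.select { |j| j.stage_idx < job.stage_idx }
    job.status = :pending if earlier.all? { |j| j.status == :success }
  end
end

# Expected retry behavior: reset the job and let processing decide,
# instead of handing it straight to a runner.
def retry_job(job, jobs)
  job.status = :created
  process!(jobs)
end

test   = Job.new("test",   0, :running)
deploy = Job.new("deploy", 1, :canceled)
jobs   = [test, deploy]

retry_job(deploy, jobs)
puts deploy.status  # created: still waiting on the test stage

test.status = :success
process!(jobs)
puts deploy.status  # pending: now eligible to be picked by a runner
```

The bug is that the current retry path skips the `process!` step and enqueues the job immediately.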
Proposal
Solve the problem in the backend by not enqueueing the job directly, and instead use `Ci::ProcessPipelineService`, which looks at the whole pipeline. If the `deploy` job meets its dependencies it will run; otherwise it will remain in the `created` state until the `test` job completes. https://gitlab.com/gitlab-org/gitlab/-/issues/352858
In !83730 (diffs, comment 915332037) we discussed that we should use `Ci::PipelineCreation::StartPipelineService` (to be renamed to `Ci::Pipelines::StartService`), since this service creates a persistent ref (a pipeline run prerequisite) and runs `Ci::ProcessPipelineService`.
In fact, we should use `StartPipelineService` not only in `PlayBuildService` but also in `RetryJobService` and `RetryPipelineService`; basically, any time we resume the pipeline execution.
Once we do that, we should also remove the creation of `persistent_ref` in the `Ci::Build` state transition, since the `persistent_ref` will be consistently created when we start/resume the pipeline execution and deleted when the pipeline completes.
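The proposed flow could look roughly like the following. This is a toy sketch, not GitLab source: `StartService`, `ProcessService`, and `RetryJobService` are stand-ins for the real `Ci::*` services named in this issue, jobs are simplified to one per stage (ordered by stage), and the persistent ref is reduced to a single attribute.

```ruby
# Toy sketch (NOT GitLab source): every path that resumes pipeline
# execution (play, retry job, retry pipeline) goes through a single
# start service, which ensures the persistent ref exists and then runs
# pipeline processing, instead of enqueueing the retried job directly.

class Pipeline
  attr_accessor :persistent_ref
  attr_reader :jobs

  def initialize(jobs)
    @jobs = jobs            # e.g. [{ name: "test", status: :running }, ...]
    @persistent_ref = nil
  end
end

# Stand-in for Ci::PipelineCreation::StartPipelineService.
class StartService
  def execute(pipeline)
    pipeline.persistent_ref ||= "refs/pipelines/1" # created once per run
    ProcessService.new.execute(pipeline)           # runs pipeline processing
  end
end

# Stand-in for Ci::ProcessPipelineService: only enqueue a created job
# when every job in the preceding stages has succeeded.
class ProcessService
  def execute(pipeline)
    pipeline.jobs.each_with_index do |job, idx|
      next unless job[:status] == :created

      prior_ok = pipeline.jobs[0...idx].all? { |j| j[:status] == :success }
      job[:status] = :pending if prior_ok
    end
  end
end

# Stand-in for the retry service: the retried job is reset to :created
# and the pipeline is resumed via StartService, rather than the job
# being handed straight to a runner.
class RetryJobService
  def execute(pipeline, job)
    job[:status] = :created
    StartService.new.execute(pipeline)
  end
end
```

With this shape, the retried `deploy` job stays in `created` while `test` is still running, and the persistent ref is owned by the start service instead of the `Ci::Build` state transition.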
Can the same bug also exist with scheduled jobs (`when: delayed`)? If so, we would need to run `StartPipelineService` in `Ci::RunScheduledBuildService` as well.
Output of checks
This bug happens on GitLab.com
Possible related issues
I can't link to fixes, but the fix might be in the same realm as some of the other retry bugs.
Testing Consideration
In terms of testing, let's include a test at the integration level with the FE to cover this exact scenario:
- Create a pipeline with 2 stages.
- Run the pipeline.
- Cancel the first stage.
- Retry the second stage.
- Validate that retrying the second stage fails and returns an error.