Make the retry button more reliable
Release notes
When the retry button is pressed at the pipeline or job level, the pipeline retries the correct components to ensure a successful end result.
Problem to solve
As a user, I want to be able to press the retry button and expect the whole pipeline to function properly. Currently, the pipeline retry button only re-runs the failed jobs. That's not totally unexpected, but it often does not work, because the output of earlier jobs may already have been cleaned up.
For example, our pipeline does the following:
- Create a local Docker container
- Build the software in the local Docker container
- On success, push the container to the registry with a known tag
- Clean up the local container (always)
This works great, but if the build or push stage fails (500 errors on GitLab, for example), the job fails, and because we don't want the temporary container lingering around, it gets cleaned up by a job with a 'when: always' clause. This means that if we retry the pipeline, only the failed jobs are retried, and they now fail because the local container (which is named sha:pipelineid to make it unique per pipeline) is gone. A minimal sketch of such a pipeline is shown below.
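For reference, a minimal '.gitlab-ci.yml' sketch of this kind of pipeline could look like the following. The job and image names are purely illustrative (not our actual configuration), and it assumes all jobs run against the same Docker daemon (e.g. a shell runner):

```yaml
stages:
  - prepare
  - build
  - push
  - cleanup

prepare_container:
  stage: prepare
  script:
    # Start a temporary container named after the pipeline, so it is unique per pipeline
    - docker run -d --name build-$CI_PIPELINE_ID my-base-image:latest sleep infinity

build_software:
  stage: build
  script:
    - docker exec build-$CI_PIPELINE_ID make all

push_container:
  stage: push
  script:
    - docker commit build-$CI_PIPELINE_ID registry.example.com/my-app:$CI_COMMIT_SHORT_SHA
    - docker push registry.example.com/my-app:$CI_COMMIT_SHORT_SHA

cleanup_container:
  stage: cleanup
  when: always   # runs even if build or push failed, so the temporary container never lingers
  script:
    - docker rm -f build-$CI_PIPELINE_ID
```

If 'build_software' or 'push_container' fails, 'cleanup_container' still runs, so a later retry of only the failed job has no container left to work with.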
Intended users
Personas are described at https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/
- Delaney (Development Team Lead)
- Sasha (Software Developer)
- Devon (DevOps Engineer)
- Rachel (Release Manager)
- Simone (Software Engineer in Test)
User experience goal
As a user, I'm happy that when I push the retry button, the right thing happens and I don't have to figure out what the right thing is.
Proposal
I think there are two things to consider here. For one, the retry button on the pipeline could always re-run the whole failed pipeline. That's the quickest and simplest option, but it means we re-run potentially long-running jobs that were fine before. Then again, this button is at the pipeline level, not the job level, and we already have separate retry buttons on individual jobs.
This could be made more fine-grained by allowing a 'pipeline finished' marker to be set in the YAML file: 'if this job finishes, consider the pipeline to have completed (either positively or negatively, but completed)'. The retry button could then use this to do the right thing, because knowing that the pipeline was completed (with a cleanup job) means we cannot simply restart individual jobs. We might even make this marker simply a stage: if all jobs of the stage 'complete' finish successfully, the pipeline is considered done. (In that light, I also think we want a way to mark a job as 'must always be run, regardless of whether the pipeline is being cancelled or not', but that's more of a 'rules' thing. :)
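To make the idea concrete, a hypothetical syntax for such a marker could look like the sketch below. The 'finalizes_pipeline' keyword does not exist in GitLab CI today; it is only meant to illustrate the proposal:

```yaml
cleanup_container:
  stage: cleanup
  when: always
  # Hypothetical keyword proposed in this issue: once this job has run,
  # the pipeline is considered completed, and a plain per-job retry
  # should instead re-run the pipeline from the start.
  finalizes_pipeline: true
  script:
    - docker rm -f build-$CI_PIPELINE_ID
```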
Finally, IF this is already possible with the 'needs' keyword, it needs to be documented more clearly, e.g. 'remember to mark all your jobs with needs: docker_prepare if they cannot be re-run without it'. But it feels a bit like abusing the 'needs' field, even though from a naming perspective it makes sense. Is it fair to say that the cleanup step (the last step) 'needs' the prepare step (the first step)? I was under the impression that 'needs' was mostly about the immediately preceding step...
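For reference, this is roughly what that annotation would look like with the existing 'needs' keyword, using the illustrative job names from the sketch above. Note that this only expresses the dependency; whether retrying one of these jobs would then also re-run 'prepare_container' is exactly the open question:

```yaml
build_software:
  stage: build
  needs: ["prepare_container"]   # cannot be re-run if the temporary container is gone
  script:
    - docker exec build-$CI_PIPELINE_ID make all

cleanup_container:
  stage: cleanup
  when: always
  needs: ["prepare_container"]   # feels like abusing 'needs': the last step 'needing' the first
  script:
    - docker rm -f build-$CI_PIPELINE_ID
```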