Self-heal pipelines stuck with an executing status with no executing builds
Problem
Rarely, we encounter issues where all builds in a pipeline have finished, but the PipelineProcessWorker never executes to update the pipeline status to match its builds. This leaves pipelines in an inconsistent state where they appear to be running despite all constituent jobs being complete.
The most recent example arose from an incident and a bug in the concurrency_limit functionality on Sidekiq jobs:
- https://gitlab.com/gitlab-org/gitlab/-/work_items/580466
- gitlab-com/gl-infra/production#20833 (closed)
Previous discussions about self-healing these issues have been blocked by the challenge of efficiently finding pipelines with all finished builds across the entire GitLab instance.
Solution
Implement a "stuck pipeline worker" similar to the existing stuck build worker functionality.
The worker would:
- Identify pipelines in an executing state that have stale updated_at timestamps
- Re-trigger PipelineProcessWorker for these pipelines
- Use the fact that a pipeline's updated_at timestamp refreshes each time a build changes status (due to the processed field changing)
- Target pipelines that haven't been updated for many hours
  - Ideally pipelines will self-heal as soon as possible (minutes). However, as we reduce this threshold we need to ensure that we a) only target pipelines that are genuinely stuck, and b) can query for stuck pipelines efficiently under incident/degraded throughput conditions. We do not want to end up in a situation where the recovery worker is overloaded by too many stuck pipelines and falls behind, as then we are back to square one.
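The selection logic above can be sketched in plain Ruby. This is a minimal illustration under stated assumptions, not the proposed implementation: the Pipeline struct, the set of executing statuses, the four-hour threshold, and the batch limit are all hypothetical, and the block passed to reprocess_stuck stands in for enqueuing PipelineProcessWorker.

```ruby
# Hypothetical stand-in for a pipeline record; not GitLab's Ci::Pipeline model.
Pipeline = Struct.new(:id, :status, :updated_at, keyword_init: true)

# Assumed set of "executing" statuses for this sketch.
EXECUTING_STATUSES = %w[running pending].freeze

# Assumed values: the proposal only says "many hours" and leaves the threshold open.
STALE_AFTER_SECONDS = 4 * 60 * 60
BATCH_LIMIT = 2

# Returns executing pipelines whose updated_at is older than the cutoff,
# oldest first, capped so one run cannot grow without bound.
def stuck_pipelines(pipelines, now:, stale_after: STALE_AFTER_SECONDS, limit: BATCH_LIMIT)
  cutoff = now - stale_after
  pipelines
    .select { |p| EXECUTING_STATUSES.include?(p.status) && p.updated_at < cutoff }
    .sort_by(&:updated_at)
    .first(limit)
end

# Calls the given block once per stuck pipeline id and returns the ids handled.
# In a real worker the block would enqueue the reprocessing job.
def reprocess_stuck(pipelines, now:, &enqueue)
  stuck_pipelines(pipelines, now: now).each { |p| enqueue.call(p.id) }.map(&:id)
end

now = Time.utc(2025, 1, 1, 12, 0, 0)
pipelines = [
  Pipeline.new(id: 1, status: "running", updated_at: now - 6 * 3600), # stale
  Pipeline.new(id: 2, status: "running", updated_at: now - 60),       # recently updated
  Pipeline.new(id: 3, status: "success", updated_at: now - 9 * 3600), # already finished
  Pipeline.new(id: 4, status: "pending", updated_at: now - 8 * 3600), # stale, oldest
  Pipeline.new(id: 5, status: "running", updated_at: now - 5 * 3600)  # stale, over the cap
]

enqueued = []
reprocess_stuck(pipelines, now: now) { |id| enqueued << id }
puts enqueued.inspect
```

Capping each run and taking the oldest pipelines first is one way to keep the recovery worker bounded during an incident; the remaining stuck pipelines would be picked up by later runs.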
This approach is safe to implement since running PipelineProcessWorker on an actively running pipeline will simply reflect the current status of the pipeline's jobs without causing harm. The worker is idempotent.
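To illustrate why reprocessing is harmless, here is a small sketch of the idempotency property. The derive_status rule is a deliberately simplified assumption and is not GitLab's actual status calculation; the hash-based pipeline and the process method are hypothetical stand-ins.

```ruby
# Simplified, assumed rule for deriving a pipeline status from job statuses.
# GitLab's real status logic is more involved; this only demonstrates idempotency.
def derive_status(job_statuses)
  return "running" if job_statuses.any? { |s| %w[pending running].include?(s) }
  return "failed" if job_statuses.include?("failed")

  "success"
end

# Hypothetical stand-in for processing a pipeline: it only recomputes the
# pipeline status from the current job statuses and changes nothing else.
def process(pipeline)
  pipeline[:status] = derive_status(pipeline[:jobs])
  pipeline
end

# A stuck pipeline: still marked running although every job has finished.
stuck = { status: "running", jobs: %w[success success] }
# A healthy pipeline that is genuinely still running.
active = { status: "running", jobs: %w[success running] }

process(stuck)
process(stuck)  # a second run changes nothing further
process(active) # an active pipeline keeps its status

puts stuck[:status]
puts active[:status]
```

Because the status is recomputed purely from the jobs' current state, a false positive in the stuck-pipeline query costs one redundant run rather than a wrong status.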
Things to think about
- We can't always just allow the pipeline to be finished by retrying jobs. Some jobs should not be executed even though they haven't run yet. Think of deployment jobs that need to execute within a certain window: we can't just execute them haphazardly.
- When using the Ci::CancelPipelineService, we also need to update the pipeline itself. All this service does is update the jobs, so if there are no jobs that can be canceled, the pipeline status would not change.
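The cancellation caveat can be sketched as follows. Everything here is hypothetical: cancel_jobs is a stand-in for a job-only cancellation step (not the real Ci::CancelPipelineService API), the cancelable statuses are assumed, and setting the pipeline to "canceled" is just one possible way to update the pipeline itself.

```ruby
# Hypothetical stand-ins; not GitLab's models.
Job = Struct.new(:name, :status, keyword_init: true)
StuckPipeline = Struct.new(:status, :jobs, keyword_init: true)

# Assumed set of job statuses that can still be canceled.
CANCELABLE_JOB_STATUSES = %w[created pending running].freeze

# Job-only cancellation: touches jobs, never the pipeline.
# Returns the number of jobs that were canceled.
def cancel_jobs(pipeline)
  cancelable = pipeline.jobs.select { |j| CANCELABLE_JOB_STATUSES.include?(j.status) }
  cancelable.each { |j| j.status = "canceled" }
  cancelable.size
end

# Cancels jobs and, when no job could be canceled, updates the pipeline
# explicitly. When jobs were canceled, this sketch assumes their status
# changes trigger the usual pipeline update elsewhere.
def cancel_stuck_pipeline(pipeline)
  canceled = cancel_jobs(pipeline)
  pipeline.status = "canceled" if canceled.zero? && pipeline.status == "running"
  canceled
end

# A stuck pipeline whose jobs have all finished: nothing is cancelable.
pipeline = StuckPipeline.new(
  status: "running",
  jobs: [Job.new(name: "build", status: "success"), Job.new(name: "test", status: "failed")]
)

cancel_jobs(pipeline)
puts pipeline.status # still "running": job-only cancellation left the pipeline untouched

cancel_stuck_pipeline(pipeline)
puts pipeline.status
```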