Improve retry mechanism when objects are stale for pipeline cancellation
What does this MR do and why?
Related to #382065 (closed) - Closing a merge train MR doesn't cancel pipelines that contain child pipelines
This code improves the reliability of canceling CI/CD pipelines by adding better error handling and retry logic.
The main changes include:
- Before this MR: We `retry_lock` around each pipeline within the service. The `retry_lock` applies a transaction around each pipeline's job update work, so that if 3 jobs fail to update, it rolls back the whole pipeline and stops executing the service. This could leave child pipelines un-canceled.
- After this MR: We `retry_lock` around each job batch, because that shortens the transaction and lets each batch retry individually if there is another job update going on.
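
As a rough illustration of the per-batch retry scope described above, here is a minimal plain-Ruby sketch. It is not the code from this MR: `StaleObjectError`, `with_stale_retry`, `Job`, and `cancel_in_batches` are hypothetical stand-ins so the example runs without Rails, and it models only the retry scope, not the database transaction or rollback.

```ruby
# Hypothetical stand-in for a stale-object error, so the sketch runs without Rails.
class StaleObjectError < StandardError; end

# Hypothetical helper: re-runs the block a bounded number of times
# when a stale object is detected, then re-raises.
def with_stale_retry(max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue StaleObjectError
    retry if attempts < max_attempts
    raise
  end
end

# Hypothetical job: `stale_failures` is how many times cancel! raises
# before it succeeds, simulating a concurrent update.
Job = Struct.new(:name, :stale_failures, :status) do
  def cancel!
    if stale_failures > 0
      self.stale_failures -= 1
      raise StaleObjectError, "#{name} was updated concurrently"
    end
    self.status = :canceled
  end
end

# Cancels jobs batch by batch. Each batch has its own retry scope,
# so a conflict only repeats the work of that one batch.
def cancel_in_batches(jobs, batch_size: 2)
  jobs.each_slice(batch_size) do |batch|
    with_stale_retry do
      batch.each { |job| job.cancel! unless job.status == :canceled }
    end
  end
end

jobs = [
  Job.new("build", 0, :running),
  Job.new("test", 1, :running),   # conflicts once, then succeeds
  Job.new("deploy", 0, :running),
]
cancel_in_batches(jobs)
```

In this run, the single conflict on `test` repeats only the first batch; the `deploy` batch is unaffected. If a batch keeps conflicting past `max_attempts`, the error still propagates, so the sketch narrows the retry scope rather than hiding failures.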
These changes make the pipeline cancellation process more robust by handling race conditions and conflicts that can occur when multiple users or automated systems try to update the same jobs simultaneously. The result is fewer failed cancellation attempts and more reliable cleanup of running build processes.
Logs
We see that Ci::CancelPipelineService can trigger the StaleObjectError
