Change drop_stuck fetching strategy to use each_batch on ci_pending_builds & fix stuck jobs bug
What does this MR do and why?
Context
In #595153 (closed), we discovered that drop_stuck is incorrectly using break instead of next when iterating through each pending build. This means it skips processing the remaining jobs as soon as it encounters one non-stuck job.
To fix this, we need to update it to next, however with the current fetch batching strategy, it causes an infinite loop if any of the pending jobs are non-stuck. See details in #595216 (closed)'s description.
First we considered updating the fetch batching strategy to use keyset pagination, however this proved to be difficult to optimize as there aren't any existing indexes on p_ci_builds that support it (and we can't add new ones). The database queries involved would perform very poorly.
Fortunately, we have an alternative route. There exists a ci_pending_builds table that is several orders of magnitude smaller than p_ci_builds and is not partitioned. ci_pending_builds is supposed to contain all current pending builds, where created_at effectively represents the time at which the job transitioned to pending. The table is kept in sync with p_ci_builds because we push/pop off records as the transitions happen.
This MR
-
Updates
drop_stuckto query fromci_pending_buildsand useeach_batch.- The new query also leverages partition pruning on p_ci_builds.
- An index is added to support the batching queries.
-
Fixes bug in #595153 (closed) because we're using next instead of break in the new batching strategy.
- This means more stuck jobs should be properly dropped after an hour instead of waiting to be dropped by
dropafter 24 hours.
- This means more stuck jobs should be properly dropped after an hour instead of waiting to be dropped by
The changes are behind a feature flag: drop_stuck_builds_from_ci_pending_builds_queue. Roll-out issue: #595334 (closed)
References
- Resolves Change drop_stuck batching strategy from loop-b... (#595216 - closed) and Fix jobs left in stuck state (#595153 - closed)
- Feature flag roll-out issue: [FF] `drop_stuck_builds_from_ci_pending_builds_... (#595334 - closed)
Database query plans
See comments.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #595216 (closed)