Change drop_stuck fetching strategy to use each_batch on ci_pending_builds & fix stuck jobs bug

What does this MR do and why?

Context

In #595153 (closed), we discovered that drop_stuck is incorrectly using break instead of next when iterating through each pending build. This means it skips processing the remaining jobs as soon as it encounters one non-stuck job.

To fix this, we need to update it to next, however with the current fetch batching strategy, it causes an infinite loop if any of the pending jobs are non-stuck. See details in #595216 (closed)'s description.

First we considered updating the fetch batching strategy to use keyset pagination, however this proved to be difficult to optimize as there aren't any existing indexes on p_ci_builds that support it (and we can't add new ones). The database queries involved would perform very poorly.

Fortunately, we have an alternative route. There exists a ci_pending_builds table that is several orders of magnitude smaller than p_ci_builds and is not partitioned. ci_pending_builds is supposed to contain all current pending builds, where created_at effectively represents the time at which the job transitioned to pending. The table is kept in sync with p_ci_builds because we push/pop off records as the transitions happen.

This MR

  1. Updates drop_stuck to query from ci_pending_builds and use each_batch.

    • The new query also leverages partition pruning on p_ci_builds.
    • An index is added to support the batching queries.
  2. Fixes bug in #595153 (closed) because we're using next instead of break in the new batching strategy.

    • This means more stuck jobs should be properly dropped after an hour instead of waiting to be dropped by drop after 24 hours.

The changes are behind a feature flag: drop_stuck_builds_from_ci_pending_builds_queue. Roll-out issue: #595334 (closed)

References

Database query plans

See comments.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #595216 (closed)

Edited by Narendran

Merge request reports

Loading