Change drop_stuck batching strategy from loop-based to paginated (#595216) · Issues · GitLab.org / GitLab

Change drop_stuck batching strategy from loop-based to paginated

The following discussion from !229441 should be addressed: - [ ] @lma-git started a [discussion](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/229441#note_3203662763): (+1 comment) > `@fabiopitino` > > Oof. This test currently fails because of a `stack level too deep` error, which actually reveals a bigger bug that was formerly covered up by [using **break**](https://gitlab.com/gitlab-org/gitlab/-/blob/91602cfa087c971c8e7d0e99cfe1c1c54b024a2f/app/services/ci/stuck_builds/drop_helpers.rb#L22). > > The issue lies with the [`fetch` loop logic](https://gitlab.com/gitlab-org/gitlab/-/blob/91602cfa087c971c8e7d0e99cfe1c1c54b024a2f/app/services/ci/stuck_builds/drop_helpers.rb#L29-41). This method implicitly relies on _all_ `pending` builds of each batch to be dropped so that eventually `jobs.empty?` becomes `true`. However, this doesn't happen with `drop_stuck` because non-stuck jobs remain in `pending` status, which means the `fetch` loop keeps re-querying them from the database. It loops back and forth between `fetch` and `drop_stuck` because `jobs.empty?` never resolves to `true`. > > Relevant code for reference: > > ```ruby > def drop_stuck(builds, failure_reason:) > fetch(builds) do |build| > next unless build.stuck? > > drop_build :stuck, build, failure_reason > end > end > > # rubocop: disable CodeReuse/ActiveRecord > def fetch(builds) > loop do > jobs = builds.includes(:tags, :runner, project: [:namespace, :route]) > .limit(BATCH_SIZE) > .to_a > > break if jobs.empty? > > jobs.each do |job| > Gitlab::ApplicationContext.with_context(project: job.project) { yield(job) } > end > end > end > # rubocop: enable CodeReuse/ActiveRecord > ``` > > > This bug was covered up because using **break** exits both loops entirely. So perhaps it was intentionally used to avoid this error, but it makes the `drop_stuck` logic inconsistent with its purpose. > > This behaviour is one of the main reasons I tend to discourage loop-based batching. Not only is it more easily subject to these kinds of bugs, it's not ideal from a database perspective because its [performance degrades with each subsequent loop](https://gitlab.com/gitlab-org/gitlab/-/work_items/544662). > > The solution is to use a keyset paginated loop for `fetch` instead. And we should do this before committing this MR (moving it into Draft). However, it's not so straightforward to implement because (a) many queries use these drop helpers and (b) each query must be evaluated for performance to ensure it has the proper index support. So we could start with a first iteration to only update `drop_stuck` with the new paginated loop logic (perhaps behind an FF to be safe, if deemed necessary). > > Wdyt? ## Proposal Change the `fetch` batching strategy from loop-based to paginated for `drop_stuck` only. This then allows us to fix the `drop_stuck` logic in https://gitlab.com/gitlab-org/gitlab/-/work_items/595153. ### Update Unfortunately, applying keyset pagination directly to the [`pending_builds`](https://gitlab.com/gitlab-org/gitlab/-/blob/efd93eb0cb75cf5ee6792c2a5601ff3f2021b1ce/app/services/ci/stuck_builds/drop_pending_service.rb#L31-37) query is difficult to optimize. And we're restricted from adding more indexes to `p_ci_builds`, so its performance would be very poor. However, we can take a different approach that avoids the performance challenges all together. Instead of fetching the builds from `p_ci_builds`, we can fetch them from `ci_pending_builds`. This way the fetching strategy can be simplified to use single-column batching on `id`. The queries on the table would be very efficient because `ci_pending_builds` is relatively small most of the time.

issue