Draft: Improve delete object specs

Erick Bajao requested to merge eb-improve-delete-object-specs into master

This MR depends on !129214 (merged) being merged first. Will rebase on master once the first MR is merged. For now, only look at the last commit.

These specs were extracted from the original MR (!128279 (merged)), which was reverted (!128995 (merged)) because these specs were failing intermittently.

After further investigation into what was causing the flakiness, it turns out to be inconsistent behavior of the query planner while loading the batch of records to update here:

module Ci
  class DeleteObjectsService
    # ...

    def load_next_batch
      # `find_by_sql` performs a write in this case and we need to wrap it in
      # a transaction to stick to the primary database.
      Ci::DeletedObject.transaction do
        Ci::DeletedObject.find_by_sql([next_batch_sql, new_pick_up_at: RETRY_IN.from_now])
      end
    end

    def next_batch_sql
      <<~SQL.squish
      UPDATE "ci_deleted_objects"
        SET "pick_up_at" = :new_pick_up_at
        WHERE "ci_deleted_objects"."id" IN (#{locked_object_ids_sql})
        RETURNING *
      SQL
    end

    # ...
  end
end

Actual query:

  TRANSACTION (0.2ms)  BEGIN

    UPDATE "ci_deleted_objects" SET "pick_up_at" = '2023-08-12 14:34:22.294497' WHERE "ci_deleted_objects"."id" IN (SELECT "ci_deleted_objects"."id" FROM "ci_deleted_objects" WHERE (pick_up_at < '2023-08-12 14:24:22.290522') ORDER BY "ci_deleted_objects"."pick_up_at" ASC LIMIT 100 FOR UPDATE SKIP LOCKED) RETURNING * 

  TRANSACTION (0.1ms)  COMMIT
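One workaround discussed in the StackOverflow thread referenced below (not necessarily what this MR implements) is to materialize the locking subquery in a CTE, so the planner evaluates the LIMIT exactly once instead of potentially re-evaluating the subquery per row. A hypothetical rewrite of the query above:

```sql
-- Hypothetical sketch: AS MATERIALIZED (PostgreSQL 12+) forces the CTE to be
-- evaluated once, so at most 100 ids are ever locked and updated.
WITH locked_objects AS MATERIALIZED (
  SELECT "ci_deleted_objects"."id"
  FROM "ci_deleted_objects"
  WHERE pick_up_at < '2023-08-12 14:24:22.290522'
  ORDER BY "ci_deleted_objects"."pick_up_at" ASC
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
UPDATE "ci_deleted_objects"
SET "pick_up_at" = '2023-08-12 14:34:22.294497'
WHERE "ci_deleted_objects"."id" IN (SELECT id FROM locked_objects)
RETURNING *
```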

In the case of the flaky specs:

# spec/services/ci/delete_objects_service_spec.rb
it 'limits the number of records removed' do
  stub_const("#{described_class}::BATCH_SIZE", 1)

  expect { execute }.to change { Ci::DeletedObject.count }.by(-1)
end

it 'removes records in order' do
  stub_const("#{described_class}::BATCH_SIZE", 1)

  execute

  expect { past_ready.reload }.to raise_error(ActiveRecord::RecordNotFound)
  expect(ready.reload.present?).to be_truthy
end

These specs were intermittently failing because, even though we stub BATCH_SIZE to 1, the resulting LIMIT 1 is not consistently respected by the UPDATE query. See https://stackoverflow.com/questions/75755863/limit-1-not-respected-when-used-with-for-update-skip-locked-in-postgres-14.

So in some cases, depending on which plan the query planner chooses based on the state of the test database, the query would delete more records than expected.
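To illustrate why materializing the picked ids makes the batch size deterministic, here is a plain-Ruby sketch (hypothetical, not the MR's actual fix): `FakeStore` stands in for the `ci_deleted_objects` table, and the ids are collected first, then updated, so the batch can never exceed BATCH_SIZE no matter how a planner would have evaluated a combined query.

```ruby
# Hypothetical illustration: a two-phase "pick then update" batch loader.
# FakeStore is an assumed stand-in for the ci_deleted_objects table,
# mapping id => pick_up_at (times modeled as plain numbers).
class FakeStore
  BATCH_SIZE = 1

  def initialize(rows)
    @rows = rows
  end

  # Phase 1: pick at most BATCH_SIZE due ids, oldest first
  # (mirrors ORDER BY pick_up_at ASC LIMIT n).
  def pick_ids(now)
    @rows.select { |_id, pick_up_at| pick_up_at < now }
         .sort_by { |_id, pick_up_at| pick_up_at }
         .first(BATCH_SIZE)
         .map(&:first)
  end

  # Phase 2: update only the already-materialized ids, so the batch
  # size is fixed before any update happens.
  def load_next_batch(now, retry_in)
    ids = pick_ids(now)
    ids.each { |id| @rows[id] = now + retry_in }
    ids
  end
end

store = FakeStore.new(1 => 10, 2 => 20, 3 => 999)
p store.load_next_batch(100, 600) # picks the oldest due row only
p store.load_next_batch(100, 600) # row 1 was rescheduled, so row 2 is next
```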

Edited by Erick Bajao
