Vary id list length in LooseForeignKeys::ProcessDeletedRecordsService

LooseForeignKeys::ProcessDeletedRecordsService currently has the following algorithm (a rough code sketch follows the list):

  1. Load up to 500 ids to process for the current table
  2. Run LooseForeignKeys::BatchCleanerService to process deletes on the target table for those ids.
  3. In LooseForeignKeys::BatchCleanerService, use LIMIT ... FOR UPDATE SKIP LOCKED to try to delete efficiently.
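
Roughly, the flow looks like the following minimal sketch using the pg gem (the method names, the tracking table and its columns, and the 1000-row inner limit are assumptions for illustration; the real services differ in detail):

```ruby
require "pg"

MAX_DELETED_RECORDS_PER_RUN = 500

def process_deleted_records(conn, table, column)
  # Step 1: load up to 500 tracked ids for the current target table from the
  # deleted-records tracking table (table/column names assumed here).
  ids = conn.exec_params(<<~SQL, [table]).column_values(0)
    SELECT primary_key_value
    FROM loose_foreign_keys_deleted_records
    WHERE fully_qualified_table_name = $1
    LIMIT #{MAX_DELETED_RECORDS_PER_RUN}
  SQL

  # Steps 2-3: the batch cleaner deletes referencing rows on the target table
  # for those ids, using LIMIT ... FOR UPDATE SKIP LOCKED (query shown below).
  batch_clean(conn, table, column, ids)
end

def batch_clean(conn, table, column, ids, limit = 1000)
  return if ids.empty?

  # table and column are assumed to be trusted identifiers; the ids themselves
  # are bound as query parameters.
  placeholders = ids.each_index.map { |i| "$#{i + 1}" }.join(", ")

  conn.exec_params(<<~SQL, ids)
    DELETE FROM #{table}
    WHERE id IN (
      SELECT id FROM #{table}
      WHERE #{column} IN (#{placeholders})
      LIMIT #{limit}
      FOR UPDATE SKIP LOCKED
    )
  SQL
end
```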

This works well in general: it limits the total number of records processed per run and the work done by each query in the batch cleaner service.

The batch cleaner produces queries of the form:

```sql
delete from target_table
where id in (
  select id from target_table
  where referenced_id in (<id list of length 500>)
  limit x
  for update skip locked
)
```

But when all of the following are true:

  1. The number of ids is large, close to the maximum of 500.
  2. The target table is relatively small.
  3. The cardinality of the referenced column in the target table is low enough that Postgres estimates the 500 ids match a large portion of the table, but high enough that the most-common-values statistics don't protect against this mis-estimate.

When all three hold, the query plan can degrade to a sequential scan even though the referenced column is properly indexed. For example, see https://gitlab.com/gitlab-com/request-for-help/-/issues/3588#note_2826467364.

This happens because Postgres expects to find enough rows matching the (many) ids early in a sequential scan of the table, so it assumes the LIMIT will let the scan stop quickly. When that assumption is wrong, the scan reads far more of the table than planned and query performance can be very bad.
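
One way to check whether a particular table is affected is to EXPLAIN the inner select with a full-size id list and look for a sequential scan. A hypothetical helper, building on the pg-based sketch above (target_table and referenced_id are the placeholder names from the query, not real objects):

```ruby
def explain_cleanup_select(conn, ids, limit = 1000)
  placeholders = ids.each_index.map { |i| "$#{i + 1}" }.join(", ")

  # Plain EXPLAIN (no ANALYZE) only plans the query, so nothing is locked here.
  plan = conn.exec_params(<<~SQL, ids).column_values(0).join("\n")
    EXPLAIN
    SELECT id FROM target_table
    WHERE referenced_id IN (#{placeholders})
    LIMIT #{limit}
    FOR UPDATE SKIP LOCKED
  SQL

  puts plan
  warn "plan flipped to a sequential scan" if plan.include?("Seq Scan on target_table")
end
```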

To defend against this, we can:

  1. Catch statement timeouts and record query duration for the delete.
  2. In response to a statement timeout or a delete taking more than 1s, reduce both the LIMIT in the delete and the number of ids considered (sketched below). This simplifies the query, helps avoid the plan flip, and still makes at least some progress on the LFK operation.
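
A hypothetical sketch of that defence, reusing batch_clean from the first sketch. The one-second threshold comes from point 2; retrying immediately with the smaller values, and the particular shrink factors, are assumptions rather than anything decided here:

```ruby
SLOW_DELETE_THRESHOLD = 1.0 # seconds

def batch_clean_with_backoff(conn, table, column, ids, limit = 1000)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  batch_clean(conn, table, column, ids, limit)
  duration = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  # Slow but successful: shrink both the id list and the LIMIT for another pass.
  if duration > SLOW_DELETE_THRESHOLD
    batch_clean(conn, table, column, ids.first([ids.size / 2, 1].max), [limit / 2, 1].max)
  end
rescue PG::QueryCanceled
  # statement_timeout fired: retry with a much shorter id list and LIMIT so the
  # run still makes some progress on the LFK queue.
  batch_clean(conn, table, column, ids.first(50), [limit / 10, 1].max)
end
```

Shrinking the id list, and not just the LIMIT, is what matters for the plan flip: the size of the IN list is what drives Postgres's over-estimate of how much of the table matches.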