Vary id list length in LooseForeignKeys::ProcessDeletedRecordsService
`LooseForeignKeys::ProcessDeletedRecordsService` currently has the following algorithm:

- Load up to 500 ids to process for the current table.
- Run `LooseForeignKeys::BatchCleanerService` to process deletes on the target table for those ids.
- In `LooseForeignKeys::BatchCleanerService`, use `LIMIT ... FOR UPDATE SKIP LOCKED` to try to delete efficiently.
This works great in general - it limits the total number of records processed, and it limits the work done by queries in the batch cleaner service.
It creates queries like:

```sql
delete from target_table
where id in (
  select id from target_table
  where referenced_id in (<id list of length 500>)
  limit x
  for update skip locked
)
```
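As a loose illustration of how such a per-batch delete might be issued from Ruby, here is a minimal sketch assuming raw SQL execution through ActiveRecord; `target_table`, `referenced_id`, `BATCH_LIMIT`, and the class name are placeholders, not GitLab's actual schema or the real service internals:

```ruby
# Illustrative sketch only: issues the kind of per-batch delete described above.
class LooseForeignKeyCleanupSketch
  BATCH_LIMIT = 1000 # placeholder for the inner subquery LIMIT ("x" above)

  def delete_batch(deleted_parent_ids)
    # Up to 500 parent ids loaded for the current table.
    id_list = deleted_parent_ids.first(500).map(&:to_i).join(', ')

    ApplicationRecord.connection.execute(<<~SQL)
      DELETE FROM target_table
      WHERE id IN (
        SELECT id FROM target_table
        WHERE referenced_id IN (#{id_list})
        LIMIT #{BATCH_LIMIT}
        FOR UPDATE SKIP LOCKED
      )
    SQL
  end
end
```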
But when all of the following are true:
- the number of ids is large, around the max of 500.
- the target table is relatively small.
- the cardinality of the referenced column in the target table is low enough that Postgres estimates the 500 ids match a large portion of the table, but high enough that the most-common-values list doesn't protect against this mis-estimate.
This can degrade to a sequential scan even with a proper index on the table. For example, see https://gitlab.com/gitlab-com/request-for-help/-/issues/3588#note_2826467364.
This occurs because Postgres expects to find enough rows matching the long id list early in a sequential scan of the table to satisfy the `limit`, since it estimates that the ids cover a large fraction of the table. When that estimate is wrong, the scan reads much more of the table than planned and query performance can be very bad.
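One way to see whether the plan has flipped is to `EXPLAIN` the same statement with a long and a short id list and compare the inner scan node. A rough sketch, assuming console access to an ActiveRecord connection; table and column names are placeholders, `deleted_parent_ids` stands in for the queued ids, and plain `EXPLAIN` (without `ANALYZE`) only plans the statement, so nothing is deleted:

```ruby
# Illustrative sketch: print the plan Postgres chooses for the cleanup delete.
def explain_cleanup_delete(ids, limit)
  id_list = ids.map(&:to_i).join(', ')

  result = ApplicationRecord.connection.execute(<<~SQL)
    EXPLAIN
    DELETE FROM target_table
    WHERE id IN (
      SELECT id FROM target_table
      WHERE referenced_id IN (#{id_list})
      LIMIT #{limit.to_i}
      FOR UPDATE SKIP LOCKED
    )
  SQL

  result.values.flatten.each { |line| puts line }
end

# With the full ~500-id list the inner node may show "Seq Scan on target_table";
# with a much shorter list it is more likely to stay on the referenced_id index.
explain_cleanup_delete(deleted_parent_ids.first(500), 1000)
explain_cleanup_delete(deleted_parent_ids.first(50), 100)
```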
To defend against this, we can:
- Catch statement timeouts and record query duration for the delete.
- In response to a statement timeout or a query taking longer than 1s, reduce both the limit in the delete and the number of ids considered (see the sketch after this list). This simplifies the query, helping to avoid the plan flip while still making at least some progress on the LFK operation.
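A minimal sketch of that back-off, assuming a hypothetical `run_delete(ids, limit)` helper that executes the delete shown earlier; the constants and method names are illustrative, not the actual service code:

```ruby
# Illustrative sketch: run one delete attempt and decide the sizes for the next one.
SLOW_QUERY_SECONDS = 1.0

def delete_with_backoff(ids, limit)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  begin
    run_delete(ids, limit) # hypothetical helper issuing the DELETE shown above
    slow = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started > SLOW_QUERY_SECONDS
  rescue ActiveRecord::QueryCanceled
    # statement_timeout fired; treat it the same as a slow query.
    slow = true
  end

  return [ids, limit] unless slow

  # Halve both the id list and the LIMIT so the next query is simpler and less
  # likely to flip to a sequential scan, while still making some progress.
  [ids.first([ids.size / 2, 1].max), [limit / 2, 1].max]
end
```

Returning the reduced sizes for the next attempt, rather than retrying immediately, matches the idea of simplifying the query while still letting the worker make a bit of progress on the backlog.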