Implement fair queueing for LFK

Adam Hegyi requested to merge 343547-fair-queueing-for-lfk into master

What does this MR do and why?

This change implements fair queueing for the loose foreign keys feature. If the child records of a deleted record batch cannot be cleaned up in 3 consecutive executions, the batch is "re-scheduled" for 5 minutes later. "Re-scheduled" might not be the perfect phrase here: we order the records by consume_after when loading a batch, and consume_after equals created_at by default. This means that if there aren't many inserts (deletions), we might load and process the re-scheduled records earlier than 5 minutes.
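The ordering behaviour described above can be sketched in plain Ruby (a simulation only, not the actual GitLab code; `Record` is a hypothetical stand-in for LooseForeignKeys::DeletedRecord):

```ruby
# Simulation: Record stands in for LooseForeignKeys::DeletedRecord.
Record = Struct.new(:id, :consume_after)

now = Time.now
records = [
  Record.new(1, now - 180), # created 3 minutes ago, consume_after == created_at
  Record.new(2, now + 300), # rescheduled: consume_after pushed 5 minutes ahead
  Record.new(3, now - 60)   # created 1 minute ago
]

# A batch load orders by consume_after, so the rescheduled record sorts last
# and is only picked up once the rest of the queue drains (or 5 minutes pass).
batch = records.sort_by(&:consume_after).first(2)
puts batch.map(&:id).inspect # prints [1, 3]
```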

The aim is to improve throughput by tracking the more difficult deletions (those with many child records) and applying a delayed cleanup.

Example scenario: A project is deleted with many ci_builds records

  1. A new row is inserted into the loose_foreign_keys_deleted_records table.
  2. The worker (in <1 minute) loads a batch of rows, including our project deletion row.
  3. The worker starts deleting all child records of the project, as defined in the LFK config.
  4. The worker starts deleting the ci_builds records of the project; at this point the worker is stopped because it's over the limits.
    • A worker is over the limit when it has run for 30s or has processed too many rows at once.
  5. For the current batch, the cleanup_attempts column is incremented.
  6. In one minute, another worker starts and does the same thing.
  7. After 3 failed attempts, the worker reschedules the current batch on its next run: it updates the consume_after column and resets cleanup_attempts to 0.
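The attempt counting in steps 5–7 can be sketched as follows (a plain Ruby simulation; the constant names, helper, and reset-to-zero behaviour follow this description, not necessarily the real implementation):

```ruby
RESCHEDULE_AFTER_ATTEMPTS = 3 # assumed threshold, per the steps above
RESCHEDULE_DELAY = 5 * 60     # 5 minutes, in seconds

# Record is a hypothetical stand-in for LooseForeignKeys::DeletedRecord.
Record = Struct.new(:cleanup_attempts, :consume_after)

# Called when a worker run goes over its limits before finishing the batch.
def record_failed_attempt(record, now = Time.now)
  if record.cleanup_attempts >= RESCHEDULE_AFTER_ATTEMPTS
    # Too many consecutive failures: push the batch back and reset the counter.
    record.consume_after = now + RESCHEDULE_DELAY
    record.cleanup_attempts = 0
  else
    record.cleanup_attempts += 1
  end
end

record = Record.new(0, Time.now)
3.times { record_failed_attempt(record) }
puts record.cleanup_attempts # prints 3
record_failed_attempt(record) # the next failed run triggers the reschedule
puts record.cleanup_attempts # prints 0
```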

Cleanup cycle for a project batch's problematic records:

  • 14:05 - Batch1 wasn't cleaned up, set cleanup_attempts = 1.
  • 14:06 - Batch1 wasn't cleaned up, set cleanup_attempts = 2.
  • 14:07 - Batch1 wasn't cleaned up, set cleanup_attempts = 3.
  • 14:08 - Batch1 wasn't cleaned up, reschedule Batch1 for a later time (5 minutes later).
  • 14:09 - Batch2 is being processed.
  • 14:10 - Batch3 is being processed.
  • 14:11 - Batch4 is being processed.
  • 14:15 - Batch5 is being processed.
  • 14:16 - A mixture of Batch1 and Batch6 is being processed.

There is a chance that Batch1 would run for a very long time; this would delay the other deletions in the table and effectively clog the queue. By delaying Batch1 a little, we give the other deleted records in the table time to get cleaned up.

  • Note 1: we're not dealing with individual records; the cleanup queries run on batches. This means that if a batch has one record that is "difficult" to clean up, the whole batch will be rescheduled.
  • Note 2: the feature is behind a feature flag: lfk_fair_queueing

How to set up and validate locally

Modify the app/models/loose_foreign_keys/modification_tracker.rb file and change the over_limit? method:

def over_limit?
  true
end

Modify the app/services/loose_foreign_keys/process_deleted_records_service.rb file and remove the following line:

break if modification_tracker.over_limit?

This simulates the "many child rows" scenario.

Enable the feature:

Feature.enable(:lfk_fair_queueing)

Insert a DeletedRecord:

record = LooseForeignKeys::DeletedRecord.create!(fully_qualified_table_name: 'public.projects', primary_key_value: 555)
record.reload

Invoke the service (as the worker would):

LooseForeignKeys::ProcessDeletedRecordsService.new(connection: ApplicationRecord.connection).execute

Verify the increment:

record.reload.cleanup_attempts

Invoke the worker 3 more times and inspect the record:

record.reload

The cleanup_attempts value must be 0 and consume_after should be 5 minutes in the future.

Database

All queries derived from .update_by_partition have similar performance characteristics:

UPDATE query example: https://explain.depesz.com/s/RihN#html

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #343547 (closed)
