Implement Sidekiq worker to delete rows from the deleted records table in parallel
MR !69165 (merged) introduces a Sidekiq worker to process rows stored in `loose_foreign_keys_deleted_records`. This worker runs periodically and does its work sequentially.
Today I had a call with @ahegyi about helping out with the loose foreign key work, and specifically about processing this table. I brought up the concern that this worker likely won't be able to process the data in a timely manner, given the expected volume of work. This problem is similar to issues we had in the past with the workers for pruning old web hook logs and events.
To tackle this issue I proposed the following:
First we add a new table that tracks the individual rows to delete, rather than the root row. So if you delete a project and insert a row for it into `loose_foreign_keys_deleted_records`, this new table will store data for all the associated rows (issues, comments, etc.).
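To make this a bit more concrete, here is a minimal sketch of what such a table could look like as a Rails migration. The table name (`loose_foreign_keys_pending_deletions`) and the columns are hypothetical and only meant to illustrate the shape of the data:

```ruby
# Hypothetical migration; table and column names are illustrative only.
class CreateLooseForeignKeysPendingDeletions < ActiveRecord::Migration[6.1]
  def change
    create_table :loose_foreign_keys_pending_deletions do |t|
      # The table the to-be-deleted row lives in, e.g. "issues" or "notes".
      t.text :fully_qualified_table_name, null: false
      # The primary key of the row to delete in that table.
      t.bigint :primary_key_value, null: false
      t.timestamps null: false
    end

    add_index :loose_foreign_keys_pending_deletions,
              [:fully_qualified_table_name, :primary_key_value],
              unique: true,
              name: 'index_lfk_pending_deletions_on_table_and_pk'
  end
end
```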
We then split the Sidekiq worker into two stages:
- A stage/worker that processes `loose_foreign_keys_deleted_records` and inserts the appropriate data into this new table (sketched right after this list)
- A stage/worker that processes the new table
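As a rough illustration of the first stage, this is what the crawling worker could look like. All class names and the `rows_to_delete_for` helper are assumptions for the sake of the sketch, not the actual implementation:

```ruby
# Hypothetical stage 1 worker: expand parent records into per-row work.
module LooseForeignKeys
  class CrawlDeletedRecordsWorker
    include Sidekiq::Worker

    BATCH_SIZE = 1_000

    def perform
      LooseForeignKeys::DeletedRecord.in_batches(of: BATCH_SIZE) do |batch|
        batch.each do |deleted_record|
          # For every associated row (issues, notes, ...) insert a tracking
          # row into the new table. `rows_to_delete_for` is a hypothetical
          # helper that resolves the loose foreign key definitions and
          # returns an array of attribute hashes.
          rows = rows_to_delete_for(deleted_record)

          LooseForeignKeys::PendingDeletion.insert_all(rows) if rows.any?
        end

        # The parent entries have been expanded into per-row work, so they
        # can be removed (or marked as processed).
        batch.delete_all
      end
    end
  end
end
```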
This setup is based on the assumption that crawling associations probably takes the most time but introduces the least load, whereas deletions will likely be fast but introduce load (due to index updates, vacuuming, etc.). Splitting this up allows us to control and monitor this more easily. For example, we can have the first stage run frequently, while running the second stage less often.
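In GitLab the scheduling would presumably go through the existing cron job settings; the sidekiq-cron snippet below, with made-up schedules, just illustrates the idea of running the two stages at different frequencies:

```ruby
# Hypothetical schedules: crawl often (slow but light on the database),
# delete less often (fast queries, but they generate write load).
Sidekiq::Cron::Job.create(
  name: 'loose_foreign_keys_crawl',
  cron: '*/2 * * * *', # every 2 minutes
  class: 'LooseForeignKeys::CrawlDeletedRecordsWorker'
)

Sidekiq::Cron::Job.create(
  name: 'loose_foreign_keys_delete',
  cron: '*/10 * * * *', # every 10 minutes
  class: 'LooseForeignKeys::DeleteRowsWorker'
)
```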
In addition, processing the work in parallel becomes easier. Crawling can only be done in parallel at the per-row level in `loose_foreign_keys_deleted_records`. That is, we can schedule N jobs to process X rows each, but each job can only crawl one of those rows at a time. Deleting data, on the other hand, is embarrassingly parallel: just get N rows from the table, group them per table name, then run a `DELETE` for the rows in that table.
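The second stage could then look roughly like the sketch below: take a batch of pending rows, group them by table name, and issue one `DELETE` per table. Again, the class names and batching details are assumptions:

```ruby
# Hypothetical stage 2 worker: delete tracked rows, grouped per target table.
module LooseForeignKeys
  class DeleteRowsWorker
    include Sidekiq::Worker

    BATCH_SIZE = 1_000

    def perform
      rows = LooseForeignKeys::PendingDeletion.limit(BATCH_SIZE).to_a
      return if rows.empty?

      conn = ActiveRecord::Base.connection

      # Grouping per table lets one DELETE remove many rows at once, and lets
      # N parallel jobs each work on their own slice of the pending rows.
      rows.group_by(&:fully_qualified_table_name).each do |table_name, group|
        ids = group.map(&:primary_key_value).map(&:to_i)

        conn.execute(<<~SQL)
          DELETE FROM #{conn.quote_table_name(table_name)}
          WHERE id IN (#{ids.join(', ')})
        SQL
      end

      # Drop the processed tracking rows.
      LooseForeignKeys::PendingDeletion.where(id: rows.map(&:id)).delete_all
    end
  end
end
```

To actually run N of these jobs in parallel, each job would need to pick a disjoint set of pending rows; something like `FOR UPDATE SKIP LOCKED` would be one option, but that's a detail to figure out later.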
The downside is a more complex setup, and that this new table has to store more data. Most notably, it will end up storing the table name for every row that needs to be removed. Assuming a maximum of 63 bytes per table name (as all our table names are ASCII only), this results in roughly 1 MB of space for every 20,000 rows (at most 63 bytes × 20,000 ≈ 1.2 MB). This isn't too bad, but it's something we need to be mindful of.
Let's use this issue to see if this is worthwhile to pursue, or if there are perhaps better options available.