Alerting on Loose Foreign Keys Deleted Records Processing
After identifying a recent problem with LFK processing (#419119, closed), we should add some alerting on the LFK workers to make sure they are working as expected and keeping up with the deleted records.
We have some Prometheus metrics that we can use for this alerting: loose_foreign_key_updates, loose_foreign_key_deletions, loose_foreign_key_incremented_deleted_records, and loose_foreign_key_rescheduled_deleted_records.
Some ideas on events that we should alert on
- The number of partitions of the loose_foreign_keys_deleted_records table on main or ci is higher than X. Probably 3 is a good number for X.
- The number of deletions/updates stays high for a long time. This refers to child records being updated or deleted.
- We hit the time/processed-items limits for more than X days on one database. This information might not be available in the current metrics. Maybe we need to modify https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/loose_foreign_keys/batch_cleaner_service.rb to add this information.
- The number of pending records in the database stays above 0 for a long time (more than X days). This information might not be available in the current metrics either. Maybe we need to modify https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/loose_foreign_keys/batch_cleaner_service.rb to add this information.
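Most of these ideas share one shape: a value stays above a threshold for longer than some duration. Below is a minimal Python sketch of that check, only to make the condition concrete. It is not the actual alerting implementation, which would more likely be expressed as a Prometheus alerting rule. The function name sustained_above, the sample format, and the thresholds are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Hypothetical sample format: (timestamp, metric value)
Sample = Tuple[datetime, float]


def sustained_above(
    samples: Iterable[Sample],
    threshold: float,
    min_duration: timedelta,
) -> bool:
    """Return True if the metric has been above `threshold` continuously,
    up to and including the latest sample, for at least `min_duration`."""
    breach_start: Optional[datetime] = None
    latest: Optional[datetime] = None

    for ts, value in sorted(samples):
        latest = ts
        if value > threshold:
            # Remember when the current uninterrupted breach began.
            if breach_start is None:
                breach_start = ts
        else:
            # Any sample at or below the threshold resets the breach.
            breach_start = None

    if breach_start is None or latest is None:
        return False
    return latest - breach_start >= min_duration
```

For example, with hourly samples of the partition count, sustained_above(samples, 3, timedelta(days=2)) would be true only if the count was above 3 in every sample of the last two days. A single sample at or below the threshold resets the window, which is a deliberate choice here to avoid alerting on a backlog that the workers are actually clearing.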
The Goal
The end goal is to make sure that we don't accumulate a large number of pending records (status = 1) in the loose_foreign_keys_deleted_records table. If we reach this state, we should be alerted so that we can look into what's wrong. At the time of writing this issue, we had 29M pending records on the CI database.
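To illustrate the number we care about, here is a small Python sketch that counts pending records per source table. It uses an in-memory SQLite database as a stand-in for PostgreSQL, and the schema is a simplified assumption made for this example (only status = 1 meaning pending comes from this issue). The helper names build_demo_db and pending_counts and the demo rows are hypothetical.

```python
import sqlite3

PENDING = 1  # status value for pending records, as stated in this issue
TABLE = "loose_foreign_keys_deleted_records"


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory table with a simplified, assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"CREATE TABLE {TABLE} ("
        "id INTEGER PRIMARY KEY, "
        "fully_qualified_table_name TEXT NOT NULL, "
        "primary_key_value INTEGER NOT NULL, "
        "status INTEGER NOT NULL)"
    )
    # Demo rows only; any status other than 1 is treated as not pending.
    rows = [
        ("public.ci_pipelines", 1, PENDING),
        ("public.ci_pipelines", 2, PENDING),
        ("public.ci_builds", 3, PENDING),
        ("public.ci_builds", 4, 2),
    ]
    conn.executemany(
        f"INSERT INTO {TABLE} "
        "(fully_qualified_table_name, primary_key_value, status) "
        "VALUES (?, ?, ?)",
        rows,
    )
    return conn


def pending_counts(conn: sqlite3.Connection) -> dict:
    """Return {source table name: number of pending deleted records}."""
    query = (
        f"SELECT fully_qualified_table_name, COUNT(*) AS pending FROM {TABLE} "
        "WHERE status = ? "
        "GROUP BY fully_qualified_table_name "
        "ORDER BY pending DESC, fully_qualified_table_name"
    )
    return dict(conn.execute(query, (PENDING,)).fetchall())
```

Breaking the count down by source table is a choice made for this sketch: it would show which parent table is producing the backlog. A single total may be enough for the alert itself.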
An example of a similar alerting MR related to another topic (Consistency Checking): gitlab-com/runbooks!5646 (merged)