Check for accumulation of unprocessed loose foreign keys deleted records

Adding a new check

This check looks for an accumulation (over 10000) of unprocessed entries in the loose_foreign_keys_deleted_records table that may result in negative performance impacts on the system in a variety of ways. This check notifies the user if there is a large accumulation and refers them to this page with a description of the problem and some possible mitigation approaches.

Verification steps for review

Because this check is only relevant on heavily-used systems it's hard to get a sufficient test case to check, but we simulated a system with a small table of loose foreign keys and manually set two of them to status=1 to test the check with a lower trigger case:

gitlabhq_production=# select * from loose_foreign_keys_deleted_records;
  id  | partition | primary_key_value | status |          created_at           | fully_qualified_table_name |         consume_after         | cleanup_attempts 
------+-----------+-------------------+--------+-------------------------------+----------------------------+-------------------------------+------------------
 2715 |       203 |              1810 |      2 | 2025-11-04 01:00:30.277415+00 | public.ci_runner_machines  | 2025-11-04 01:00:30.277415+00 |                0
 2717 |       203 |              1809 |      2 | 2025-11-04 01:00:30.36289+00  | public.ci_runner_machines  | 2025-11-04 01:00:30.36289+00  |                0
 2712 |       203 |              1304 |      2 | 2025-11-04 01:00:19.36852+00  | public.ci_runners          | 2025-11-04 01:00:19.36852+00  |                0
 2714 |       203 |              1307 |      2 | 2025-11-04 01:00:30.277415+00 | public.ci_runners          | 2025-11-04 01:00:30.277415+00 |                0
 2716 |       203 |              1306 |      2 | 2025-11-04 01:00:30.36289+00  | public.ci_runners          | 2025-11-04 01:00:30.36289+00  |                0
 2718 |       203 |              1780 |      2 | 2025-11-04 15:36:05.018959+00 | public.ci_runner_machines  | 2025-11-04 15:36:05.018959+00 |                0
 2719 |       203 |              1782 |      2 | 2025-11-04 15:36:05.018959+00 | public.ci_runner_machines  | 2025-11-04 15:36:05.018959+00 |                0
 2720 |       203 |              1781 |      2 | 2025-11-04 15:36:05.018959+00 | public.ci_runner_machines  | 2025-11-04 15:36:05.018959+00 |                0
 2713 |       203 |              1806 |      1 | 2025-11-04 01:00:19.36852+00  | public.ci_runner_machines  | 2025-11-04 01:00:19.36852+00  |                0
 2711 |       203 |              1779 |      1 | 2025-11-03 23:36:02.093413+00 | public.ci_runner_machines  | 2025-11-03 23:36:02.093413+00 |                0
(10 rows)
command: (value=$(sudo /opt/gitlab/bin/gitlab-psql -At -c "select count(*) from loose_foreign_keys_deleted_records WHERE status = 1;") && [ $value -ge 1 ] )
spot --ssh-agent -v -p all_playbook.yml -u admin -k ~/.ssh/id_rsa -e GITLAB_VERSION:17.11.1 -n "20251103_135 - run check"
spot v1.19.1-74e1afa-2025-08-29T17:32:47Z
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22] run task "20251103_135 - run check", commands: 1
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22] run command "Run the check and store result"
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22]  > sudo /bin/sh -c /tmp/.spot-2189275476069177344/spot-script1011991653
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22]  > setvar JSON_DATA_20251103_135_all_diana={ "ref_url": "https://gitlab.com/gitlab-org/gitlab/-/issues/419119", "title": "Check for accumulation of loose foreign keys deleted records", "host": "ec2-54-153-2-60.us-west-1.compute.amazonaws.com", "workaround_url": "https://docs.gitlab.com/development/database/loose_foreign_keys/#accumulation-of-deleted-records", "version_started": "14.0.0", "version_fixed": null, "message": "You have a significant accumulation of unprocessed loose foreign keys deleted records.\n This can cause LooseForeignKeys::CleanupWorker timeouts and performance issues.\n Consider investigating the cleanup process and potentially running manual cleanup\n if the accumulation is severe." }
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22] completed command "Run the check and store result" {script: /bin/sh -c [multiline script]} (1.179s)
[diana ec2-54-153-2-60.us-west-1.compute.amazonaws.com:22] completed task "20251103_135 - run check", commands: 1 (1.571s)

Closes #135

Author checklist

  • After opening the MR:
    • Set it to the current milestone
    • Ask the Maintainer from the Reviewer roulette suggestion for review

Reviewer checklist

  • I followed the verification steps and confirm the functionality of the new check
    • I executed the check as presented in this MR by running the generated playbook with spot
    • In case of unexpected/odd behavior here, verify the generated playbook to account for potential YAML parsing issues
  • This check does only perform read operations
  • This check does not output more than necessary on stdout for the check to function
  • The message explains what it means when this check does not pass
  • The workaround_url provides actionable information/steps for affected users
    • Consider if a Knowledge Base article should exist to serve as the ideal workaround URL
  • This check is not using the Rails console/runner, or has Maintainer approval for doing so
  • This check is not using a Rake task, or has Maintainer approval for doing so
  • If this is a breaking change check:
    • It has the corresponding xx_breaking_changes tag (xx being the major release version for the change)
    • The workaround_url goes to the entry on the https://docs.gitlab.com/update/deprecations/ page
    • The ref_url goes to the deprecation issue linked from that entry
    • The title is the same as that entry
    • The version_started is equal to the announcement_milestone of the deprecation
    • The version_fixed is equal to the removal_milestone of the deprecation
Edited by Diana Stanley

Merge request reports

Loading