Change loose_foreign_keys_deleted_records architecture to be Cell compatible
## Summary
This issue is the result of the investigation done on https://gitlab.com/gitlab-org/gitlab/-/work_items/535508+
Split `loose_foreign_keys_deleted_records` into sharding-key-specific tables so LFK cleanup is compatible with Cells architecture and Organization Mover.
This table is uniquely difficult to shard because it accepts records from every table in the system, including both org-specific and cell-local tables. A single non-nullable sharding key column cannot be applied uniformly, which conflicts with the Cells requirement that every row in an org-scoped table must be attributable to an organization.
### Proposed Solution
Today all deleted record events land in a single cell-local table (`loose_foreign_keys_deleted_records`). This issue will involve the work to route records to one of 4 new tables based on their sharding key (`organization_id`, `namespace_id`, `project_id`, `user_id`), introduce a `DeletedRecordStore` facade that makes all 5 tables behave as a single unit, and gradually roll out the trigger changes. Records without any sharding key continue flowing to the original `loose_foreign_keys_deleted_records` table.
### Architecture
```
DeletedRecordStore (facade)
|
┌─--------──────────-----------------------┬──────────────------------------------------------┬──────────────-----------------------------┬---------───────------------------───────┐
| | | | |
DeletedRecord (cell-local) OrganizationDeletedRecord NamespaceDeletedRecord ProjectDeletedRecord User
| | | | |
loose_foreign_keys_deleted_records loose_foreign_keys_organization_deleted_records loose_foreign_keys_namespace_deleted_records loose_foreign_keys_project_deleted_records loose_foreign_keys_user_deleted_records
```
All 5 models share behavior via `DeletedRecordConcern`. The DB trigger function routes records to the appropriate table based on sharding keys defined in each table's `db/docs/*.yml` dictionary entry.
`DeletedRecordStore` is a facade that queries all 5 models, merges results by `consume_order` (`partition_number`, `consume_after`, `id`), and routes updates back to the correct model via `records.group_by(&:class)`.
### POCs
Below there's a list of POCs we used to back our decision with the chosen architecture
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/213250
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/222469
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/223305
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/223978
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/224599
* https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961
## Implementation Plan
* Phase 1 Foundation - Create 4 new empty tables, models, concerns, and partitions. Nothing writes to new tables yet.
* Phase 2 DeletedRecordStore - Create a facade that defaults to the current LFK table behind a feature flag.
* Phase 3 Trigger function update + first table - Update the trigger function to support sharding key routing (backward compatible) and wire up one table's triggers end-to-end.
* Phase 4 Gradual trigger rollout - Roll out trigger rewrites to remaining tables.
* Phase 5 Cleanup
### Phase 1 - tables, models, concern, partitions
Creates the basic foundation. New tables are empty. No trigger writes to them.
### MR: Create sharded LFK tables and model infrastructure
- [ ] Create migration to add [4 new partitioned tables](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-faf4e97a4f00f186f28e75d5851bf27399e355f7) (`loose_foreign_keys_organization_deleted_records`, `loose_foreign_keys_namespace_deleted_records`, `loose_foreign_keys_project_deleted_records`, `loose_foreign_keys_user_deleted_records`), each partitioned by LIST on `partition`.
- [ ] Each table should have the same structure from `loose_foreign_keys_deleted_records`
- [ ] Create [`DeletedRecordConcern` (`app/models/concerns/loose_foreign_keys/deleted_record_concern.rb`)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-a2a4f2c711b2137cf4c1f6770fcd0a703c860b42) to share the same behavior for all the models.
- [ ] Refactor [`DeletedRecord`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-46ade0ca735a3dd67f075fb5d44be8144ff757da) to `include DeletedRecordConcern` (removes ~90 lines of duplicated logic)
- [ ] Create [4 new models](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-5ef8c4d7dae3cbc8284da590e6e59388487f7ea0): `OrganizationDeletedRecord`, `NamespaceDeletedRecord`, `ProjectDeletedRecord`, `UserDeletedRecord`. Each includes the concern and adds sharding key validation + scope.
- [ ] Register 4 new models in `config/initializers/postgres_partitioning.rb`
- [ ] Verify `spec/models/loose_foreign_keys/deleted_record_spec.rb` passes (validates concern refactor)
Acceptance criteria: All existing LFK specs pass. `DeletedRecord` behavior identical what we have today.
---
## Phase 2 — DeletedRecordStore structure behind feature flag
Adds a new feature flag defaulting to off. With flag on, store queries empty sharded tables introduced in phase one (safe no-op).
### MR 2a: Services accept `record_store:` parameter
- [ ] Add [`record_store:`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-9af1c20e976396d9d17ce53476df9cb5190856fd) keyword argument to `ProcessDeletedRecordsService` (default: `LooseForeignKeys::DeletedRecord`)
- [ ] Add [`record_store:`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-9e2f7b544c4d6f6e69377f8a4ef293f505c2f2d6) keyword argument to `BatchCleanerService` (default: `LooseForeignKeys::DeletedRecord`)
- [ ] Fix `BatchCleanerService#db_config_name` to use the proper model connection instead of `LooseForeignKeys::DeletedRecord.connection`
- [ ] Replace 3 hardcoded `LooseForeignKeys::DeletedRecord` calls in `BatchCleanerService` with `record_store`
- [ ] Update service specs
Acceptance criteria: All service and worker specs pass. Existing callers unaffected.
### MR 2b: DeletedRecordStore + workers behind feature flag
- [ ] Create [`Gitlab::LooseForeignKeys::DeletedRecordStore`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-d9ea776db9c63aae9656dcd2d708da8d8c27ac79) facade
- [ ] Add feature flag definition for `use_loose_foreign_keys_deleted_record_store`
- [ ] Update [`CleanupWorker`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-667d5a3be6c0a9edf238c740244ad7af1126c138), add `record_store` method with flag check, pass to service
- [ ] Update [`CiPipelinesBuildsCleanupCronWorker`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-8ccdfcb535cba609c84d39afd2a899b2dcdbc450), same pattern
- [ ] Update [`MergeRequestDiffCommitCleanupWorker`](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-4e624ef1ee249c70647818f644a791eeac263224) same pattern
- [ ] Update worker E2E specs
Acceptance criteria: Specs pass with flag on and off.
### Post-deploy action
```shell
/chatops run feature set use_loose_foreign_keys_deleted_record_store true
```
Validate store works in production against empty sharded tables. Monitor Sidekiq logs and `loose_foreign_key_processed_deleted_records` counter. Beside having 4 new models and a facade, everything should keep working through the DeletedRecord as usual.
---
## Phase 3 — Trigger function update
Validates full end-to-end flow before broad rollout.
### MR 3a: Update shared trigger functions
- [ ] Create migration to [update](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/231961/diffs#diff-content-b01d63eea1ac0c298219a86c12752e9ee4901e3e) the `insert_into_loose_foreign_keys_deleted_records()` and `insert_into_loose_foreign_keys_deleted_records_override_table()` to accept sharding key JSON.
- [ ] Add fallback: when no sharding keys passed or all values NULL, insert into cell-local table. If table has null sharding keys, it must also fallback to cell-local.
- [ ] Add validations to fail CI/Specs in case we have unknown sharding key.
- [ ] Add a new temporary method to `LooseForeignKeyHelpers` helper to proper create and update the triggers via migration. We would later, move this as the default method for handling the sharding keys triggers.
- [ ] Update `db/structure.sql`
Acceptance criteria: No behavior change. Existing triggers don't pass sharding keys, so functions hit fallback path.
### MR 3b: rewrite trigger for medium size/traffic table (project_repositories?)
- [ ] Create migration that rewrites to the chosen table trigger using updated LooseForeignKeyHelpers helper (automatically includes sharding keys from `db/docs/merge_request_diffs.yml`)
Acceptance criteria: Migration runs cleanly.
### Post-deploy monitoring
```sql
-- Records should appear in project table
SELECT count(*) FROM loose_foreign_keys_<sharding key>_deleted_records
WHERE fully_qualified_table_name = 'public.choosen_table' AND status = 1;
-- Cell-local table should stop receiving records for this table
SELECT count(*) FROM loose_foreign_keys_deleted_records
WHERE fully_qualified_table_name = 'public.choosen_table' AND status = 1;
```
Verify:
- [ ] Records route to `ProjectDeletedRecord`
- [ ] Workers process them through `DeletedRecordStore`
- [ ] `chose table` child records cleaned up correctly
- [ ] No increase in `loose_foreign_key_rescheduled_deleted_records`
- [ ] Stable for at least 1-2 days before proceeding
---
## Phase 4 — Gradual trigger rollout
Same approach as Phase 3, applied in batches.
### MR 4c: Batch 1 — Tables with other sharding keys
- [ ] Rewrite triggers for tables routed to `organization_id`, `namespace_id`, or `user_id`
- [ ] Validates that non `project_id` routing paths
- [ ] Monitor 1-2 days
### MR 4a: Batch 2 — CI tables (dedicated worker)
- [ ] Rewrite triggers for `p_ci_pipelines` and `p_ci_builds`
- [ ] These use `CiPipelinesBuildsCleanupCronWorker`. Issues isolated from main worker
- [ ] Monitor 1-2 days
### MR 4b: Batch 3 — High-traffic main tables
- [ ] Rewrite triggers for high-traffic tables with `project_id` sharding key processed by main `CleanupWorker`
- [ ] Monitor 1-2 days
### MR 4d: Batch 4 — Remaining tables
- [ ] Rewrite triggers for all remaining tables
- [ ] Can use bulk migration targeting only tables not yet migrated
- [ ] Monitor 1-2 days
### Monitoring between batches
```sql
SELECT 'deleted_records' AS source, count(*) FROM loose_foreign_keys_deleted_records WHERE status = 1
UNION ALL
SELECT 'organization', count(*) FROM loose_foreign_keys_organization_deleted_records WHERE status = 1
UNION ALL
SELECT 'namespace', count(*) FROM loose_foreign_keys_namespace_deleted_records WHERE status = 1
UNION ALL
SELECT 'project', count(*) FROM loose_foreign_keys_project_deleted_records WHERE status = 1
UNION ALL
SELECT 'user', count(*) FROM loose_foreign_keys_user_deleted_records WHERE status = 1;
```
- [ ] Pending counts for migrated tables decrease in sharded tables
- [ ] No new pending records for migrated tables in cell-local table
- [ ] Prometheus counters healthy
- [ ] No Sidekiq errors
- [ ] Add `sharding_keys_args` and `sharding_keys_for` helpers to `LooseForeignKeyHelpers`
---
## Phase 5 — Cleanup
Once all triggers have migrated and are stable for 3-4 weeks.
### MR: Remove feature flag and finalize
- [ ] Remove `use_loose_foreign_keys_deleted_record_store` feature flag YAML
- [ ] Remove conditional logic from workers hardcode `Gitlab::LooseForeignKeys::DeletedRecordStore`
- [ ] Update `ProcessDeletedRecordsService` default from `DeletedRecord` to `DeletedRecordStore`
- [ ] Update `doc/development/database/loose_foreign_keys.md`
- [ ] Update `spec/support/shared_examples/loose_foreign_keys/have_loose_foreign_key.rb`
- [ ] Update `spec/lib/gitlab/database/migration_helpers/loose_foreign_key_helpers_spec.rb`
Related #535508
issue