Geo: Enumerate project_repositories instead of projects for verification/replication of Git repos
## Overview
This issue is part of a [multi-step implementation plan](https://gitlab.com/groups/gitlab-org/-/epics/17974#implementation-steps) to handle Geo verification/replication of projects without Git repositories, reducing errors caused when Gitaly tries to fetch non-existent repos.
As with Design Management Repositories, we want Geo to enumerate the `project_repositories` table instead of the `projects` table. In the GitLab Rails code, a `ProjectRepository` record represents an actual Git repo, whereas a `Project` record may exist without one. This will simplify the logic for determining when a Git repo needs to be fetched for verification/replication.
This is the core of **V2 project repository replication**, gated behind the `geo_project_repository_replication_v2` feature flag.
---
## What needs to be done
The branch [`546175-geo-enumerate-project_repositories-instead-of-projects`](https://gitlab.com/gitlab-org/gitlab/-/compare/master...546175-geo-enumerate-project_repositories-instead-of-projects?from_project_id=278964) contains a proof-of-concept that is **not yet mergeable**. The following changes are proposed there and need to be completed:
**`ProjectRepositoryReplicator`** (`ee/app/replicators/geo/project_repository_replicator.rb`)
- `model` is gated to return `::ProjectRepository` when the FF is enabled, `::Project` otherwise.
- Under V2, `should_publish_replication_event?` publishes events only for `ProjectRepository` records.
- `repository`, `pool_repository`, and `object_pool_missing?` are delegated through `model_record`, which can now be either a `Project` or a `ProjectRepository`.
**`Geo::ProjectRepositoryRegistry`** (`ee/app/models/geo/project_repository_registry.rb`)
- A `belongs_to :project_repository` association is added alongside the existing `belongs_to :project`.
- `model_class` and `model_foreign_key` are FF-gated to return `ProjectRepository` / `:project_repository_id` under V2.
- A `model_accessor` helper is introduced to return the right model record depending on the FF.
- `before_validation` and `after_find` callbacks call `populate_missing_foreign_keys` to keep both FKs in sync during the V1/V2 migration period.
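
The FF-gated helpers described above can be pictured with a plain-Ruby sketch. This is not the code on the branch: `FeatureFlags`, `RegistrySketch`, and the two `Struct` models are hypothetical stand-ins that only mirror the names `model_class`, `model_foreign_key`, and `model_accessor`, so the switching behaviour can be run without Rails.

```ruby
# Plain-Ruby stand-in for the FF-gated registry helpers. No Rails required.
# FeatureFlags is a hypothetical stub, not GitLab's feature flag API.
module FeatureFlags
  @enabled = {}

  def self.enable(name)
    @enabled[name] = true
  end

  def self.disable(name)
    @enabled[name] = false
  end

  def self.enabled?(name)
    @enabled.fetch(name, false)
  end
end

# Hypothetical minimal models; the real ones are ActiveRecord classes.
Project = Struct.new(:id)
ProjectRepository = Struct.new(:id, :project)

class RegistrySketch
  FLAG = :geo_project_repository_replication_v2

  attr_accessor :project, :project_repository

  def self.v2?
    FeatureFlags.enabled?(FLAG)
  end

  # Class that Geo enumerates: ProjectRepository under V2, Project under V1.
  def self.model_class
    v2? ? ProjectRepository : Project
  end

  # Foreign key that points at the enumerated model.
  def self.model_foreign_key
    v2? ? :project_repository_id : :project_id
  end

  # Returns whichever associated record matches the active path.
  def model_accessor
    self.class.v2? ? project_repository : project
  end
end

project = Project.new(1)
registry = RegistrySketch.new
registry.project = project
registry.project_repository = ProjectRepository.new(10, project)

FeatureFlags.enable(RegistrySketch::FLAG)
puts RegistrySketch.model_foreign_key    # project_repository_id
puts registry.model_accessor.class       # ProjectRepository
```

Keeping the flag check in one place (`v2?`) means the replicator and the registry cannot disagree about which path is active.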
**`EE::ProjectRepository`** (`ee/app/models/ee/project_repository.rb`)
- `pool_repository`, `object_pool_missing?`, and `last_repository_updated_at` are delegated to `project`, so that `ProjectRepository` can be used as a drop-in for `Project` in the replicator.
**Factories and specs**
- `spec/factories/projects.rb`: `project_with_repo` factory now calls `track_project_repository` to ensure a `ProjectRepository` record exists.
- `ee/spec/factories/geo/project_repository_registry.rb`: factory updated to handle both V1 and V2 cases.
- Specs for `Geo::ProjectRepositoryRegistry`, `GeoNodeStatus`, the data management API, and `Geo::Secondary::RegistryConsistencyWorker` are updated to cover both V1 and V2 paths.
---
## Known hurdle: N+1 queries from the `after_find` callback
The main technical blocker is an N+1 query problem introduced by the `after_find :populate_missing_foreign_keys` callback in `Geo::ProjectRepositoryRegistry`:
```ruby
# ee/app/models/geo/project_repository_registry.rb
after_find :populate_missing_foreign_keys

def populate_missing_foreign_keys
  return unless has_attribute?(:project_id) && has_attribute?(:project_repository_id)
  return if project.present? && project_repository.present?

  self.project_repository ||= project&.project_repository
  self.project ||= project_repository&.project
end
```
This callback fires for **every record** loaded from the database. When loading a batch of registry records (e.g. in cron workers), each record may trigger one or two additional SQL queries to load the `project` or `project_repository` association. On large instances with millions of project repositories, this produces a severe N+1 query pattern that can significantly degrade performance.
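
To make the cost concrete, here is a minimal plain-Ruby simulation of that pattern. `MainDb` and `RegistryRow` are hypothetical stand-ins (not GitLab classes) and the "query" is just a counter; the point is only that the number of extra lookups grows linearly with the batch size.

```ruby
# Counts simulated queries to show the N+1 shape.
# MainDb and RegistryRow are hypothetical stand-ins, not GitLab classes.
class MainDb
  attr_reader :query_count

  def initialize(project_ids)
    @project_ids = project_ids
    @query_count = 0
  end

  # Simulates one `SELECT ... FROM projects WHERE id = ?` on the main database.
  def find_project(id)
    @query_count += 1
    @project_ids.include?(id) ? { id: id } : nil
  end
end

class RegistryRow
  attr_reader :project

  # Mimics an after_find hook: every loaded row resolves its association at once.
  def initialize(project_id, main_db)
    @project = main_db.find_project(project_id)
  end
end

# Loads a batch of registry rows and returns how many extra queries that cost.
def extra_queries_for_batch(batch_size)
  main_db = MainDb.new((1..batch_size).to_a)
  (1..batch_size).each { |id| RegistryRow.new(id, main_db) }
  main_db.query_count
end

puts extra_queries_for_batch(1_000)
```

Running it prints `1000`: one extra lookup per loaded row, before any caller has asked for the association.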
The usual mitigation of adding a scope with `includes` is **not applicable here** because `geo_project_repository_registry` lives in the `geo` database, while `projects` and `project_repositories` live in the `main` database. Cross-database eager loading is not supported.
**Possible approaches:**
- Move the population logic out of `after_find` entirely, and instead resolve the missing FK lazily on demand (e.g. only when `model_accessor` is called), accepting that some records may temporarily have only one FK populated.
- Evaluate whether both FKs need to be populated at all times, or whether the registry can operate correctly with only one FK set and resolve the other only when strictly necessary (e.g. by calling `populate_missing_foreign_keys` explicitly from service-layer code rather than from a callback).
- Consider whether the `after_find` callback can be removed once the migration period is over and both FKs are always populated by construction.
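
The first approach (lazy, on-demand resolution) could look roughly like the sketch below. It is an illustration under stated assumptions, not a proposal for the final code: `FakeMainDb`, `RepoRow`, `LazyRegistry`, and `project_repository_id!` are hypothetical names, and the main database is replaced by an in-memory hash with a query counter.

```ruby
# Lazy resolution sketch: the missing FK is looked up only when asked for.
# FakeMainDb, RepoRow and LazyRegistry are hypothetical, not GitLab classes.
RepoRow = Struct.new(:id, :project_id)

class FakeMainDb
  attr_reader :query_count

  def initialize(repositories)
    @repositories_by_project_id = repositories.to_h { |r| [r.project_id, r] }
    @query_count = 0
  end

  # Simulates one lookup of a project_repositories row by project_id.
  def repository_for_project(project_id)
    @query_count += 1
    @repositories_by_project_id[project_id]
  end
end

class LazyRegistry
  attr_reader :project_id

  def initialize(main_db, project_id:, project_repository_id: nil)
    @main_db = main_db
    @project_id = project_id
    @project_repository_id = project_repository_id
  end

  # Resolves the missing FK on first use and memoizes it.
  # Loading the row itself costs no extra query. A nil result is not
  # memoized, so a row without a repository is looked up again next time.
  def project_repository_id!
    @project_repository_id ||= @main_db.repository_for_project(@project_id)&.id
  end
end

# Returns [queries after loading the batch, queries after touching `accessed` rows].
def lazy_demo(batch_size, accessed)
  repos = (1..batch_size).map { |i| RepoRow.new(i + 100, i) }
  db = FakeMainDb.new(repos)
  registries = (1..batch_size).map { |i| LazyRegistry.new(db, project_id: i) }
  after_load = db.query_count

  registries.first(accessed).each(&:project_repository_id!)
  registries.first(accessed).each(&:project_repository_id!) # served from memo

  [after_load, db.query_count]
end

p lazy_demo(1_000, 3)
```

Here loading 1,000 rows issues no lookups, and only the three rows whose FK is actually read trigger one lookup each, so the script prints `[0, 3]`. The trade-off named in the list above still applies: rows that are never accessed keep only one FK populated.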
---
## Acceptance criteria
- [ ] Under `geo_project_repository_replication_v2`, Geo enumerates `project_repositories` instead of `projects` for replication and verification of Git repos.
- [ ] The `after_find` N+1 in `Geo::ProjectRepositoryRegistry` is resolved.
- [ ] All existing Geo shared examples pass for both V1 and V2 paths.