Geo: Handle toggling between legacy and v2 of project repo replication
<!--IssueSummary start--> <details> <summary> Everyone can contribute. [Help move this issue forward](https://handbook.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows/#contributor-links) while earning points, leveling up and collecting rewards. </summary> - [Close this issue](https://contributors.gitlab.com/manage-issue?action=close&projectId=278964&issueIid=563394) </details> <!--IssueSummary end--> <details> <summary> Everyone can contribute. [Help move this issue forward](https://handbook.gitlab.com/handbook/marketing/developer-relations/contributor-success/community-contributors-workflows/#contributor-links) while earning points, leveling up and collecting rewards. </summary> - [Close this issue](https://contributors.gitlab.com/manage-issue?action=close&projectId=278964&issueIid=563394) </details> ## Overview Once [#546175](https://gitlab.com/gitlab-org/gitlab/-/work_items/546175) is complete, `ProjectRepository` becomes the replicable model under V2. However, `Project` records still fire Geo events via `geo_handle_after_update` / `geo_handle_after_create` / `geo_handle_after_destroy`, because `Project` still includes `Geo::ReplicableModel` and `Geo::VerifiableModel`. Under V2, these events are redundant and wasteful: the `ProjectRepositoryReplicator#should_publish_replication_event?` guard already drops them, but the calls still happen on every repository write. This issue ensures that, when `geo_project_repository_replication_v2` is enabled, **Geo events no longer fire for `Project` records**, and that toggling the FF in either direction is seamless — with no data loss and without requiring re-replication of already-synced repositories. --- ## Context: how event hooks work today The consolidation work from [#437929](https://gitlab.com/gitlab-org/gitlab/-/work_items/437929) is largely complete. `EE::Repository#log_geo_updated_event` is now the single source of truth for `geo_handle_after_update` calls: ```ruby # ee/app/models/ee/repository.rb def log_geo_updated_event return unless ::Gitlab::Geo.primary? case container when Project, DesignManagement::Repository container.geo_handle_after_update when ProjectWiki project.wiki_repository&.geo_handle_after_update # ... end end ``` `geo_handle_after_update` delegates to `ProjectRepositoryReplicator#geo_handle_after_update`, which calls `publish(:updated, ...)`. The replicator already has a guard: ```ruby # ee/app/replicators/geo/project_repository_replicator.rb def should_publish_replication_event? return false unless super return true if model_record.is_a?(::Project) # always fires under V1 self.class.geo_project_repository_replication_v2_enabled? && model_record.is_a?(::ProjectRepository) end ``` So under V2, `Project`-based events are published but then silently dropped. The goal of this issue is to stop them from being published at all. --- ## What needs to be done ### 1. Gate `Project` event publishing under V2 In `should_publish_replication_event?`, flip the logic so that `Project`-based events are only published when V2 is **disabled** (i.e. V1): ```ruby def should_publish_replication_event? return false unless super if model_record.is_a?(::Project) return !self.class.geo_project_repository_replication_v2_enabled? end self.class.geo_project_repository_replication_v2_enabled? && model_record.is_a?(::ProjectRepository) end ``` ### 2. Suppress Project create/destroy hooks under V2 `geo_handle_after_create` and `geo_handle_after_destroy` are triggered by `after_create_commit` / `after_destroy` callbacks in `Geo::ReplicableModel` (which `Project` includes). Override these in the replicator to no-op when V2 is enabled and `model_record` is a `Project`: **File:** `ee/app/replicators/geo/project_repository_replicator.rb` (preferred) or `ee/app/models/ee/project.rb` ```ruby def geo_handle_after_create return if model_record.is_a?(::Project) && self.class.geo_project_repository_replication_v2_enabled? super end def geo_handle_after_destroy return if model_record.is_a?(::Project) && self.class.geo_project_repository_replication_v2_enabled? super end ``` Note: Change 3 below handles the `ProjectRepository` destroy case separately and explicitly; these overrides only suppress the Project path. ### 3. Ensure `ProjectRepository` events fire correctly on create/destroy Under V2, `ProjectRepository` records are the source of truth. Their `after_create_commit` and `after_destroy` callbacks (from `Geo::ReplicableModel`, included via `EE::ProjectRepository`) must fire correctly. Verify that: - Creating a `ProjectRepository` record publishes a `created` event. - Destroying a `ProjectRepository` record publishes a `deleted` event. - These are gated by `should_publish_replication_event?` returning `true` only under V2. When a project is destroyed, `ProjectRepository` is cleaned up asynchronously via the loose FK mechanism (SQL DELETE, bypasses ActiveRecord callbacks), so `ProjectRepository#after_destroy` never fires. To fix this, we capture a reference to `project_repository` before destruction and explicitly call `geo_handle_after_destroy` on it once the project is confirmed destroyed, mirroring the same pattern used for `stale_secrets_manager`. This ensures the Geo delete event is published with the correct `model_record_id`, allowing secondaries to clean up the registry record correctly. In `ee/app/services/ee/projects/destroy_service.rb`: ```ruby def execute stale_secrets_manager = project.secrets_manager stale_project_repository = project.project_repository super.tap do if project&.destroyed? mirror_cleanup(project) deprovision_secrets_manager(stale_secrets_manager) stale_project_repository&.geo_handle_after_destroy <= ensures end end end ``` `geo_handle_after_destroy` on a `ProjectRepository` calls the replicator with `model_record = project_repository`, so the event is published with `project_repository.id`. This is the correct ID for the secondary to look up and clean up the `ProjectRepositoryRegistry` record. This is only needed when V2 is ON (Change 2 suppresses the `Project` path). The `geo_handle_after_destroy` call on `ProjectRepository` is already gated by `should_publish_replication_event?`, which requires V2 to be enabled for ProjectRepository events. ### 4. `ProjectRepositoryReplicator#model_record` override (FF race condition) If the FF state changes between event publish and consume (secondary can lag ~1 minute), two failure modes exist: - `ActiveRecord::RecordNotFound` — `model.find(id)` fails because the ID doesn't exist in the current model's table - **Silent wrong record** — the ID coincidentally exists in the current model's table but belongs to a different entity (Project IDs and ProjectRepository IDs are independent sequences and can collide) Catching `RecordNotFound` alone is insufficient. We need to know what model class the ID was generated from. The following steps are a proposition from Claude: **Step 1:** Store the model class in the event payload by overriding `event_params`: ```ruby def event_params super.merge("model_class" => model_record.class.name) end ``` **Step 2:** Override `consume` to capture the stored model class hint before `model_record` is lazily loaded: ```ruby def consume(event_name, **event_data) @stored_model_class = event_data["model_class"] super end ``` **Step 3:** Override `model_record` to detect both failure modes and translate: ```ruby def model_record record = super # If we know what class was stored, check for a type mismatch (wrong record found) return record if @stored_model_class.nil? || record.class.name == @stored_model_class translate_model_record(record.id) rescue ActiveRecord::RecordNotFound # ID doesn't exist in the current model's table — also translate # Might only be worth translating when we're dealing with a Project or ProjectRepo translate_model_record(model_record_id) end def translate_model_record(id) if self.class.geo_project_repository_replication_v2_enabled? # V2 ON but received a Project ID — navigate Project → ProjectRepository ::Project.find_by(id: id)&.project_repository else # V2 OFF but received a ProjectRepository ID — navigate ProjectRepository → Project ::ProjectRepository.find_by(id: id)&.project end end ``` For events predating this change (no `model_class` in payload), `@stored_model_class` is nil and only the `RecordNotFound` path applies — imperfect but no worse than before. Updating `@model_record_id` ensures subsequent `registry` lookups (via `registry_class.for_model_record_id(model_record_id)`) also use the translated ID. ### 5. `ProjectRepositoryRegistry.for_model_record_id` safety net Defense-in-depth for events already in the Sidekiq queue when fixes are deployed (published with a Project ID, consumed after V2 is ON). `project_id` is always populated and unique in the registry, making it a reliable fallback. **File:** `ee/app/models/geo/project_repository_registry.rb` ```ruby def self.for_model_record_id(id) return find_or_initialize_by(project_id: id) unless ProjectRepositoryReplicator.geo_project_repository_replication_v2_enabled? find_by(project_repository_id: id) || find_or_initialize_by(project_id: id) end ``` ### 6. Verify state records still propagate under both versions** `Project` still includes `Geo::VerifiableModel`, so `project_states` records are written on save regardless of the FF. This must remain true under V2 for backward compatibility (dual-write is handled in [#591309](https://gitlab.com/gitlab-org/gitlab/-/work_items/591309)). Confirm that suppressing replication events does not accidentally suppress state record writes. --- ## Implementation plan ### Phase 1: Core event suppression 1. **Update `ProjectRepositoryReplicator#should_publish_replication_event?`** (Change 1) - Modify the method to suppress `Project`-based events when V2 is enabled - `Project` events should only publish when V2 is **disabled** (V1 mode) - Add unit tests asserting no events are published for `Project` records when V2 is enabled 2. **Override `geo_handle_after_create` and `geo_handle_after_destroy`** (Change 2) - Add overrides in `ProjectRepositoryReplicator` to no-op for `Project` records when V2 is enabled - Add unit tests for both methods under V1 and V2 scenarios ### Phase 2: ProjectRepository lifecycle events 3. **Handle `ProjectRepository` destroy via `Projects::DestroyService`** (Change 3) - Modify `ee/app/services/ee/projects/destroy_service.rb` to capture `project_repository` before destruction - Explicitly call `geo_handle_after_destroy` on the stale `ProjectRepository` after project destruction - Add integration tests verifying `ProjectRepository` delete events are published with correct IDs 4. **Verify `ProjectRepository` create events** (Change 3) - Add/update specs confirming `created` events fire for `ProjectRepository` records under V2 - Ensure events are correctly gated by `should_publish_replication_event?` ### Phase 3: FF toggle race condition handling 5. **Store model class in event payload** (Change 4, Step 1) - Override `event_params` in `ProjectRepositoryReplicator` to include `model_class` - Add specs verifying the payload includes the correct model class 6. **Capture stored model class on consume** (Change 4, Step 2) - Override `consume` method to extract `@stored_model_class` from event data - Add specs for the consume override 7. **Implement `model_record` translation logic** (Change 4, Step 3) - Override `model_record` to detect type mismatches and `RecordNotFound` errors - Implement `translate_model_record` helper to navigate between `Project` and `ProjectRepository` - Add comprehensive specs for: - V2 ON receiving a Project ID (translate to ProjectRepository) - V2 OFF receiving a ProjectRepository ID (translate to Project) - Legacy events without `model_class` in payload ### Phase 4: Registry safety net 8. **Update `ProjectRepositoryRegistry.for_model_record_id`** (Change 5) - Add fallback logic to find by `project_id` when `project_repository_id` lookup fails under V2 - Add specs for the fallback behavior with both V1 and V2 scenarios ### Phase 5: Verification and testing 9. **Verify `project_states` writes are unaffected** (Change 6) - Add/update specs confirming state records are written regardless of FF state - Ensure event suppression does not impact `Geo::VerifiableModel` behavior 10. **Add FF toggle integration tests** - Test V1 → V2 toggle: no duplicate events, existing synced repos remain synced - Test V2 → V1 toggle: no missing events, graceful handling of in-flight events - Test events published under one FF state and consumed under another --- ## Follow-up :warning: When the FF is removed and V2 becomes the only version, the `Project`-based event paths, the `project_states` table, and the associated DB triggers should be cleaned up. A separate issue should be created for this at that point.
issue