Geo: Track each upload partition separately for replication and verification
## Summary

This epic tracks the work to implement Geo replication and verification for each individual upload partition table, rather than using the single `upload_states` table.

## Background

As discussed in [!221773](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/221773), specifically in [this comment](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/221773#note_3056075106) by @mkozono and [this follow-up](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/221773#note_3062604622) by @dbalexandre, the team has decided to track each upload partition separately.

### Why track each upload partition separately?

**Pros:**

- Improve performance of Geo queries when there are millions of uploads
- Reduce the friction to add new Geo data types generally
- Increase consistency and reliability
- Does not need required stops
- Anticipates the future work to [stop using the uploads table at all](https://gitlab.com/gitlab-org/gitlab/-/work_items/425484)

**Cons:**

- Requires more time invested upfront (boilerplate code for each data type)
- The Geo sites dashboard will need changes to support many more data types (tracked in [Iteration 4 in Geo Observability Phase 2](https://gitlab.com/groups/gitlab-org/-/work_items/16588))

---

## Phased Delivery Plan

### Phase 1: Foundation & First Replicator (POC)

**Goal:** Validate the approach with a single, low-risk partition

| Issue | Purpose |
|-------|---------|
| #589925 | Reduce SSF boilerplate for upload partition replicators |
| #589901 | `AbuseReport` uploads - **First replicator** |

**Exit Criteria:**

- [ ] Replicator pattern established and documented
- [ ] First partition replicating successfully on staging
- [ ] Performance baseline captured

### Phase 2: High-Volume Core Partitions

**Goal:** Tackle the most impactful partitions early to surface scaling issues

| Issue | Table | Rationale |
|-------|-------|-----------|
| #589915 | `project_uploads` | Highest volume |
| #589910 | `namespace_uploads` | Group-level, high usage |
| #589918 | `user_uploads` | User avatars, widespread |
| #589909 | `design_management_action_uploads` | Large files |

**Exit Criteria:**

- [ ] High-volume partitions performing well under load
- [ ] No degradation in Geo sync times

### Phase 3: Import/Export & Bulk Operations

**Goal:** Handle partitions critical for disaster recovery workflows

| Issue | Table |
|-------|-------|
| #589911 | `import_export_upload_uploads` |
| #589906 | `bulk_import_export_upload_uploads` |
| #589916 | `project_import_export_relation_export_upload_uploads` |
| #589919 | `user_permission_export_upload_uploads` |

**Exit Criteria:**

- [ ] Import/export workflows tested end-to-end with Geo
- [ ] Bulk migration scenarios validated

### Phase 4: Security & Compliance Partitions

**Goal:** Ensure vulnerability and compliance data replicates correctly

| Issue | Table |
|-------|-------|
| #589920 | `vulnerability_archive_export_uploads` |
| #589921 | `vulnerability_export_uploads` |
| #589922 | `vulnerability_export_part_uploads` |
| #589923 | `vulnerability_remediation_uploads` |
| #589907 | `dependency_list_export_uploads` |
| #589908 | `dependency_list_export_part_uploads` |

**Exit Criteria:**

- [ ] Security/compliance data integrity verified

### Phase 5: Remaining Partitions (Long Tail)

**Goal:** Complete coverage of all upload types

| Issue | Table |
|-------|-------|
| #589902 | `achievement_uploads` |
| #589903 | `ai_vectorizable_file_uploads` |
| #589904 | `alert_management_alert_metric_image_uploads` |
| #589905 | `appearance_uploads` (sharding key TBD) |
| #589912 | `issuable_metric_image_uploads` |
| #589913 | `organization_detail_uploads` |
| #589914 | `snippet_uploads` |
| #589917 | `project_topic_uploads` |

**Exit Criteria:**

- [ ] All 23 partition replicators implemented
- [ ] Full test coverage

### Phase 6: Switchover & Deprecation

**Goal:** Migrate from `upload_states` to partitioned tables

| Issue | Purpose |
|-------|---------|
| #589924 | Switch from uploads table to partitioned upload tables |

**Activities:**

1. Feature flag rollout (% ramp)
2. Dual-write period for verification
3. Deprecate `upload_states` table usage
4. Update Geo Observability dashboard (coordinate with [&16588](https://gitlab.com/groups/gitlab-org/-/work_items/16588))

**Exit Criteria:**

- [ ] 100% traffic on partitioned tables
- [ ] Legacy `upload_states` deprecated
- [ ] Documentation updated

---

## Phase Summary

| Phase | Issues | Focus | Risk |
|-------|--------|-------|------|
| 1 | 2 | Foundation + POC | Low |
| 2 | 4 | High-volume partitions | **High** |
| 3 | 4 | Import/Export | Medium |
| 4 | 6 | Security/Compliance | Medium |
| 5 | 8 | Long tail | Low |
| 6 | 1 | Switchover | **High** |
| **Total** | **25** | | |

## Key Risks & Mitigations

| Risk | Mitigation |
|------|------------|
| Performance regression on high-volume partitions | Phase 2 tackles these early; establish baselines in Phase 1 |
| Dashboard overwhelm (23+ new data types) | Coordinate with [&16588](https://gitlab.com/groups/gitlab-org/-/work_items/16588) before Phase 6 |
| `appearance_uploads` sharding key TBD | Resolve in Phase 5; low volume, can defer |
| Switchover data integrity | Dual-write period in Phase 6 |

---

## Child Issues Reference

| Model | Table Name | Sharding Key | Issue |
|-------|------------|--------------|-------|
| `AbuseReport` | `abuse_report_uploads` | `organization_id` | #589901 |
| `Achievements::Achievement` | `achievement_uploads` | `namespace_id` | #589902 |
| `Ai::VectorizableFile` | `ai_vectorizable_file_uploads` | `project_id` | #589903 |
| `AlertManagement::MetricImage` | `alert_management_alert_metric_image_uploads` | `project_id` | #589904 |
| `Appearance` | `appearance_uploads` | `TBD` | #589905 |
| `BulkImports::ExportUpload` | `bulk_import_export_upload_uploads` | `project_id` | #589906 |
| `Dependencies::DependencyListExport` | `dependency_list_export_uploads` | `organization_id, namespace_id, project_id` | #589907 |
| `Dependencies::DependencyListExport::Part` | `dependency_list_export_part_uploads` | `organization_id` | #589908 |
| `DesignManagement::Action` | `design_management_action_uploads` | `namespace_id` | #589909 |
| `Group` | `namespace_uploads` | `namespace_id` | #589910 |
| `ImportExportUpload` | `import_export_upload_uploads` | `project_id` | #589911 |
| `IssuableMetricImage` | `issuable_metric_image_uploads` | `namespace_id` | #589912 |
| `Organizations::OrganizationDetail` | `organization_detail_uploads` | `organization_id` | #589913 |
| `PersonalSnippet` | `snippet_uploads` | `organization_id` | #589914 |
| `Project` | `project_uploads` | `project_id` | #589915 |
| `Projects::ImportExport::RelationExportUpload` | `project_import_export_relation_export_upload_uploads` | `project_id` | #589916 |
| `Projects::Topic` | `project_topic_uploads` | `organization_id` | #589917 |
| `User` | `user_uploads` | `organization_id` | #589918 |
| `UserPermissionExportUpload` | `user_permission_export_upload_uploads` | `uploaded_by_user_id` | #589919 |
| `Vulnerabilities::ArchiveExport` | `vulnerability_archive_export_uploads` | `project_id` | #589920 |
| `Vulnerabilities::Export` | `vulnerability_export_uploads` | `organization_id` | #589921 |
| `Vulnerabilities::Export::Part` | `vulnerability_export_part_uploads` | `organization_id` | #589922 |
| `Vulnerabilities::Remediation` | `vulnerability_remediation_uploads` | `vulnerability_remediation_uploads` | #589923 |

## Related Links

- MR: [!221773](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/221773) - Add sharding key information to Geo upload_states table
- [Stop using the uploads table](https://gitlab.com/gitlab-org/gitlab/-/work_items/425484)
- [Geo Observability Phase 2 - Iteration 4](https://gitlab.com/groups/gitlab-org/-/work_items/16588)
epic