Security Database Decomposition Checklist
Security Database Decomposition Checklist
This is the single source of truth for tasks that need to be completed for the Security Database Decomposition project.
Everyone feel free to open their own issues or MRs to track activities/changes, but please link them here so we can keep track of them. Feel free to update this issue with additional tasks, timeline adjustments, and links to related issues/MRs for each task to maintain a single source of truth.
Note on DRIs: I've included DRI placeholders for each section. Instead of section-level DRIs, we can assign DRIs to specific tasks based on expertise and availability. Please feel free to volunteer for specific tasks by adding your name as the DRI next to the relevant items.
Reference:
- Database Decomposition Strategies: Near Zero-Downtime Migration
- Decomposition: Rollout decomposed Sec database on GitLab.com (staging and production)
- Sec Database Decomposition Working Group- epic
Communication and Planning [Required for decomposition]
DRI: @ghavenga @theoretick @rmar1
-
Define the dates for Phase 4 (Read Traffic Validation) and Phase 7 (Production Decomposition) -
Send calendar invites for Phase 4 and Phase 7 change windows
-
-
Define Communications plan and get necessary approvals - https://gitlab.com/gitlab-org/gitlab/-/issues/514047 -
Execute internal and external announcements based on the approved Communications plan
-
-
Open new CRs: -
Phase 4 GSTG CR - Validate Read Traffic Switchover in Staging - gitlab-com/gl-infra/production#19546 (closed) -
Phase 4 GPRD CR - Validate Read Traffic Switchover in Production - gitlab-com/gl-infra/production#19339 (closed) -
Phase 7 GSTG CR - Production Sec Decomposition and Cutover in Staging - blocked on CR template -
Phase 7 GPRD CR - Production Sec Decomposition and Cutover in Production - blocked on CR template
-
Issue Templates and CR Documentation [Required for execution]
DRI: @jjsisson @theoretick
- [-] Create detailed issue template for Phase 4
-
Create detailed issue template for Phase 7 - MR gitlab-com/gl-infra/db-migration!568 (merged) - @jjsisson -
Create Phase 4 CRs for both GSTG and GPRD using the templates -
Create Phase 7 CRs for both GSTG and GPRD using the templates -
Document rollback procedures and decision triggers in CR templates
Ref. Initial Setup
Infrastructure Setup [Required for implementation] -DRI: @jjsisson
- Standup Sec cluster as a physical standby cluster:
-
GSTG -
GPRD
-
- Configure PgBouncers for Sec cluster:
-
GSTG -
GPRD
-
- Reconfigure database pool connections for Sec Cluster's Web and Sidekiq PgBouncers:
-
GSTG -
GPRD
-
- Reconfigure Sec Application to connect to Sec Cluster's Web and Sidekiq PgBouncers:
- Reconfigure database pool connections for Main Cluster's Web and Sidekiq PgBouncers (once Sec application connect directly through the Sec PgBouncers, we need to reduce db pool connections on the Main cluster's PgBouncers):
- Validate physical replication between Main and Sec clusters:
-
GSTG -
GPRD
-
- Configure Consul services appropriately for both clusters:
-
GSTG -
GPRD
-
Monitoring and Observability [Required for validation]
DRI:
-
Build monitoring dashboards for Sec Decomposition process: - gitlab-com/runbooks!8562 (merged) - @jjsisson -
Replication lag monitoring for both GSTG and GPRD -
Traffic distribution metrics before/during/after switchover
-
-
Set up Patroni and PgBouncer dashboards for Sec cluster (mirroring Main cluster dashboards) -
Configure appropriate alerts for the decomposition process
Ansible Playbooks [Required for automation]
DRI: @jjsisson
-
Build Ansible Playbooks for physical-to-logical replication conversion -
Build Ansible Playbooks for Read and RW traffic switchover procedures -
Build Ansible Playbooks for rollback procedures -
Build Ansible Playbooks for write-lock trigger implementation (if applicable) -
Document all playbooks with usage instructions and parameters
db-benchmarking Environment Setup [Required for testing]
DRI:
-
Create Mock Main cluster in db-benchmarking
- Setup db-benchmarking Environment - @jjsisson -
Create physical standby Sec cluster in db-benchmarking
-
Configure PgBouncers for Sec cluster -
Set up Consul services for both clusters
End-to-End Testing in the db-benchmarking Environment [Required before Phase 7]
DRI: @jjsisson @bshah11 @theoretick
-
Validate Ansible playbooks for physical-to-logical replication conversion -
Test write-lock trigger implementation (SQL statement block or Ansible playbook) -
Validate Ansible playbooks for Sec cluster PgBouncer reconfiguration operations (Pause, reconfigure, Reload, Resume) -
Validate Ansible playbooks for Read and RW traffic switchover (simulated without actual site traffic) -
Validate Ansible playbooks for rollback procedures -
Verify monitoring dashboards function correctly during simulated operations -
Document any issues encountered and their resolutions for production implementation
Phase 4: Validate Read Traffic Switchover
Staging (GSTG) Validation - Phase 4 [Required before Production]
DRI: @jjsisson @theoretick
-
Create Application MR for redirecting read traffic to Sec cluster -
Execute Phase 4 GSTG CR: -
Read Traffic Switchover to Sec Cluster -
Monitor application functionality and performance on Sec cluster -
Execute controlled rollback to Main cluster gitlab-com/gl-infra/production#19578 (closed)
-
-
Document findings and update GPRD plan based on GSTG experience @theoretick
Production (GPRD) Execution - Phase 4
DRI: @jjsisson @theoretick
-
Create Application MR for redirecting read traffic to Sec cluster -
Execute Phase 4 GPRD CR: -
Read Traffic Switchover to Sec Cluster -
Monitor application functionality and performance on Sec cluster -
Execute controlled rollback to Main cluster gitlab-com/gl-infra/production#19579 (closed)
-
-
Document findings and update Phase 7 plan based on Phase 4 experience
Phase 7: Production Sec Decomposition and Cutover
Staging (GSTG) Execution - Phase 7 [Required before Production]
DRI:
-
Prepare necessary MRs: -
Application MR for redirecting read traffic to Sec cluster -
Write-lock triggers MR for Security tables in Main cluster (if not handled via Ansible) -
Chef MR to persist PgBouncer configuration changes to master.patroni-sec
-
Additional MRs (monitoring dashboards, Teleport access, etc.)
-
-
Execute Phase 7 GSTG CR: -
Convert Physical to Logical replication -
Read Traffic Switchover to Sec cluster -
RW Traffic Switchover to Sec cluster -
Establish reverse logical replication -
Monitor application functionality and performance
-
-
Document findings and update GPRD plan based on GSTG experience
Production (GPRD) Execution - Phase 7
DRI:
-
Prepare necessary MRs: -
Application MR for redirecting read traffic to Sec cluster -
Write-lock triggers MR for Security tables in Main cluster (if not handled via Ansible) -
Chef MR to persist PgBouncer configuration changes to master.patroni-sec
-
Terraform/Chef MR for Analytics team to use Sec cluster's disk snapshot -
Additional MRs (monitoring dashboards, Teleport access, etc.)
-
-
Determine if Delayed DR cluster is needed for Sec cluster and prepare accordingly -
Execute Phase 7 GPRD CR: -
Convert Physical to Logical replication -
Read Traffic Switchover to Sec cluster -
RW Traffic Switchover to Sec cluster -
Establish reverse logical replication -
Monitor application functionality and performance
-
Post-Decomposition Tasks
DRI:
-
Immediate post-migration validation: -
Verify data consistency and application behavior -
Confirm all services are functioning properly -
Monitor performance metrics for 24-48 hours
-
-
Documentation and cleanup: -
Update documentation with final architecture -
Update issue templates for future decompositions -
Remove temporary configurations
-
-
Cluster optimization: -
Clean up Security tables in the Main cluster gitlab-org/gitlab#523925 (closed) -
Clean up Main tables in the Sec cluster to reclaim disk space -
Develop a Zero-Downtime Sec Cluster resize approach using rolling upgrade and Switchover Patroni Leader playbook -
Resize Sec Cluster based on disk space and CPU requirements https://gitlab.com/groups/gitlab-com/gl-infra/data-access/dbo/-/epics/51
-
-
Project closure: -
Conduct formal retrospective and document lessons learned (https://gitlab.com/gitlab-com/gl-infra/data-access/dbo/dbo-issue-tracker/-/issues/375 and https://gitlab.com/gitlab-org/gitlab/-/issues/536913) -
Create follow-up issues for any identified improvements
-