# Org Mover: Copy PG data during org maintenance window
## Problem
We need to copy the organization's PG data during the maintenance window.
(In the future, we will use logical replication, which will only need to catch up with any remaining data changes.)
## Proposal
At a high level, we need to:
- Generate all of the SQL copy commands for all of the tables
- Run the SQL copy commands against the legacy cell's replica DB(s)
- Generate all of the insert commands for all of that data
- Run the insert commands against the target cell's DB(s)
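As a rough sketch of what one generated command pair could look like (the table and filter here are hypothetical placeholders; the real commands are generated per table, see the linked issue below):

```sql
-- Hypothetical generated pair for one table (names and filter are placeholders).
-- Run against the legacy cell's replica:
COPY (SELECT * FROM projects WHERE organization_id = 42) TO STDOUT;
-- Run against the target cell, fed from the stream above:
COPY projects FROM STDIN;
```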
## More details
https://gitlab.com/gitlab-org/gitlab/-/issues/458759#note_1971410175
In the future, when we work on the `COPY` approach implementation, the following is worth including in the future blueprint:

- Due to FK cycles, it is worth using the `SET session_replication_role = 'replica';` trick, but with an additional check performed before the actual `COPY` action.
- PG16 used in the destination gave us 16h for the GitLab namespaces, but we believe we'll be able to reach <10h timing using several optimizations (not only parallelization for a single table).
- If we use `session_replication_role = replica`, it's worth having a pre-step that tests the data to ensure that we don't have dependent records that `COPY` doesn't take -- the `prepare` stage (see the sketch after this list).
- Additionally, we could have a `dry-run` stage: run `COPY` to some destination cluster that is not in use, to see how it works, to ensure that there are no errors, and to actually measure the timing. So the workflow could be: `prepare` -> `dry-run` (optional) -> `run`.
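A minimal sketch of what the `prepare`-stage data test could look like (all table, column, and ID values here are hypothetical; the real checks would be generated from the FK graph):

```sql
-- Hypothetical prepare check: merge requests selected via their target
-- project may still reference a source project (a fork) that is not part
-- of the copied set. COPY under session_replication_role = 'replica'
-- would load such rows without complaint, so we must catch them up front.
SELECT mr.id, mr.source_project_id
FROM merge_requests mr
JOIN projects tp ON tp.id = mr.target_project_id
WHERE tp.namespace_id = 42          -- hypothetical selected namespace
  AND mr.source_project_id IS NOT NULL
  AND mr.source_project_id NOT IN (
        SELECT id FROM projects WHERE namespace_id = 42)
LIMIT 100;
```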
Steps for the `dry-run` stage:
- Create a temporary target database for the `dry-run` step.
- Duplicate the schema (without data) to the target database.
- Perform the `COPY` with restore to the target database.
- Drop the temporary target database.

Example:
```shell
psql ... -c "create database org_mover_tmp_jhw4if7 owner gitlab"

pg_dump ... -d gitlabhq_production --schema-only | psql ... -d org_mover_tmp_jhw4if7

psql ... -d gitlabhq_production -c "COPY ( SELECT * FROM <table_name> WHERE ... IN (...) ) TO STDOUT ..." \
  | psql ... -d org_mover_tmp_jhw4if7 -c "COPY <table_name> FROM STDIN ..."
"COPY ... "
"COPY ... "

psql ... -c "drop database org_mover_tmp_jhw4if7 with (force)"
```
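Note that `drop database ... with (force)` requires PostgreSQL 13 or later, which holds for the PG16 destination mentioned above.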
> Due to FK cycles, it is worth using the `SET session_replication_role = 'replica';` trick, but with an additional check performed before the actual `COPY` action.

I added here the output of the cycles in debug mode, an example:
```
[WARNING] Cyclic dependencies detected: [["ci_pipelines", "ci_pipelines"], ["merge_requests", "merge_request_diffs", "merge_requests"], ["vulnerabilities", "vulnerability_occurrences", "vulnerabilities"], ["integrations", "integrations"]]
```
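For such cycles, the `session_replication_role` trick mentioned above lets the rows be loaded in any order. A minimal sketch, assuming a role with sufficient privileges to change this setting:

```sql
BEGIN;
-- Disable FK (and other) trigger enforcement for this transaction only,
-- so tables with cyclic dependencies can be loaded in any order.
SET LOCAL session_replication_role = 'replica';
COPY ci_pipelines FROM STDIN;
COPY merge_request_diffs FROM STDIN;
COMMIT;
```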
> If we use `session_replication_role = replica`, it's worth having a pre-step that tests the data to ensure that we don't have dependent records that `COPY` doesn't take -- the `prepare` stage.

I added here a check for the presence of all dependent tables; if some tables are missing, the script will fail and will not generate `COPY` commands.
Example:
```
./generate-selects-for-organization --root-namespace-ids *** --psql "psql -h localhost -p 6015 -U vitaliy -d gitlabhq_dblab -c"

[ERROR] Missing tables to cover all dependencies: ["jira_connect_installations", "namespaces", "epics", "sprints", "users", "milestones", "projects", "cluster_agents", "vulnerabilities", "vulnerability_scanners", "bulk_imports", "audit_events_instance_external_streaming_destinations", "zoekt_nodes", "zoekt_enabled_namespaces", "audit_events_instance_external_audit_event_destinations", "emails", "gitlab_partitions_dynamic.ci_builds_101", "p_ci_builds", "ci_builds", "gitlab_partitions_dynamic.ci_builds_102", "plans", "audit_events_group_external_streaming_destinations", "reviews", "geo_nodes", "ci_runners", "search_indices"]
```
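As a sketch, such a completeness check can be expressed against the catalog (assumed logic with placeholder table names; the actual script builds its own dependency list):

```sql
-- For a hypothetical list of tables scheduled for COPY, report FK parent
-- tables that are referenced but missing from the list.
WITH to_copy(rel) AS (
    VALUES ('projects'::regclass), ('merge_requests'::regclass)
)
SELECT DISTINCT confrelid::regclass AS missing_parent
FROM pg_constraint
WHERE contype = 'f'
  AND conrelid IN (SELECT rel FROM to_copy)
  AND confrelid NOT IN (SELECT rel FROM to_copy);
```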