Migration to GCP. Database-related improvements (the roadmap)
Postgres-related TODO items for GCP migration:
- Enable streaming replication from the Azure master to a GCP replica. WAL shipping (based on WAL-E) remains as a backup. migration#715 (closed) (older related issue: migration#170 (closed))
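A sketch of the replica-side configuration this implies (hostname, replication user, and paths are placeholders; a real WAL-E setup usually wraps the command in `envdir` with its credentials):

```
# recovery.conf on the GCP replica (PostgreSQL 9.x style; values hypothetical)
standby_mode = 'on'
# stream WAL directly from the Azure master
primary_conninfo = 'host=azure-master.example.com port=5432 user=replication sslmode=require'
# fall back to fetching archived WAL via WAL-E if streaming falls behind
restore_command = 'wal-e wal-fetch "%f" "%p"'
```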
- Benchmark Postgres on Azure and GCP instances: we want to ascertain that the GCP instance can handle the load.
  - Prep:
    - Benchmark https://gitlab.com/gitlab-com/infrastructure/issues/4632
      - pgbench benchmarks
      - disk-level benchmarks
      - "real workload" benchmarks
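The pgbench leg of the prep could look like the following (scale factor, client count, and duration are illustrative, not decided values); the commands are printed rather than executed so the plan can be reviewed before running it against real instances:

```shell
# Hypothetical pgbench parameters -- tune against the real instance sizes.
SCALE=1000        # pgbench scale factor (~15 GB of generated data)
CLIENTS=32        # concurrent client connections
DURATION=600      # seconds per run

# Print the planned invocations (run on both the Azure and GCP candidates).
echo "pgbench -i -s ${SCALE} benchdb"
echo "pgbench -c ${CLIENTS} -j 8 -T ${DURATION} -P 60 benchdb"
```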
- Benchmark network bandwidth and latency between Azure and GCP: this will aid in predicting timings during the migration.
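The bandwidth numbers feed directly into the timing estimates. A back-of-the-envelope calculation (all figures hypothetical) for how long the GCP replica needs to catch up on accumulated WAL, given that the link must outrun WAL generation:

```shell
LAG_BYTES=$((10 * 1024 * 1024 * 1024))   # 10 GiB of accumulated WAL (hypothetical)
LINK_BPS=125000000                        # 1 Gbit/s Azure->GCP link, in bytes/s
WAL_BPS=20000000                          # master generates ~20 MB/s of new WAL

# Catch-up converges only while the link delivers faster than new WAL appears.
CATCHUP_S=$(awk -v l="$LAG_BYTES" -v b="$LINK_BPS" -v w="$WAL_BPS" \
  'BEGIN { printf "%d", l / (b - w) }')
echo "estimated catch-up: ${CATCHUP_S}s"
```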
- Set up a “tombstone” table to help signal the end of data replication. migration#716 (closed)
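One way the tombstone could work (table and column names are hypothetical): a single row written on the Azure side as the very last write, whose arrival on the GCP replica proves replication has fully caught up:

```sql
-- On the Azure master, as the last write before cutover (names hypothetical):
CREATE TABLE IF NOT EXISTS migration_tombstone (
    id         serial PRIMARY KEY,
    note       text,
    created_at timestamptz NOT NULL DEFAULT now()
);
INSERT INTO migration_tombstone (note) VALUES ('end of Azure writes');

-- On the GCP replica, poll until this returns the row:
SELECT * FROM migration_tombstone WHERE note = 'end of Azure writes';
```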
- After downtime starts, update pg_hba.conf and reload the configuration to prevent writes in Azure. migration#717 (closed)
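A sketch of what the pg_hba.conf change might look like (networks and users are placeholders). Rules are matched first-match-wins, so the narrow allows come before the catch-all reject:

```
# pg_hba.conf on the Azure master during downtime (addresses hypothetical)
# Keep replication to GCP and local admin access; reject everything else.
host   replication  replication  10.1.0.0/16  md5
local  all          postgres                  peer
host   all          all          0.0.0.0/0    reject
```

A reload (`pg_ctl reload` or `SELECT pg_reload_conf();`) applies this without a restart, but it only affects new connections; sessions that are already open still have to be terminated separately.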
- Develop a switch-back mechanism:
  - based on logical decoding
  - additionally, based on physical replication
- Do not use repmgr for the switchover. Switch over to the designated Postgres GCP host using the basic/native mechanism (pg_ctl promote or the trigger file). migration#718 (closed)
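The native promotion referred to above can be as small as the following (the data directory and trigger-file path are placeholders; the trigger-file route only works if recovery.conf names that path in trigger_file). Commands are printed rather than executed, as a reviewable sketch:

```shell
PGDATA=/var/opt/gitlab/postgresql/data    # hypothetical data directory
TRIGGER_FILE=/tmp/postgresql.trigger      # must match trigger_file in recovery.conf

# Option 1: promote the standby directly.
echo "pg_ctl -D ${PGDATA} promote"
# Option 2: create the trigger file and let the standby promote itself.
echo "touch ${TRIGGER_FILE}"
```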
- Review pgbouncer configuration on GCP.
- Study the potential data-loss scenario with pgbouncer and Consul DNS. Implement STONITH and/or adjust the pgbouncer DNS TTL configuration.
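If the DNS-TTL route is taken, the relevant pgbouncer knobs are dns_max_ttl and dns_nxdomain_ttl (the values below are illustrative, not tuned):

```ini
; pgbouncer.ini (illustrative values)
[pgbouncer]
; cap how long a resolved backend address may be cached, in seconds
dns_max_ttl = 15
; how long to cache a failed lookup before retrying
dns_nxdomain_ttl = 15
```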
- Review all database-related steps of the migration. migration#724 (closed)
  - script all pre- and post-checks
  - add more health checks for every step
  - add time estimates for all steps
- Turn SSL on for pgbouncers (double-check performance).
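The pgbouncer settings involved are the client/server TLS options (available from pgbouncer 1.7; the key and certificate paths below are placeholders). The performance caveat is the extra handshake and CPU cost, which is what the benchmarking should quantify:

```ini
; pgbouncer.ini TLS settings (paths hypothetical)
[pgbouncer]
client_tls_sslmode = require
client_tls_key_file = /etc/pgbouncer/pgbouncer.key
client_tls_cert_file = /etc/pgbouncer/pgbouncer.crt
; also encrypt pgbouncer -> Postgres connections
server_tls_sslmode = require
```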
- Do at least 5 DB-only or full migration trials using the final version of the procedure, without code changes ("code freeze").
- Analyze the current workload on Azure and the current and short-term future needs for the GCP DB nodes; compare with the benchmarked capacity and GCP specs. https://gitlab.com/gitlab-com/infrastructure/issues/4650 (related: https://gitlab.com/gitlab-com/infrastructure/issues/4632)
- Reduce disk reads and writes on the master. Currently (observed visually with iotop, during ~10-minute periods): 100–200 MB/s reads and 50–100 MB/s writes, with spikes up to 1 GB/s reads and up to 200 MB/s writes. Aggressive autovacuum is the main contributor here and has to be tuned to be less aggressive. A good target would be <<100 MB/s for reads and <<50 MB/s for writes. https://gitlab.com/gitlab-com/infrastructure/issues/4653
  - gpg is among the main write-I/O contributors. Consider moving WAL encryption away from the master. https://gitlab.com/gitlab-com/infrastructure/issues/4654
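The autovacuum throttling mentioned above is mainly governed by the cost-based vacuum delay settings; a sketch (the values are starting points to benchmark, not recommendations — throttling too hard trades I/O for table bloat):

```
# postgresql.conf (illustrative values -- benchmark before adopting)
autovacuum_vacuum_cost_delay = 20ms   # longer sleeps between chunks of work -> gentler I/O
autovacuum_vacuum_cost_limit = 200    # smaller work budget spent before each sleep
autovacuum_max_workers = 3            # fewer concurrent workers, less parallel I/O
```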
- default_statistics_target = 1000 has to be applied to all nodes before the migration. #105 (closed)
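This can be rolled out via postgresql.conf or ALTER SYSTEM; either way a reload is enough (no restart), but planner statistics only grow to the new target at the next ANALYZE of each table:

```sql
-- applies cluster-wide; requires a configuration reload, not a restart
ALTER SYSTEM SET default_statistics_target = 1000;
SELECT pg_reload_conf();
```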
- Double-check that all scripts that stop Postgres will not fail due to active sessions.
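One concrete pitfall here: a "smart" shutdown waits for all clients to disconnect and can block indefinitely, while "fast" rolls back active sessions and still shuts down cleanly. A sketch, with the command printed for review (the data directory is a placeholder):

```shell
PGDATA=/var/opt/gitlab/postgresql/data   # hypothetical data directory

# fast mode disconnects active sessions instead of waiting for them,
# so a scripted stop cannot hang on lingering connections
echo "pg_ctl -D ${PGDATA} -m fast stop"
```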