WIP: Reduce the number of manual actions for database-related steps
issue: #724 (closed)
TODO:
- improve work with tombstone: wait until data comes, don't drop DB
- correct stopping/starting repmgr on ALL nodes, do it in the proper order (when stopping, the master goes last)
- double-check that 10 seconds is a good threshold once streaming replication is ON (tombstone solves it, no need to directly check the lag anymore)
- rework code to use chef; integrate with morchestra
- double-check hostnames
- write to the tombstone table before the check (see #716 (closed))
- forbid regular connections to Azure nodes
- stop chef, repmgr
- stop consul
- regular postgres switchover
  - do not use trigger_file to promote, use /opt/gitlab/embedded/bin/pg_ctl -D /var/opt/gitlab/postgresql/data promote
- check that new master is not in recovery mode
- all the steps after DB switchover happened:
  - more checks
  - restoring the state
  - start repmgr
  - start consul
  - start chef
- register all GCP DB nodes in repmgr after switchover, in proper order
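The promote and "not in recovery" items above could be combined roughly as follows. This is only a sketch under assumptions: the pg_ctl path and data directory are taken from the TODO item, while `promote_and_verify`, `in_recovery`, `PSQL_CMD` (defaulting to `gitlab-psql`) and `POLL_INTERVAL` are hypothetical names introduced for illustration and are not part of the existing scripts.

```shell
#!/bin/sh
# Sketch: promote a standby with pg_ctl, then poll until it has left recovery.
# PG_CTL and PGDATA defaults come from the TODO list; everything else is an assumption.
PG_CTL="${PG_CTL:-/opt/gitlab/embedded/bin/pg_ctl}"
PGDATA="${PGDATA:-/var/opt/gitlab/postgresql/data}"
PSQL_CMD="${PSQL_CMD:-gitlab-psql}"      # assumed client wrapper, override as needed
POLL_INTERVAL="${POLL_INTERVAL:-1}"      # seconds between checks

in_recovery() {
  # Prints "t" while the node is still a standby, "f" once it acts as a primary.
  $PSQL_CMD -At -c 'SELECT pg_is_in_recovery()'
}

promote_and_verify() {
  attempts="${1:-30}"
  # Promote directly instead of relying on trigger_file.
  "$PG_CTL" -D "$PGDATA" promote || return 1
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(in_recovery)" = "f" ]; then
      echo "new master is out of recovery"
      return 0
    fi
    i=$((i + 1))
    sleep "$POLL_INTERVAL"
  done
  echo "new master is still in recovery" >&2
  return 1
}
```

It would be called on the node being promoted, for example `promote_and_verify 60`, and the later steps (start repmgr, consul, chef) would run only if it returns 0.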
Regarding the 10-second threshold: right now the replication delay is very often above 10 seconds. Observing prod:
$ for i in {1..5}; do check_gcp_replication_delay; sleep 5; done
Check if GCP delay < 10s): OK (delay: ~4s)
Check if GCP delay < 10s): FAIL (delay: ~11s)
Check if GCP delay < 10s): FAIL (delay: ~18s)
Check if GCP delay < 10s): OK (delay: ~8s)
Check if GCP delay < 10s): FAIL (delay: ~15s)
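To put a number on how often the threshold is breached, the samples above can be tallied. This is a minimal sketch; `count_breaches` is a hypothetical helper, not part of the existing check scripts. It treats a delay at or above the threshold as a failure, matching the "delay < 10s" check.

```shell
#!/bin/sh
# Sketch: count how many delay samples (whole seconds) are at or above a threshold.
# count_breaches is a hypothetical helper introduced for illustration.
count_breaches() {
  threshold="$1"
  shift
  breaches=0
  for delay in "$@"; do
    if [ "$delay" -ge "$threshold" ]; then
      breaches=$((breaches + 1))
    fi
  done
  echo "$breaches of $# samples at or above ${threshold}s"
}

# The five delays observed on prod above.
count_breaches 10 4 11 18 8 15
# prints "3 of 5 samples at or above 10s"
```

With three of five samples failing, a fixed 10-second lag check would block the switchover most of the time, which is consistent with relying on the tombstone instead.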
Edited by Nikolay Samokhvalov