WIP: Reduce the number of manual actions for database-related steps

Nikolay Samokhvalov requested to merge db_steps_automation into master

issue: #724 (closed)

TODO:

  • improve tombstone handling: wait until the data arrives, don't drop the DB
  • correct stopping/starting repmgr on ALL nodes, do it in the proper order (when stopping, the master goes last)
  • double-check that 10 seconds is a good threshold once streaming replication is ON (tombstone solves it, no need to directly check the lag anymore)
  • rework code to use chef
  • integrate with morchestra
  • double-check hostnames
  • write to the tombstone table before the check (see #716 (closed))
  • forbid regular connections to Azure nodes
  • stop chef, repmgr
  • stop consul
  • regular postgres switchover
  • do not use trigger_file to promote, use /opt/gitlab/embedded/bin/pg_ctl -D /var/opt/gitlab/postgresql/data promote
  • check that new master is not in recovery mode
  • all the steps after DB switchover happened:
    • more checks
    • restoring the state
    • start repmgr
    • start consul
    • start chef
  • register all GCP DB nodes in repmgr after switchover, in proper order
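
A minimal sketch of the "promote with pg_ctl, then check that the new master is not in recovery mode" items above. The pg_ctl path and data directory are taken from the TODO list; the `PSQL` variable, the helper names and the polling loop are assumptions for illustration, not what this MR actually implements.

```shell
#!/usr/bin/env bash
# Sketch only: promote a standby with pg_ctl and confirm it left recovery mode.
# PG_CTL and PGDATA come from the TODO list; PSQL, POLL_INTERVAL and the
# function names are hypothetical.

PG_CTL="${PG_CTL:-/opt/gitlab/embedded/bin/pg_ctl}"
PGDATA="${PGDATA:-/var/opt/gitlab/postgresql/data}"
PSQL="${PSQL:-psql}"   # assumption: psql is on PATH and can connect locally

# Prints "t" while the node is still a standby, "f" once it is a primary.
in_recovery() {
  "$PSQL" -XAtc 'SELECT pg_is_in_recovery()'
}

# Promote, then poll until pg_is_in_recovery() returns "f" or attempts run out.
promote_and_verify() {
  local attempts="${1:-30}" i=0
  "$PG_CTL" -D "$PGDATA" promote || return 1
  while [ "$i" -lt "$attempts" ]; do
    if [ "$(in_recovery)" = "f" ]; then
      echo "new master is out of recovery"
      return 0
    fi
    sleep "${POLL_INTERVAL:-1}"
    i=$((i + 1))
  done
  echo "new master is still in recovery" >&2
  return 1
}
```

Calling `promote_and_verify 30` on the node being promoted would give it roughly 30 seconds to leave recovery before the script reports a failure; the later steps (repmgr, consul, chef) should only start after it returns 0.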

Regarding the 10-second threshold: right now the delay is very often well above 10 seconds. Observed on prod:

$ for i in {1..5}; do check_gcp_replication_delay; sleep 5; done
Check if GCP delay < 10s): OK (delay: ~4s)
Check if GCP delay < 10s): FAIL (delay: ~11s)
Check if GCP delay < 10s): FAIL (delay: ~18s)
Check if GCP delay < 10s): OK (delay: ~8s)
Check if GCP delay < 10s): FAIL (delay: ~15s)
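
Since the delay flaps around the threshold like this, the tombstone idea from the TODO list can replace the direct lag check: write a marker on the old master and wait until it shows up on the GCP side. A rough sketch, assuming a hypothetical `tombstone` table with a `token` column and hypothetical `MASTER_HOST` / `REPLICA_HOST` variables (none of these names come from this MR):

```shell
#!/usr/bin/env bash
# Sketch only: wait for a tombstone row to replicate instead of comparing the
# lag against a fixed threshold. The table "tombstone", the column "token",
# MASTER_HOST and REPLICA_HOST are assumptions for illustration.

# Run one SQL statement on the current master / on the GCP replica.
run_on_master()  { psql -XAt -h "$MASTER_HOST"  -c "$1"; }
run_on_replica() { psql -XAt -h "$REPLICA_HOST" -c "$1"; }

# Insert a tombstone on the master, then poll the replica until it is visible.
# $1: unique token generated by the caller (e.g. a timestamp), $2: max polls.
wait_for_tombstone() {
  local token="$1" max_polls="${2:-60}" polls=0 seen
  run_on_master "INSERT INTO tombstone (token) VALUES ('$token')" >/dev/null || return 1
  while [ "$polls" -lt "$max_polls" ]; do
    seen="$(run_on_replica "SELECT count(*) FROM tombstone WHERE token = '$token'")"
    if [ "$seen" = "1" ]; then
      echo "tombstone $token seen on replica after $polls polls"
      return 0
    fi
    sleep "${POLL_INTERVAL:-1}"
    polls=$((polls + 1))
  done
  echo "tombstone $token not seen on replica" >&2
  return 1
}
```

With writes already stopped on the old master, `wait_for_tombstone "$(date +%s)" 120` returning 0 means everything written before the tombstone has reached the replica, whatever the momentary lag was.
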
Edited by Nikolay Samokhvalov