Change postgres logging configuration
C4
Production Change - Criticality 4
Change Objective | Describe the objective of the change |
---|---|
Services Impacted | postgres |
Change Team Members | @abrandl |
Change Severity | C4 |
Buddy check or tested in staging | Review + staging |
Schedule of the change | tbd |
Duration of the change | 2 hours |
Status
- 2019-07-08 15:00: Apply changes to all replicas (-1,-2,-3,-5,-6)
- Apply change to current primary (-04)
Plan
Steps (for gprd):
Prepare:
- Prepare chef change for environment: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1348
- Stop chef-client: knife ssh 'roles:gprd-base-db-patroni' 'sudo service chef-client stop'
- Merge chef change: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1348
Day 1: Execute for each replica instance:
Step 1: Pick replica
- Pick a single replica to apply the change to.
- Verify it's a replica: sudo gitlab-psql -c 'select pg_is_in_recovery()' should yield true
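The replica check can also be scripted so that later steps refuse to run on the primary. The following is a minimal sketch, not part of the reviewed plan. It assumes that gitlab-psql forwards the standard psql flags -A and -t to psql; the function name is_replica and the PSQL override variable are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the local node reports that it is in recovery.
# PSQL can be overridden for a dry run; the default is the wrapper used in this plan.
PSQL="${PSQL:-sudo gitlab-psql}"

is_replica() {
  local answer
  # -A (unaligned) and -t (tuples only) reduce the output to a bare "t" or "f".
  # Assumption: the gitlab-psql wrapper passes these flags through to psql.
  answer=$($PSQL -At -c 'select pg_is_in_recovery()') || return 2
  [ "$answer" = "t" ]
}
```

Usage on the chosen host would be along the lines of: is_replica || { echo "not a replica, aborting"; exit 1; }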
Step 2: Drain traffic
- Drain replica from database traffic (runbook): consul maint -enable -service=patroni-replica -reason="Production issue 891"
- Check traffic was drained (wait until this returns): while [ $(sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l) -gt 0 ]; do echo "."; sleep 1; done
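The drain check above loops forever if some client never disconnects. A bounded variant could look like the sketch below. The function names wait_for_zero and count_app_clients and the timeout value are hypothetical; the counting pipeline is copied from the check above.

```shell
#!/usr/bin/env bash
# wait_for_zero TIMEOUT_SECONDS COMMAND...
# Polls COMMAND once per second until it prints 0.
# Returns 0 when the count reaches zero, 1 on timeout, 2 if COMMAND fails.
wait_for_zero() {
  local timeout=$1; shift
  local waited=0 count
  while :; do
    count=$("$@") || return 2
    [ "$count" -eq 0 ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    echo "."
    sleep 1
    waited=$((waited + 1))
  done
}

# Same pipeline as the drain check in this plan, wrapped in a function.
count_app_clients() {
  sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 \
    | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l
}
```

Usage, with an arbitrary 10-minute limit: wait_for_zero 600 count_app_clients || echo "replica still has clients, investigate before restarting"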
Step 3: Apply changes
- Run chef-client to apply the logging changes
- Have patroni restart the database: gitlab-patronictl restart pg-ha-cluster $(hostname -f) --force
- Verify database has recovered: gitlab-psql -c 'select 1' should not return an error
- Verify replication lag is < 100 MB: while [ $(gitlab-patronictl list | grep $(hostname -f) | cut -d '|' -f 7 | awk '{$1=$1};1') -gt 100 ]; do echo "."; sleep 1; done
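One caveat with the lag loop above: if the extracted field is empty or non-numeric (for example while the instance is still starting), the test command errors out and the loop exits as if the lag were fine. The sketch below treats such values as "not caught up yet". The function names lag_mb and lag_ok are hypothetical, and the column position (field 7) is taken from the command above rather than verified against a particular patronictl version.

```shell
#!/usr/bin/env bash
# lag_mb HOST: reads `gitlab-patronictl list` output on stdin and prints
# field 7 (the lag column used in this plan) for the row matching HOST.
lag_mb() {
  grep -F "$1" | cut -d '|' -f 7 | awk '{$1=$1};1'
}

# lag_ok HOST MAX_MB: reads the same table on stdin; succeeds only if the
# lag is a plain number no larger than MAX_MB.
lag_ok() {
  local lag
  lag=$(lag_mb "$1")
  case "$lag" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric: not caught up yet
  esac
  [ "$lag" -le "$2" ]
}
```

Usage: until gitlab-patronictl list | lag_ok "$(hostname -f)" 100; do echo "."; sleep 1; done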
Step 4: Verify logging changes
- Run gitlab-psql -c "select pg_sleep(2), 'loggingchange'" and find that line in Kibana (for this host)
Step 5: Put traffic back on the replica
- Disable replica maintenance: consul maint -disable -service=patroni-replica
- Verify traffic is back on the database (good when this returns): while [ $(sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l) -eq 0 ]; do echo "."; sleep 1; done
- Wait until metrics for this replica look "normal" again (we are deliberately vague here because we lack experience as to what specifics to look for; see comments below).
Rinse and repeat for other replicas.
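When repeating the procedure, the logging verification query writes the same literal marker on every host. A per-host, per-run marker makes the Kibana search unambiguous. This is only a sketch; the function name marker_sql and the marker format are hypothetical.

```shell
#!/usr/bin/env bash
# marker_sql HOST: prints the verification query from this plan with a
# marker that includes the host name and the current Unix timestamp.
marker_sql() {
  local marker
  marker="loggingchange-$1-$(date +%s)"
  printf "select pg_sleep(2), '%s'" "$marker"
}

# On the replica (not executed here):
#   gitlab-psql -c "$(marker_sql "$(hostname -s)")"
```

The printed marker is then the exact string to search for in Kibana.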
Day 2:
- Apply chef change to patroni-04 by running: chef-client
- Perform graceful switchover (local from chef repo): bin/graceful-failover gprd patroni-01-db-gprd.c.gitlab-production.internal
- Rebuild postgres dead tuple stats with ANALYZE VERBOSE (takes up to 1 hour; during this time we'd expect false-positive dead tuple alerts)
Rollback
- Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1348 and merge
- Repeat above procedure for every instance
Finalize
- Capture metrics to understand the impact of restarts better, so that we know how to automate this. See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7179#note_188871425
Edited by Alex Hanselka