Change postgres logging configuration

Production Change - Criticality 4 C4

Change Objective	Describe the objective of the change
Services Impacted	postgres
Change Team Members	@abrandl
Change Severity	C4
Buddy check or tested in staging	Review + staging
Schedule of the change	tbd
Duration of the change	2 hours

Steps (for gprd):

Prepare:

Prepare chef change for environment https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1348
Stop chef-client: knife ssh 'roles:gprd-base-db-patroni' 'sudo service chef-client stop'
Merge chef change https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1348

Pick a single replica to apply the change:
Verify it's a replica: sudo gitlab-psql -c 'select pg_is_in_recovery()' should return t (true)
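
The replica check above can be turned into a guard so the following steps never run on the primary by accident. A minimal sketch: the helper name is_replica_output is hypothetical, and it assumes gitlab-psql passes the psql flags -t -A (tuples only, unaligned) through unchanged, which is not confirmed here.

```shell
# Succeeds only if the given psql output is exactly "t" (ignoring whitespace),
# i.e. pg_is_in_recovery() reported true.
# is_replica_output is a hypothetical helper name, not an existing tool.
is_replica_output() {
  [ "$(printf '%s' "$1" | tr -d '[:space:]')" = "t" ]
}

# Possible use on the chosen host (assumes gitlab-psql forwards -t -A to psql):
#   out=$(sudo gitlab-psql -t -A -c 'select pg_is_in_recovery()')
#   is_replica_output "$out" || { echo "not a replica, aborting" >&2; exit 1; }
```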

Drain replica from database traffic (runbook): consul maint -enable -service=patroni-replica -reason="Production issue 891"
Check traffic was drained (wait until this returns): while [ $(sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l) -gt 0 ]; do echo "."; sleep 1; done
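
The drain check above chains grep, cut, awk, and wc. A minimal sketch of the same count as a single awk filter, which is easier to try offline against captured output. The function name count_clients is hypothetical; it reads `SHOW CLIENTS;` output on stdin and mirrors the original filter (lines mentioning the database, second column not containing the monitoring user).

```shell
# count_clients [DATABASE] [IGNORED_USER]
# Counts rows that mention DATABASE and whose second '|'-separated column
# (the user, as in the original cut -f 2) does not contain IGNORED_USER.
# count_clients is a hypothetical helper name.
count_clients() {
  awk -F'|' -v db="${1:-gitlabhq_production}" -v ignore="${2:-gitlab-monitor}" '
    index($0, db) {
      user = $2
      gsub(/^[ \t]+|[ \t]+$/, "", user)   # trim padding around the user name
      if (index(user, ignore) == 0) n++
    }
    END { print n + 0 }                   # prints 0 when nothing matched
  '
}
```

Possible use: sudo pgb-console -c 'SHOW CLIENTS;' | count_clients prints the number of client connections still attached; the replica is drained when it prints 0.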

Run chef-client to apply the logging changes
Have patroni restart the database: gitlab-patronictl restart pg-ha-cluster $(hostname -f) --force
Verify database has recovered: gitlab-psql -c 'select 1' should not return an error
Verify replication lag is < 100 MB: while [ $(gitlab-patronictl list | grep $(hostname -f) | cut -d '|' -f 7 | awk '{$1=$1};1') -gt 100 ]; do echo "."; sleep 1; done
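
One caveat with the lag loop above: if the lag column is empty or not an integer, the [ ... -gt 100 ] test fails with an error and the loop exits as if the lag were fine. A hedged sketch of a stricter check follows. The helper names replica_lag_mb and lag_ok are hypothetical, and the 7th '|'-separated field is taken as the lag column only because the original command uses it.

```shell
# replica_lag_mb HOST
# Reads `gitlab-patronictl list` output on stdin and prints field 7
# (the lag column used by the original command) for the first row
# that mentions HOST. Hypothetical helper name.
replica_lag_mb() {
  awk -F'|' -v host="$1" '
    index($0, host) { v = $7; gsub(/[ \t]/, "", v); print v; exit }
  '
}

# lag_ok VALUE [MAX_MB]
# Succeeds only if VALUE is a plain number no greater than MAX_MB (default 100).
# An empty or non-numeric VALUE counts as "not OK" instead of passing silently.
lag_ok() {
  awk -v v="$1" -v max="${2:-100}" 'BEGIN {
    if (v !~ /^[0-9]+(\.[0-9]+)?$/) exit 2
    exit ((v + 0 <= max + 0) ? 0 : 1)
  }'
}
```

Possible use: lag_ok "$(gitlab-patronictl list | replica_lag_mb "$(hostname -f)")" returns success once the replica has caught up to within the threshold.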

Run gitlab-psql -c "select pg_sleep(2), 'loggingchange'" and find that line in Kibana (for this host)

Disable replica maintenance: consul maint -disable -service=patroni-replica
Verify traffic is back on the database (good when this returns): while [ $(sudo pgb-console -c 'SHOW CLIENTS;' | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l) -eq 0 ]; do echo "."; sleep 1; done
Wait until metrics for this replica look "normal" again. This criterion is deliberately fuzzy because we lack experience with which specific metrics to watch (see comments below).
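
The polling loops in the steps above have no upper bound, so a replica that never drains or never recovers would leave them spinning forever. A minimal sketch of a bounded poll, assuming a fixed timeout is acceptable for this change. The helper name wait_until and the timeout values are hypothetical, not part of the original plan.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds. Returns 0 on success, or 1 once at least
# TIMEOUT_SECONDS of sleeping have passed without success.
# wait_until is a hypothetical helper name.
wait_until() {
  local timeout_s="$1"
  local interval_s="$2"
  local waited=0
  shift 2
  until "$@"; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for: $*" >&2
      return 1
    fi
    printf '.'
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
}
```

Possible use for the recovery check: wait_until 300 1 gitlab-psql -c 'select 1' retries once per second and gives up after about five minutes, at which point the operator can investigate before re-enabling traffic.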

Rinse and repeat for other replicas.

Apply chef change to patroni-04 by running: chef-client
Perform graceful switchover (run locally from the chef repo): bin/graceful-failover gprd patroni-01-db-gprd.c.gitlab-production.internal
Rebuild postgres dead tuple stats with ANALYZE VERBOSE (during this time we'd expect false-positive dead tuple alerts) - takes up to 1 hour

Capture metrics to understand the impact of restarts better, so that we know how to automate this. See https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7179#note_188871425

Edited Jul 12, 2019 by Alex Hanselka