Increase Patroni's patience when talking with Consul
Production Change - Criticality 1

Context and motivation for this change:
This change aims to reduce (but not eliminate) the frequency of unnecessary Patroni failover events, while still allowing Patroni to failover promptly when necessary.
Failover events always cause downtime, typically lasting 1-3 minutes. The writable primary Postgres database is a single-point-of-failure (SPOF). Patroni's automated failover mitigates this SPOF by quickly promoting a replica to become the new writable primary database, but during the failover event, there is always a period of time when dependent applications cannot write to the database.
Most of the Patroni failover events in the last 3 months appear to be caused by short-lived network connectivity disruption involving the current Patroni leader (i.e. the primary database). When that disruption lasts significantly less time than a failover would take, it would cause less impact to our services to avoid the failover. Addressing the network connectivity instability is a separate issue. This change only aims to improve the end-user experience by making Patroni wait a little longer before initiating failover.
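For concreteness, here is a sketch of the before and after values (taken from step 4 and the rollback section below), annotated with the Patroni semantics as we understand them:

```sh
# Current settings (see rollback section below):
#   ttl: 30            # leader lock time-to-live, in seconds; if the leader
#                      # cannot renew its lock within this window, the other
#                      # members consider it dead and elect a new leader
#   retry_timeout: 10  # how long Patroni retries failing DCS/Postgres
#                      # operations before voluntarily demoting itself
#
# New settings (applied in step 4): tolerate connectivity blips of up to
# roughly 90s before any member initiates a failover.
gitlab-patronictl edit-config --set ttl=90 --set retry_timeout=30
```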
Related reading:
- Short on time? This note concisely and concretely summarizes the 3 known flavors of how Patroni failovers are being triggered in our `gprd` environment. It then outlines how each of the proposed remedies can improve the outcome.
- This note describes in more detail the rationale behind this config change, as well as a few additional changes.
- This issue corresponds directly to this configuration change.
Change request:
| Change Objective | Increase Patroni's `ttl` and `retry_timeout` settings so that short-lived Consul connectivity disruptions no longer trigger unnecessary failovers |
|---|---|
| Change Type | C1 |
| Services Impacted | Patroni |
| Change Team Members | @msmiley |
| Change Severity | ~S3. Not applying the change makes it likely that we will continue to have unwanted Patroni failover events. Applying the change could potentially cause a Patroni failover event, which is why early steps in the sequence ask Patroni to forbid failovers during the maintenance window. |
| Buddy check | @ahmadsherif |
| Tested in staging | The change was tested in the staging environment |
| Schedule of the change | 2019-10-10 19:00 UTC (near the start of our production change window) |
| Duration of the change | 30-60 minutes (including a possible rollback) |
| Downtime Component | None expected, but a Patroni failover may occur if something unplanned occurs. |
Detailed steps for the change:
1. Check initial state. Run this on any Patroni cluster member (e.g. `patroni-01`).
   - Command: `gitlab-patronictl list`
   - Command: `gitlab-patronictl show-config`
   - Verify all Patroni cluster members are healthy and running the Patroni agent.
   - Note which Patroni member is the current leader.
   - Note the initial config, for comparison later.
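   - For reference, a purely illustrative sketch of the expected `gitlab-patronictl list` output (the exact layout and columns vary by Patroni version, and the hosts shown here are made up):

     ```sh
     gitlab-patronictl list
     # +---------------+------------+----------+--------+---------+----+-----------+
     # |    Cluster    |   Member   |   Host   |  Role  |  State  | TL | Lag in MB |
     # +---------------+------------+----------+--------+---------+----+-----------+
     # | pg-ha-cluster | patroni-01 | 10.x.x.1 | Leader | running |  5 |           |
     # | pg-ha-cluster | patroni-02 | 10.x.x.2 |        | running |  5 |       0.0 |
     # +---------------+------------+----------+--------+---------+----+-----------+
     # All members should be 'running'; note which member holds the 'Leader' role.
     ```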
2. Disable chef-client on all Patroni cluster members, so it does not race with the dynamic config changes to be applied below.
   - Command (for production only): `export TARGET_ENV="gprd"`
   - Command (for staging only): `export TARGET_ENV="gstg"`
   - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'sudo systemctl stop chef-client.service'`
   - Verify there are no residual `chef-client` processes running:
     - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'pgrep -a -f "chef-client"'`
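   - If you want this check to block until the fleet is clean, a small polling loop along these lines may help (a sketch; the `[c]hef-client` pattern keeps pgrep from matching its own ssh-spawned command line):

     ```sh
     # Poll until no Patroni node reports a lingering chef-client process.
     while knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'pgrep -a -f "[c]hef-client"' \
         | grep -q 'chef-client'; do
       echo "chef-client still running somewhere; retrying in 10s..."
       sleep 10
     done
     ```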
3. Pause Patroni auto-failover. Run this once on any Patroni cluster member (e.g. `patroni-01`).
   - Purpose: This step forbids failover from occurring, instructing all cluster members to continue to treat the current Patroni leader as the leader until further notice. This protects against failover when the current Patroni leader's Consul session is dropped and recreated to adjust its TTL.
   - Command: `gitlab-patronictl pause --wait`
   - Note: The `pause` and `resume` subcommands emit a clear error message if the cluster is already in the requested state.
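   - Optionally, double-check that the pause flag actually landed in the DCS by querying Consul directly (a sketch; it assumes the Patroni scope is `pg-ha-cluster` and that Patroni keeps its dynamic config under `service/<scope>/config` in the Consul KV store):

     ```sh
     # Read Patroni's dynamic config out of Consul and extract the pause flag.
     curl -s 'http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/config?raw' | jq '.pause'
     # Expected while paused (assumption): true
     ```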
4. Update only the intended DCS keys (`ttl` and `retry_timeout`), leaving the others unaltered (especially the `pause` state flag). Run this on any Patroni cluster member (e.g. `patroni-01`).
   - Command (interactive): `gitlab-patronictl edit-config --set ttl=90 --set retry_timeout=30`
   - The above command will interactively prompt for confirmation after showing the diffs it will apply to the current runtime config.
   - Verify the Patroni leader role does not change.
     - Command: `gitlab-patronictl list`
     - Command: `sudo tail -f /var/log/gitlab/patroni/patroni.log`
   - Verify the Patroni nodes acquire new Consul sessions with the new TTL. Note the TTL stored in Consul is always half the value specified in the Patroni config.
     - Command: `curl -s http://127.0.0.1:8500/v1/session/list | jq -c '.[] | { ID, Name, TTL, Checks }' | grep 'pg-ha-cluster' | sort`
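   - As a concrete pass/fail version of the session check, something like the following collapses the TTLs to their distinct values (a sketch; the expected `45s` follows from the halving rule above with `ttl=90`):

     ```sh
     # Collapse all pg-ha-cluster session TTLs to their distinct values.
     curl -s http://127.0.0.1:8500/v1/session/list \
       | jq -r '.[] | select(.Name | test("pg-ha-cluster")) | .TTL' \
       | sort -u
     # Expected output: a single line reading "45s" (half of ttl=90)
     ```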
5. Merge the `chef-repo` merge request to update the patroni.yml file.
   - Note: Optionally do this sooner, any time after disabling chef-client on all Patroni nodes (step 2).
   - Staging: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1848
   - Production: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1958
6. Wait until the merge request's pipeline finishes, to ensure the role change is definitely available on the Chef Server.
   - Note: If the Chef Server does not yet have this merge request at the start of the next `chef-client` run, that Chef run will revert Patroni to the old settings, which may cause the Patroni leader to lose its cluster_lock and trigger a failover.
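   - Optionally, confirm the Chef Server is already serving the new values before re-enabling chef-client (a sketch; the role name follows the pattern from step 2, and we are assuming the settings surface as `ttl` and `retry_timeout` somewhere in the role JSON):

     ```sh
     # Dump the Patroni role from the Chef Server and look for the new settings.
     knife role show "${TARGET_ENV}-base-db-patroni" -F json | grep -E 'ttl|retry_timeout'
     ```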
7. Run chef-client on any one Patroni node (e.g. `patroni-01`).
   - Note: This will implicitly unpause the Patroni cluster, so we want Patroni to find nothing has changed about the config set by Chef compared to its runtime state.
   - Command: `sudo chef-client`
   - Verify the Patroni leader role does not change.
     - Command: `gitlab-patronictl list`
   - Verify the Patroni config is the new one, not the old one.
     - Command: `gitlab-patronictl show-config`
   - Verify the Patroni cluster is no longer paused.
     - Command: `gitlab-patronictl resume`
     - Note: Expect the `resume` subcommand to fail with: `Error: Cluster is not paused`
   - Verify the systemd unit for Chef client is automatically restarted.
     - Command: `sudo systemctl is-active chef-client.service`
8. Run chef-client on all remaining Patroni nodes.
   - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'sudo chef-client' 2>&1 | tee /tmp/results.out`
- Repeat the step 7 verifications for all Patroni nodes once they complete their chef-client runs.
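   - Since the fan-out output was teed to /tmp/results.out, a quick scan for trouble afterwards can look like this (a sketch; it simply greps for common chef-client failure markers):

     ```sh
     # Look for failed chef-client runs across the fleet.
     grep -inE 'error|fail|fatal' /tmp/results.out || echo "No failure markers found."
     ```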
9. Verify health of Patroni cluster and overall GitLab.com.
- Verify no new alerts have triggered in the #production Slack channel; recall that many services rely indirectly on Postgres.
- Review Grafana dashboards:
- GitLab Triage: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage
- PostgreSQL Overview: https://dashboards.gitlab.net/d/000000144/postgresql-overview
Rollback steps:
- Optional: If chef-client is still disabled on all Patroni nodes, then to try to avoid a Patroni failover, you may start with these steps before running chef-client:
  - Pause Patroni (if not still paused): `gitlab-patronictl pause --wait`
  - On any Patroni node, use `patronictl` to revert the settings we changed to their original values: `gitlab-patronictl edit-config --set ttl=30 --set retry_timeout=10`
- Revert the merge request.
- Run chef-client on all Patroni nodes.
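After rolling back, a final sanity check that the runtime config matches the original values (a sketch, run on any Patroni member):

```sh
# Confirm the cluster-wide dynamic config reverted cleanly.
gitlab-patronictl show-config | grep -E '^(ttl|retry_timeout):'
# Expected: ttl: 30 and retry_timeout: 10
```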