Increase Patroni's patience when talking with Consul
Production Change - Criticality 1

Context and motivation for this change:
This change aims to reduce (but not eliminate) the frequency of unnecessary Patroni failover events, while still allowing Patroni to failover promptly when necessary.
Failover events always cause downtime, typically lasting 1-3 minutes. The writable primary Postgres database is a single-point-of-failure (SPOF). Patroni's automated failover mitigates this SPOF by quickly promoting a replica to become the new writable primary database, but during the failover event, there is always a period of time when dependent applications cannot write to the database.
Most of the Patroni failover events in the last 3 months appear to be caused by short-lived network connectivity disruption involving the current Patroni leader (i.e. the primary database). When that disruption lasts significantly less time than a failover would take, it would cause less impact to our services to avoid the failover. Addressing the network connectivity instability is a separate issue. This change only aims to improve the end-user experience by making Patroni wait a little longer before initiating failover.
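For concreteness, here is a sketch of the before and after values (taken from step 4 and the rollback section below), annotated with the Patroni semantics as we understand them:

```sh
# Current settings (see rollback section below):
#   ttl: 30            # leader lock time-to-live, in seconds; if the leader
#                      # cannot renew its lock within this window, the other
#                      # members consider it dead and elect a new leader
#   retry_timeout: 10  # how long Patroni retries failing DCS/Postgres
#                      # operations before voluntarily demoting itself
#
# New settings (applied in step 4): tolerate connectivity blips of up to
# roughly 90s before any member initiates a failover.
gitlab-patronictl edit-config --set ttl=90 --set retry_timeout=30
```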
Related reading:
- Short on time? This note concisely and concretely summarizes the 3 known flavors of how Patroni failovers are being triggered in our `gprd` environment. It then outlines how each of the proposed remedies can improve the outcome.
- This note describes in more detail the rationale behind this config change, as well as a few additional changes.
- This issue corresponds directly to this configuration change.
Change request:
| Change Objective | Increase Patroni's `ttl` and `retry_timeout` settings so that short-lived Consul connectivity disruptions no longer trigger unnecessary failovers |
|---|---|
| Change Type | C1 |
| Services Impacted | Patroni |
| Change Team Members | @msmiley |
| Change Severity | ~S3. Not applying the change makes it likely that we will continue to have unwanted Patroni failover events. Applying the change could potentially cause a Patroni failover event, which is why early steps in the sequence ask Patroni to forbid failovers during the maintenance window. |
| Buddy check | @ahmadsherif |
| Tested in staging | The change was tested in the staging environment |
| Schedule of the change | 2019-10-10 19:00 UTC (near the start of our production change window) |
| Duration of the change | 30-60 minutes (including a possible rollback) |
| Downtime Component | None expected, but a Patroni failover may occur if something unplanned occurs. |
Detailed steps for the change:
1. Check initial state. Run this on any Patroni cluster member (e.g. `patroni-01`).
   - Command: `gitlab-patronictl list`
   - Command: `gitlab-patronictl show-config`
   - Verify all Patroni cluster members are healthy and running the Patroni agent.
   - Note which Patroni member is the current leader.
   - Note the initial config, for comparison later.
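   - For reference, a purely illustrative sketch of the expected `gitlab-patronictl list` output (the exact layout and columns vary by Patroni version, and the hosts shown here are made up):

     ```sh
     gitlab-patronictl list
     # +---------------+------------+----------+--------+---------+----+-----------+
     # |    Cluster    |   Member   |   Host   |  Role  |  State  | TL | Lag in MB |
     # +---------------+------------+----------+--------+---------+----+-----------+
     # | pg-ha-cluster | patroni-01 | 10.x.x.1 | Leader | running |  5 |           |
     # | pg-ha-cluster | patroni-02 | 10.x.x.2 |        | running |  5 |       0.0 |
     # +---------------+------------+----------+--------+---------+----+-----------+
     # All members should be 'running'; note which member holds the 'Leader' role.
     ```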
2. Disable chef-client on all Patroni cluster members, so it does not race with the dynamic config changes to be applied below.
   - Command (for production only): `export TARGET_ENV="gprd"`
   - Command (for staging only): `export TARGET_ENV="gstg"`
   - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'sudo systemctl stop chef-client.service'`
   - Verify there are no residual `chef-client` processes running:
     - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'pgrep -a -f "chef-client"'`
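   - If you want this check to block until the fleet is clean, a small polling loop along these lines may help (a sketch; the `[c]hef-client` pattern keeps pgrep from matching its own ssh-spawned command line):

     ```sh
     # Poll until no Patroni node reports a lingering chef-client process.
     while knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'pgrep -a -f "[c]hef-client"' \
         | grep -q 'chef-client'; do
       echo "chef-client still running somewhere; retrying in 10s..."
       sleep 10
     done
     ```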
3. Pause Patroni auto-failover. Run this once on any Patroni cluster member (e.g. `patroni-01`).
   - Purpose: This step forbids failover from occurring, instructing all cluster members to continue to treat the current Patroni leader as the leader until further notice. This protects against failover when the current Patroni leader's Consul session is dropped and recreated to adjust its TTL.
   - Command: `gitlab-patronictl pause --wait`
   - Note: The `pause` and `resume` subcommands emit a clear error message if the cluster is already in the requested state.
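   - Optionally, double-check that the pause flag actually landed in the DCS by querying Consul directly (a sketch; it assumes the Patroni scope is `pg-ha-cluster` and that Patroni keeps its dynamic config under `service/<scope>/config` in the Consul KV store):

     ```sh
     # Read Patroni's dynamic config out of Consul and extract the pause flag.
     curl -s 'http://127.0.0.1:8500/v1/kv/service/pg-ha-cluster/config?raw' | jq '.pause'
     # Expected while paused (assumption): true
     ```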
4. Update only the intended DCS keys (`ttl` and `retry_timeout`), leaving the others unaltered (especially the `pause` state flag). Run this on any Patroni cluster member (e.g. `patroni-01`).
   - Command (interactive): `gitlab-patronictl edit-config --set ttl=90 --set retry_timeout=30`
   - The above command will interactively prompt for confirmation after showing the diffs it will apply to the current runtime config.
   - Verify the Patroni leader role does not change.
     - Command: `gitlab-patronictl list`
     - Command: `sudo tail -f /var/log/gitlab/patroni/patroni.log`
   - Verify the Patroni nodes acquire new Consul sessions with the new TTL. Note the TTL stored in Consul is always half the value specified in the Patroni config.
     - Command: `curl -s http://127.0.0.1:8500/v1/session/list | jq -c '.[] | { ID, Name, TTL, Checks }' | grep 'pg-ha-cluster' | sort`
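   - As a concrete pass/fail version of the session check, something like the following collapses the TTLs to their distinct values (a sketch; the expected `45s` follows from the halving rule above with `ttl=90`):

     ```sh
     # Collapse all pg-ha-cluster session TTLs to their distinct values.
     curl -s http://127.0.0.1:8500/v1/session/list \
       | jq -r '.[] | select(.Name | test("pg-ha-cluster")) | .TTL' \
       | sort -u
     # Expected output: a single line reading "45s" (half of ttl=90)
     ```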
5. Merge the `chef-repo` merge request to update the patroni.yml file.
   - Note: Optionally do this sooner, any time after disabling chef-client on all Patroni nodes (step 2).
   - Staging: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1848
   - Production: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1958
6. Wait until the merge request's pipeline finishes, to ensure the role change is definitely available on the Chef Server.
   - Note: If the Chef Server does not yet have this merge request at the start of the next `chef-client` run, that Chef run will revert Patroni to the old settings, which may cause the Patroni leader to lose its cluster_lock and trigger a failover.
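   - Optionally, confirm the Chef Server is already serving the new values before re-enabling chef-client (a sketch; the role name follows the pattern from step 2, and we are assuming the settings surface as `ttl` and `retry_timeout` somewhere in the role JSON):

     ```sh
     # Dump the Patroni role from the Chef Server and look for the new settings.
     knife role show "${TARGET_ENV}-base-db-patroni" -F json | grep -E 'ttl|retry_timeout'
     ```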
7. Run chef-client on any one Patroni node (e.g. `patroni-01`).
   - Note: This will implicitly unpause the Patroni cluster, so we want Patroni to find nothing has changed about the config set by Chef compared to its runtime state.
   - Command: `sudo chef-client`
   - Verify the Patroni leader role does not change.
     - Command: `gitlab-patronictl list`
   - Verify the Patroni config is the new one, not the old one.
     - Command: `gitlab-patronictl show-config`
   - Verify the Patroni cluster is no longer paused.
     - Command: `gitlab-patronictl resume`
     - Note: Expect the `resume` subcommand to fail with: `Error: Cluster is not paused`
   - Verify the systemd unit for Chef client is automatically restarted.
     - Command: `sudo systemctl is-active chef-client.service`
8. Run chef-client on all remaining Patroni nodes.
   - Command: `knife ssh "roles:${TARGET_ENV}-base-db-patroni" 'sudo chef-client' 2>&1 | tee /tmp/results.out`
- Repeat the step 7 verifications for all Patroni nodes once they complete their chef-client runs.
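   - Since the fan-out output was teed to /tmp/results.out, a quick scan for trouble afterwards can look like this (a sketch; it simply greps for common chef-client failure markers):

     ```sh
     # Look for failed chef-client runs across the fleet.
     grep -inE 'error|fail|fatal' /tmp/results.out || echo "No failure markers found."
     ```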
9. Verify health of Patroni cluster and overall GitLab.com.
- Verify no new alerts have triggered in the #production Slack channel; recall that many services rely indirectly on Postgres.
- Review Grafana dashboards:
- GitLab Triage: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage
- PostgreSQL Overview: https://dashboards.gitlab.net/d/000000144/postgresql-overview
Rollback steps:
- Optional: If chef-client is still disabled on all Patroni nodes, then to try to avoid a Patroni failover, you may start with these steps before running chef-client:
  - Pause Patroni (if not still paused): `gitlab-patronictl pause --wait`
  - On any Patroni node, use `patronictl` to revert the settings we changed to their original values: `gitlab-patronictl edit-config --set ttl=30 --set retry_timeout=10`
- Revert the merge request.
- Run chef-client on all Patroni nodes.
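After rolling back, a final sanity check that the runtime config matches the original values (a sketch, run on any Patroni member):

```sh
# Confirm the cluster-wide dynamic config reverted cleanly.
gitlab-patronictl show-config | grep -E '^(ttl|retry_timeout):'
# Expected: ttl: 30 and retry_timeout: 10
```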