WIP: Improve safety and efficiency of Patroni post-failover maintenance task

Production Change

Change Summary

As a follow-up to #3021 (closed), this change request switches our custom Patroni post-failover hook to run VACUUM instead of ANALYZE. This is safer and cheaper.

Both ANALYZE and VACUUM will estimate a table's live/dead tuple counts and notify the "stats collector" process. They use different code paths but comparable logic for computing those count estimates. In the context of a Patroni failover, the new primary db already has valid optimizer statistics (i.e. column-level distributions and table-level live row counts), so it's not necessary to run ANALYZE (and doing so with non-standard sampling sizes can be hazardous, as explained in #2937 (comment 445229140)). The only statistics we want to update after a failover are the live/dead tuple count estimates presented by "stats collector" via the pg_stat_user_tables view. VACUUM does that without touching the column-level statistics, so it is much less likely to disrupt the query performance by altering the optimizer's decisions about execution plans.

Change Details

Services Impacted - Patroni, Postgres
Change Technician - @msmiley
Change Criticality - C2
Change Type - changeunscheduled
Change Reviewer - DRI for the review of this change
Due Date - Date and time (in UTC) for the execution of the change
Time tracking - 60 minutes
Downtime Component - none

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 2 minutes

Verify the Chef role being updated (gprd-base-db-patroni) is assigned to all Patroni cluster members

# On any cluster member, list all cluster members.
$ sudo gitlab-patronictl list

# On chef workstation, list all nodes to which the role is assigned.
$ knife search -i 'roles:gprd-base-db-patroni' | sort -V

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 50 minutes

Verify the cookbook MR has already been merged and applied to the staging environment.
Merge and apply the 1st chef-repo MR to bump chef's pinned cookbook version for the production environment. TODO: Add link to the version bump MR.
Merge and apply the 2nd chef-repo MR to update Patroni to run the new helper script after failover events. (The old script is still present, so rollback only would only require reverting this MR, not the other one.)
Serially run chef-client on the 8 Patroni hosts.
- Since this change updates the Patroni configuration (/var/opt/gitlab/patroni/patroni.yml), Chef will notify systemd to reload (but not restart) the Patroni service.
- This should be non-disruptive, because the setting we are changing does not require Patroni to release its leader lock or restart Postgres.
- Out of caution, we may opt to temporarily disable chef-client on the 8 Patroni hosts and serially apply this change to all non-primary hosts before finally applying it to the primary host.

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 5 minutes

Verify new script exists on all Patroni hosts.

$ knife ssh 'roles:gprd-base-db-patroni' 'ls -l /var/opt/gitlab/patroni/scripts/post-failover-maintenance.sh'

Verify the patroni.yml config file defines the new script as its on_role_change callback action.

$ knife ssh 'roles:gprd-base-db-patroni' 'sudo grep "on_role_change" /var/opt/gitlab/patroni/patroni.yml'

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 60 minutes (less if we run chef-client concurrently)

Revert and apply just the 2nd chef-repo MR, which updated Patroni to run the new helper script after failover events. (The old script is still present, so rollback only would only require reverting this MR, not the other one.)
Run chef-client on each Patroni node, to revert patroni.yml to refer to the old failover script.

Monitoring

Key metrics to observe

Expect no change in the operational characteristics of Patroni or Postgres. Throughput, response time, availability, etc. should not be affected, as this change is only altering how Patroni handles a failure event, not its normal healthy operating behavior.

Metric: All SLI metrics on the Patroni service dashboard should remain stable throughout this change window.
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1
- If any SLI metric (latency, error rate, throughput) significantly changes, stop applying the change and investigate. This config change is very unlikely to cause a service disruption, so if suspicious symptoms present, stop applying the change and take a moment to evaluate the health of an affected Patroni node before deciding whether or not to rollback.

Summary of infrastructure changes

Does this change introduce new compute instances? No.
Does this change re-size any existing compute instances? No.
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No.

Changes checklist

This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
This issue has the change technician as the assignee.
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed.
Necessary approvals have been completed based on the Change Management Workflow.
Change has been tested in staging and resultes noted in a comment on this issue.
A dry-run has been conducted and results noted in a comment on this issue.
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue.)
There are currently no active incidents.

Edited Nov 17, 2020 by Matt Smiley