WIP: Improve safety and efficiency of Patroni post-failover maintenance task
Production Change
Change Summary
As a follow-up to #3021 (closed), this change request switches our custom Patroni post-failover hook to run VACUUM
instead of ANALYZE
. This is safer and cheaper.
Both ANALYZE
and VACUUM
will estimate a table's live/dead tuple counts and notify the "stats collector" process. They use different code paths but comparable logic for computing those count estimates. In the context of a Patroni failover, the new primary db already has valid optimizer statistics (i.e. column-level distributions and table-level live row counts), so it's not necessary to run ANALYZE
(and doing so with non-standard sampling sizes can be hazardous, as explained in #2937 (comment 445229140)). The only statistics we want to update after a failover are the live/dead tuple count estimates presented by "stats collector" via the pg_stat_user_tables
view. VACUUM
does that without touching the column-level statistics, so it is much less likely to disrupt the query performance by altering the optimizer's decisions about execution plans.
Change Details
- Services Impacted - Patroni, Postgres
- Change Technician - @msmiley
- Change Criticality - C2
- Change Type - changeunscheduled
- Change Reviewer - DRI for the review of this change
- Due Date - Date and time (in UTC) for the execution of the change
- Time tracking - 60 minutes
- Downtime Component - none
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 2 minutes
-
Verify the Chef role being updated ( gprd-base-db-patroni
) is assigned to all Patroni cluster members
# On any cluster member, list all cluster members.
$ sudo gitlab-patronictl list
# On chef workstation, list all nodes to which the role is assigned.
$ knife search -i 'roles:gprd-base-db-patroni' | sort -V
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 50 minutes
-
Verify the cookbook MR has already been merged and applied to the staging environment. -
Merge and apply the 1st chef-repo MR to bump chef's pinned cookbook version for the production environment. TODO: Add link to the version bump MR. -
Merge and apply the 2nd chef-repo MR to update Patroni to run the new helper script after failover events. (The old script is still present, so rollback only would only require reverting this MR, not the other one.) -
Serially run chef-client on the 8 Patroni hosts. - Since this change updates the Patroni configuration (
/var/opt/gitlab/patroni/patroni.yml
), Chef will notify systemd to reload (but not restart) the Patroni service. - This should be non-disruptive, because the setting we are changing does not require Patroni to release its leader lock or restart Postgres.
- Out of caution, we may opt to temporarily disable chef-client on the 8 Patroni hosts and serially apply this change to all non-primary hosts before finally applying it to the primary host.
- Since this change updates the Patroni configuration (
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 5 minutes
-
Verify new script exists on all Patroni hosts.
$ knife ssh 'roles:gprd-base-db-patroni' 'ls -l /var/opt/gitlab/patroni/scripts/post-failover-maintenance.sh'
-
Verify the patroni.yml config file defines the new script as its on_role_change
callback action.
$ knife ssh 'roles:gprd-base-db-patroni' 'sudo grep "on_role_change" /var/opt/gitlab/patroni/patroni.yml'
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 60 minutes (less if we run chef-client concurrently)
-
Revert and apply just the 2nd chef-repo MR, which updated Patroni to run the new helper script after failover events. (The old script is still present, so rollback only would only require reverting this MR, not the other one.) -
Run chef-client on each Patroni node, to revert patroni.yml
to refer to the old failover script.
Monitoring
Key metrics to observe
Expect no change in the operational characteristics of Patroni or Postgres. Throughput, response time, availability, etc. should not be affected, as this change is only altering how Patroni handles a failure event, not its normal healthy operating behavior.
- Metric: All SLI metrics on the Patroni service dashboard should remain stable throughout this change window.
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1
- If any SLI metric (latency, error rate, throughput) significantly changes, stop applying the change and investigate. This config change is very unlikely to cause a service disruption, so if suspicious symptoms present, stop applying the change and take a moment to evaluate the health of an affected Patroni node before deciding whether or not to rollback.
Summary of infrastructure changes
-
Does this change introduce new compute instances? No. -
Does this change re-size any existing compute instances? No. -
Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc? No.
Changes checklist
-
This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities. -
This issue has the change technician as the assignee. -
Pre-Change, Change, Post-Change, and Rollback steps and have been filled out and reviewed. -
Necessary approvals have been completed based on the Change Management Workflow. -
Change has been tested in staging and resultes noted in a comment on this issue. -
A dry-run has been conducted and results noted in a comment on this issue. -
SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall
and this issue.) -
There are currently no active incidents.