[GSTG] Remove race condition by archiving WAL only on a single host per cluster
Production Change
Change Summary
Our goal is to set up the main cluster with `archive_mode = on`, the CI cluster with `archive_mode = on`, and the replica that produces GCS disk snapshots with `archive_mode = always`.

Currently, all hosts in our PostgreSQL clusters are configured to archive WAL files. For each WAL file only one host can successfully upload it; every other host that attempts the upload wastes resources such as bandwidth and CPU. Furthermore, we found a race condition that can lead to problems. With `archive_mode = on`, PostgreSQL runs the `archive_command` only on the primary, so each cluster archives every WAL file exactly once; `archive_mode = always` additionally makes standbys archive, which is retained only for the GCS snapshot replica.
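A minimal check of the current state on a node might look like the following sketch; the `gitlab-psql` and `gitlab-patronictl` wrappers are assumptions based on the usual Omnibus/Patroni setup and may differ on these hosts:

```shell
# Current archiving configuration on this node (sketch; wrapper names are assumptions)
sudo gitlab-psql -c "SHOW archive_mode;"
sudo gitlab-psql -c "SHOW archive_command;"

# Which member of the cluster is currently the primary
sudo gitlab-patronictl list
```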
Further information
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15531
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15546
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15505#note_886316278
Change Details
- Services Impacted - Service::Postgres, Service::Patroni, Database
- Change Technician - @Finotto
- Change Reviewer - @alexander-sosna
- Time tracking - ~120 minutes
- Downtime Component - No downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- [ ] Open relevant dashboards to monitor impact
- [ ] Set label change::in-progress on this issue
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60-90 minutes
- [ ] Stop the Chef client on all nodes in the CI cluster
- [ ] Merge the Chef MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1596/diffs#108e238dcc70017e4dcd9d6dc630b9cbcff52495_0_17
- [ ] Merge the Terraform MR https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3617
- [ ] Execute Terraform on [GSTG] `module.patroni-ci`
- [ ] For each node in the CI cluster (see the command sketch after this list):
  - [ ] Take the node out of the LB rotation by setting the `nofailover` and `noloadbalance` tags
  - [ ] Verify that the node no longer receives user / application load
  - [ ] Run the Chef client locally
  - [ ] Verify that the config was changed successfully
  - [ ] Execute a checkpoint to reduce restart time
  - [ ] Restart the node
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if those tags were not present before
  - [ ] Verify that the node processes user / application load successfully, if it should
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that WAL archiving behaves normally and no pile-up happens
- [ ] Verify that the Chef client is enabled on all hosts
- [ ] Set label change::complete on this issue
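The per-node loop above roughly corresponds to the following sequence. This is a sketch, not the authoritative runbook: the wrapper names (`gitlab-psql`, `gitlab-patronictl`), the `patroni` service name, and the cluster/member placeholders are assumptions and should be adjusted to the node's actual tooling.

```shell
# 1. Take the node out of LB rotation: set the nofailover and noloadbalance tags in
#    this node's Patroni configuration (Chef-managed), then reload Patroni so the
#    tags take effect.
sudo systemctl reload patroni

# 2. Once dashboards confirm the node no longer receives user / application load,
#    apply the new Chef-managed configuration:
sudo chef-client

# 3. Verify the rendered setting and run an explicit checkpoint to shorten the restart:
sudo gitlab-psql -c "SELECT name, setting, pending_restart FROM pg_settings WHERE name = 'archive_mode';"
sudo gitlab-psql -c "CHECKPOINT;"

# 4. Restart the member through Patroni and check that the new value is live:
sudo gitlab-patronictl restart <cluster-name> <member-name>
sudo gitlab-psql -c "SHOW archive_mode;"
sudo gitlab-patronictl list

# 5. Remove the nofailover / noloadbalance tags again (if they were not set before)
#    and reload Patroni to put the node back into rotation.
sudo systemctl reload patroni
```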
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30 minutes
- [ ] Verify that the whole cluster behaves normally for 30 minutes
- [ ] Verify that WAL archiving behaves normally and no pile-up happens for 30 minutes (see the check sketch below)
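One way to check that archiving is healthy, sketched below; the `gitlab-psql` wrapper and the data-directory path are assumptions and may need adjusting for these hosts:

```shell
# Archiver statistics: archived_count should keep increasing while failed_count stays flat.
sudo gitlab-psql -c "SELECT archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal FROM pg_stat_archiver;"

# WAL segments still waiting to be archived (.ready status files); this count should
# stay near zero. The same information is exported as pg_archiver_pending_wal_count
# (see the Monitoring section below). The data directory path is an assumption.
sudo ls /var/opt/gitlab/postgresql/data/pg_wal/archive_status/ | grep -c '\.ready$' || true
```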
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1596/diffs#108e238dcc70017e4dcd9d6dc630b9cbcff52495_0_17
- [ ] Revert https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3617
- [ ] Execute Terraform on [GSTG] `module.patroni-ci` (see the Terraform sketch after this list)
- [ ] For each standby node in the CI cluster (the per-node command sketch after the change steps applies here as well):
  - [ ] Take the node out of the LB rotation by setting the `nofailover` and `noloadbalance` tags
  - [ ] Verify that the node no longer receives user / application load
  - [ ] Run the Chef client locally
  - [ ] Verify that the config was changed successfully
  - [ ] Execute a checkpoint to reduce restart time
  - [ ] Restart the node
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if those tags were not present before
  - [ ] Verify that the node processes user / application load successfully, if it should
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that WAL archiving behaves normally and no pile-up happens
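The Terraform step can be scoped to the patroni-ci module with a targeted plan and apply, as sketched below. This assumes running the terraform CLI from the GSTG environment directory of the config-mgmt repository; the actual workflow may go through the repository's CI tooling instead.

```shell
# Targeted Terraform run for the patroni-ci module only -- a sketch; the working
# directory and invocation depend on how config-mgmt is normally applied for GSTG.
terraform plan -target=module.patroni-ci    # review: only patroni-ci resources should change
terraform apply -target=module.patroni-ci
```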
Monitoring
Key metrics to observe
- Metric: patroni-ci Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - What changes to this metric should prompt a rollback: the error rate going up
- Metric: patroni-ci Service RPS / transactions_replica RPS - per fqdn
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - What changes to this metric should prompt a rollback: RPS of the whole cluster, or of hosts that are already done, dropping drastically
- Metric: pg_archiver_pending_wal_count
  - Location: `pg_archiver_pending_wal_count{environment="gstg", type="patroni-ci"}`
  - What changes to this metric should prompt a rollback: `pg_archiver_pending_wal_count` should not grow monotonically; sustained growth means WAL files are piling up instead of being archived and should prompt a rollback
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
- Summary of the above
Change Reviewer checklist
- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. change::unscheduled, change::scheduled) based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.