[GPRD] Remove race condition by archiving WAL on only a single host per cluster
Production Change

## Change Summary

Currently, all hosts in our PostgreSQL clusters are configured to archive WAL files. For each WAL file, only one host can successfully upload it; all other hosts trying to upload the same file waste resources such as bandwidth and CPU. Furthermore, we found a race condition leading to potential problems.

Our goal is to set up all nodes on the main cluster with `archive_mode = on`, so that only the master node archives WAL. The CI cluster has no master, so one node needs to be configured with `archive_mode = always`; we choose the node that is used for GCS disk snapshots.

We are not restarting the primary node.
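For context on the two settings: with `archive_mode = on` a node archives WAL only while it is the primary, while `archive_mode = always` makes it archive even while running as a replica. A minimal sketch for confirming the effective setting on a node, using the `gitlab-psql` wrapper and the config path that appear in the steps below:

```shell
# Ask the running server for its effective setting
gitlab-psql -c 'SHOW archive_mode;'

# Cross-check against the rendered configuration file
grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf
```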
### Further information
- #6742 (closed)
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15531
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15546
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15505#note_886316278
## Change Details
- Services Impacted - ~"Service::Postgres" ~"Service::Patroni" ~Database
- Change Technician - @Finotto
- Change Reviewer - @alexander-sosna
- Time tracking - ~120 minutes
- Downtime Component - No downtime
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- [ ] Open relevant dashboards to monitor impact
- [ ] Set label ~"change::in-progress" on this issue
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - ~90-120 minutes
- [ ] Merge the Chef MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1618
- [ ] Merge the TF MR https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3626
- [ ] Execute a `git pull` of the config-mgmt repository to update the local copy
- [ ] Choose the TF environment: `cd environments/gprd/`
- [ ] Initialize TF: `../../bin/tf init --upgrade` (a plan sketch follows below)
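Before the targeted apply in the next section, it is worth reviewing the pending diff first. A minimal sketch, assuming the `../../bin/tf` wrapper forwards its arguments to `terraform`:

```shell
# Review the pending changes for the CI cluster module before applying;
# the apply should then only touch the resources shown here
../../bin/tf plan -target module.patroni-ci
```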
#### CI cluster - `gprd-patroni-ci`
- [ ] Get cluster state: `gitlab-patronictl topology`
- [ ] Execute TF on [GPRD] `module.patroni-ci` with `../../bin/tf apply -target module.patroni-ci`
- [ ] Do for each node in the cluster (a consolidated sketch of this loop follows the list):
  - [ ] Verify that the config was changed successfully: `grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf`
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-ci <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens
- [ ] Verify that the Chef client is enabled on all hosts
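Taken together, the per-node loop above corresponds roughly to the following sequence. This is a sketch only: the mechanism for setting and removing the `nofailover`/`noloadbalance` tags is environment-specific, so those steps are left as comments.

```shell
# Per-node sequence for gprd-patroni-ci (run on each node in turn)

# Confirm Chef already rendered the new setting
grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf

sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"

# Set the nofailover / noloadbalance tags here, then confirm the node
# no longer serves active application load:
gitlab-psql -c "SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';"

# Force a checkpoint so the restart is fast, then restart via Patroni
gitlab-psql -c 'CHECKPOINT;'
gitlab-patronictl restart gprd-patroni-ci <host-name>

# After verifying the restart: remove the tags again (if they were not
# set before) and re-enable Chef
sudo chef-client-enable
```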
#### Main cluster - `pg12-patroni-cluster`
- [ ] Get cluster state: `gitlab-patronictl topology`
- [ ] Do for each node in the cluster, except the master!
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Verify that the config was changed successfully
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-cluster <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens (one way to check this is sketched below)
- [ ] Verify that the Chef client is enabled on all hosts
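After the main cluster is done, only the master should be archiving. A minimal sketch for sanity-checking this per node, using the standard `pg_stat_archiver` view:

```shell
# Replicas should report archive_mode = on (archiving is inactive while
# in recovery); only the master should show the archiver making progress.
gitlab-psql -c 'SHOW archive_mode;'
gitlab-psql -c 'SELECT archived_count, last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;'
```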
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30 minutes
- [ ] Verify that both clusters behave normally for 30 minutes
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens for 30 minutes (see the sketch below)
- [ ] Set label ~"change::complete" on this issue
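A pile-up shows up on a node as a growing number of `.ready` files in the data directory's `archive_status` folder. A quick on-host check (a sketch, using the data directory path from the steps above):

```shell
# Count WAL segments still waiting to be archived on this node;
# the number should stay near zero and must not grow monotonically.
ls /var/opt/gitlab/postgresql/data12/pg_wal/archive_status/ | grep -c '\.ready$'
```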
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Revert the Chef MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1618
- [ ] Revert the TF MR https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3626
- [ ] Execute a `git pull` of the config-mgmt repository to update the local copy
- [ ] Choose the TF environment: `cd environments/gprd/`
- [ ] Initialize TF: `../../bin/tf init --upgrade`
- [ ] Execute TF on [GPRD] `module.patroni-ci`
- [ ] Do for each node in the cluster:
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Verify that the config was changed successfully
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-ci <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT COUNT(*) FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens
- [ ] Verify that the Chef client is enabled on all hosts
## Monitoring
### Key metrics to observe
#### CI cluster - `gprd-patroni-ci`
- Metric: patroni-ci Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: the error rate going up
- Metric: patroni-ci Service RPS / transactions_replica RPS - per fqdn
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: RPS of the whole cluster, or of already-finished hosts, dropping drastically
- Metric: pg_archiver_pending_wal_count
  - Location: `pg_archiver_pending_wal_count{environment="gprd", type="patroni-ci"}`
  - What changes to this metric should prompt a rollback: `pg_archiver_pending_wal_count` must not grow monotonically
#### Main cluster - `pg12-patroni-cluster`
- Metric: patroni Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: the error rate going up
- Metric: patroni Service RPS / transactions_replica RPS - per fqdn
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: RPS of the whole cluster, or of already-finished hosts, dropping drastically
- Metric: pg_archiver_pending_wal_count
  - Location: `pg_archiver_pending_wal_count{environment="gprd", type="patroni"}`
  - What changes to this metric should prompt a rollback: `pg_archiver_pending_wal_count` must not grow monotonically (a query sketch follows below)
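To watch the pending-WAL metric outside of Grafana, the same selector can be queried directly via the Prometheus HTTP API. A sketch only: `PROM_URL` is a hypothetical placeholder for the environment's Prometheus endpoint.

```shell
# Query the pending WAL count for both clusters at once
PROM_URL="https://prometheus.example.internal"  # placeholder, not a real endpoint
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=pg_archiver_pending_wal_count{environment="gprd", type=~"patroni|patroni-ci"}'
```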
## Summary of infrastructure changes
- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?
Summary of the above
## Change Reviewer checklist
- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
## Change Technician checklist
- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.