[GPRD] Remove race condition by archiving WAL on only a single host per cluster
Production Change

## Change Summary

Currently, all hosts in our PostgreSQL clusters are configured to archive WAL files. For each WAL file, only one host can successfully upload it; all other hosts trying to upload the same file waste resources such as bandwidth and CPU. Furthermore, we found a race condition leading to potential problems.

Our goal is to set up all nodes on the main cluster with `archive_mode = on`, so that only the master node archives WAL. The CI cluster has no master, so one node needs to be configured with `archive_mode = always`; we choose the node that is used for GCS disk snapshots.

We are not restarting the primary node.
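For context on the two settings: with `archive_mode = on` a node archives WAL only while it is the primary, while `archive_mode = always` makes it archive even while running as a replica. A minimal sketch for confirming the effective setting on a node, using the `gitlab-psql` wrapper and the config path that appear in the steps below:

```shell
# Ask the running server for its effective setting
gitlab-psql -c 'SHOW archive_mode;'

# Cross-check against the rendered configuration file
grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf
```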
### Further information
- #6742 (closed)
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15531
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15546
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15505#note_886316278
## Change Details
- Services Impacted - ~"Service::Postgres" ~"Service::Patroni" ~Database
- Change Technician - @Finotto
- Change Reviewer - @alexander-sosna
- Time tracking - ~120 minutes
- Downtime Component - No downtime
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - 5 minutes
- [ ] Open relevant dashboards to monitor impact
- [ ] Set label ~"change::in-progress" on this issue
### Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - ~90-120 minutes
- [ ] Merge the Chef MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1618
- [ ] Merge the TF MR https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3626
- [ ] Execute a `git pull` of the config-mgmt repository to update the local copy
- [ ] Choose the TF environment: `cd environments/gprd/`
- [ ] Initialize TF: `../../bin/tf init --upgrade` (a plan sketch follows below)
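Before the targeted apply in the next section, it is worth reviewing the pending diff first. A minimal sketch, assuming the `../../bin/tf` wrapper forwards its arguments to `terraform`:

```shell
# Review the pending changes for the CI cluster module before applying;
# the apply should then only touch the resources shown here
../../bin/tf plan -target module.patroni-ci
```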
#### CI cluster - `gprd-patroni-ci`
- [ ] Get cluster state: `gitlab-patronictl topology`
- [ ] Execute TF on [GPRD] `module.patroni-ci` with `../../bin/tf apply -target module.patroni-ci`
- [ ] Do for each node in the cluster (a consolidated sketch of this loop follows the list):
  - [ ] Verify that the config was changed successfully: `grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf`
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-ci <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens
- [ ] Verify that the Chef client is enabled on all hosts
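Taken together, the per-node loop above corresponds roughly to the following sequence. This is a sketch only: the mechanism for setting and removing the `nofailover`/`noloadbalance` tags is environment-specific, so those steps are left as comments.

```shell
# Per-node sequence for gprd-patroni-ci (run on each node in turn)

# Confirm Chef already rendered the new setting
grep 'archive_mode' /var/opt/gitlab/postgresql/data12/postgresql.conf

sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"

# Set the nofailover / noloadbalance tags here, then confirm the node
# no longer serves active application load:
gitlab-psql -c "SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';"

# Force a checkpoint so the restart is fast, then restart via Patroni
gitlab-psql -c 'CHECKPOINT;'
gitlab-patronictl restart gprd-patroni-ci <host-name>

# After verifying the restart: remove the tags again (if they were not
# set before) and re-enable Chef
sudo chef-client-enable
```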
#### Main cluster - `pg12-patroni-cluster`
- [ ] Get cluster state: `gitlab-patronictl topology`
- [ ] Do for each node in the cluster, except the master!
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Verify that the config was changed successfully
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-cluster <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab';` or `SELECT * FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens (one way to check this is sketched below)
- [ ] Verify that the Chef client is enabled on all hosts
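After the main cluster is done, only the master should be archiving. A minimal sketch for sanity-checking this per node, using the standard `pg_stat_archiver` view:

```shell
# Replicas should report archive_mode = on (archiving is inactive while
# in recovery); only the master should show the archiver making progress.
gitlab-psql -c 'SHOW archive_mode;'
gitlab-psql -c 'SELECT archived_count, last_archived_wal, last_archived_time, failed_count FROM pg_stat_archiver;'
```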
### Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 30 minutes
- [ ] Verify that both clusters behave normally for 30 minutes
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens for 30 minutes (see the sketch below)
- [ ] Set label ~"change::complete" on this issue
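A pile-up shows up on a node as a growing number of `.ready` files in the data directory's `archive_status` folder. A quick on-host check (a sketch, using the data directory path from the steps above):

```shell
# Count WAL segments still waiting to be archived on this node;
# the number should stay near zero and must not grow monotonically.
ls /var/opt/gitlab/postgresql/data12/pg_wal/archive_status/ | grep -c '\.ready$'
```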
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes
- [ ] Revert the Chef MR https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1618
- [ ] Revert the TF MR https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3626
- [ ] Execute a `git pull` of the config-mgmt repository to update the local copy
- [ ] Choose the TF environment: `cd environments/gprd/`
- [ ] Initialize TF: `../../bin/tf init --upgrade`
- [ ] Execute TF on [GPRD] `module.patroni-ci`
- [ ] Do for each node in the cluster:
  - [ ] Disable the Chef client with `sudo chef-client-disable "Database maintenance issue 6754, https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6754"` and verify it is disabled
  - [ ] Verify that the config was changed successfully
  - [ ] Take the node out of the LB rotation by setting `nofailover` and `noloadbalance`
  - [ ] Verify that the node is no longer experiencing user/application load; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';`
  - [ ] Execute a checkpoint to optimize restart time: `postgres=# CHECKPOINT;` or `gitlab-psql -c 'CHECKPOINT;'`
  - [ ] Restart the node to load the changes requiring a restart: `gitlab-patronictl restart gprd-patroni-ci <host-name>`
  - [ ] Verify that the restart was successful and the new settings are active
  - [ ] Add the node back to the LB by removing `nofailover` and `noloadbalance`, if the tags were not present before
  - [ ] Verify that the node processes user/application load successfully, if it should; you can use `SELECT COUNT(*) FROM pg_stat_activity WHERE usename = 'gitlab' AND state != 'idle';` or `SELECT COUNT(*) FROM pg_stat_activity;`
  - [ ] Enable the Chef client with `sudo chef-client-enable`
- [ ] Verify that the whole cluster behaves normally
- [ ] Verify that the WAL archiving behaves normally and no pile-up happens
- [ ] Verify that the Chef client is enabled on all hosts
## Monitoring
### Key metrics to observe
#### CI cluster - `gprd-patroni-ci`
- Metric: patroni-ci Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: the error rate going up
- Metric: patroni-ci Service RPS / transactions_replica RPS - per fqdn
  - Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: RPS of the whole cluster, or of already-finished hosts, dropping drastically
- Metric: pg_archiver_pending_wal_count
  - Location: `pg_archiver_pending_wal_count{environment="gprd", type="patroni-ci"}`
  - What changes to this metric should prompt a rollback: `pg_archiver_pending_wal_count` must not grow monotonically
#### Main cluster - `pg12-patroni-cluster`
- Metric: patroni Service Error Ratio
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: the error rate going up
- Metric: patroni Service RPS / transactions_replica RPS - per fqdn
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - What changes to this metric should prompt a rollback: RPS of the whole cluster, or of already-finished hosts, dropping drastically
- Metric: pg_archiver_pending_wal_count
  - Location: `pg_archiver_pending_wal_count{environment="gprd", type="patroni"}`
  - What changes to this metric should prompt a rollback: `pg_archiver_pending_wal_count` must not grow monotonically (a query sketch follows below)
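To watch the pending-WAL metric outside of Grafana, the same selector can be queried directly via the Prometheus HTTP API. A sketch only: `PROM_URL` is a hypothetical placeholder for the environment's Prometheus endpoint.

```shell
# Query the pending WAL count for both clusters at once
PROM_URL="https://prometheus.example.internal"  # placeholder, not a real endpoint
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=pg_archiver_pending_wal_count{environment="gprd", type=~"patroni|patroni-ci"}'
```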
## Summary of infrastructure changes
- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?
Summary of the above
## Change Reviewer checklist
- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
## Change Technician checklist
- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.