# GPRD: wal-g update to v1.1

**Production Change**

## Change Summary

Update wal-g on production instances to v1.1.

Reference: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13945
## Change Details

- **Services Impacted** - ServicePostgres, ServicePatroni
- **Change Technician** - @rehab
- **Change Reviewer** - @Finotto
- **Time tracking** - 120 minutes
- **Downtime Component** - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 minutes

- [ ] Set the ~"change::in-progress" label on this issue.
- [ ] Get the necessary approval for https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Run `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-is-enabled"` and paste the output in the comments section.
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20 minutes

- [ ] Disable chef-client: `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-disable"`.
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Run `chef-client` on one replica Patroni instance and ensure it completes successfully.
- [ ] Ensure `/opt/wal-g/bin/wal-g --version` reports version `v1.1`.
- [ ] Follow all three verification sections under postgresql-backups-wale-walg.md.
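If the version check above needs to be scripted across the fleet, it can be sketched as a small shell helper. This is a sketch only: the exact `wal-g --version` output format varies by build, so we merely assume the version tag (e.g. `v1.1`) appears somewhere in the output, and `assert_walg_version` is our illustrative name, not a wal-g or chef-repo command.

```shell
#!/bin/sh
# Hypothetical helper (not part of wal-g): check that the output of
# `wal-g --version` contains the expected version tag.
assert_walg_version() {
  output="$1"
  expected="$2"
  case "$output" in
    *"$expected"*) echo "OK: wal-g is $expected" ;;
    *) echo "MISMATCH: expected $expected, got: $output" >&2; return 1 ;;
  esac
}

# On a patroni/wal-g node this would be fed from the real binary, e.g.:
# assert_walg_version "$(sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env \
#     /opt/wal-g/bin/wal-g --version)" "v1.1"
```

A non-zero exit status makes the check easy to wire into a `knife ssh` sweep or a CI job.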
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - indeterminate (as long as the backup process takes)

- [ ] Check that the WAL-G binary works and shows the expected version: `cd /tmp`, then run `sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g --version`.
- [ ] Check the logs with `sudo tail /var/log/wal-g/wal-g.log`, and on all replicas (or the one where the backup is running) `sudo tail /var/log/wal-g/wal-g_backup_push.log.1`.
- [ ] Once a new full backup has been created, list the available full backups: `sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g backup-list`.
- [ ] Enable chef-client: `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-enable"`.
- [ ] After 1-2 days, check that the verification jobs (the "gitlab-restore" project) are not failing: Runbook.
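To automate the `backup-list` check above, a minimal awk filter can pull out the newest entry. This assumes `wal-g backup-list` prints a header row followed by one backup per line with the backup name in the first column (the shape of its tabular output); `latest_backup` is an illustrative name, not a wal-g command.

```shell
#!/bin/sh
# Hypothetical filter: print the name of the last backup row from
# `wal-g backup-list` output (header line skipped), failing if none exist.
latest_backup() {
  awk 'NR > 1 && NF > 0 { name = $1 }
       END {
         if (name == "") { print "no backups found" > "/dev/stderr"; exit 1 }
         print name
       }'
}

# On a node:
# sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env \
#     /opt/wal-g/bin/wal-g backup-list | latest_backup
```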
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 minutes

- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Ensure `chef-client` ran successfully on all affected instances.
## Monitoring

### Key metrics to observe

- Metric: Chef Clients
  - Location: https://dashboards.gitlab.net/d/000000231/chef-client?orgId=1&refresh=1m
  - What changes to this metric should prompt a rollback: chef-client runs are failing
- Metric: PostgreSQL backup
  - Location: https://dashboards.gitlab.net/d/000000172/postgresql-backups?orgId=1&refresh=5m&from=now-5m&to=now
  - What changes to this metric should prompt a rollback: wal-g metrics drop to zero
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] The change has been tested in staging and the results noted in a comment on this issue.
- [ ] A dry-run has been conducted and the results noted in a comment on this issue.
- [ ] The SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.