# GPRD: wal-g update to v1.1

**Production Change**

## Change Summary

Update wal-g on production instances to v1.1.

Reference: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13945
## Change Details

- **Services Impacted** - ServicePostgres, ServicePatroni
- **Change Technician** - @rehab
- **Change Reviewer** - @Finotto
- **Time tracking** - 120 minutes
- **Downtime Component** - none
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 5 minutes

- [ ] Set the ~"change::in-progress" label on this issue.
- [ ] Get the necessary approval for https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Run `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-is-enabled"` and paste the output in the comments section.
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 20 minutes

- [ ] Disable chef-client: `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-disable"`.
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Run `chef-client` on one replica Patroni instance and ensure it completes successfully.
- [ ] Ensure `/opt/wal-g/bin/wal-g --version` reports version `v1.1`.
- [ ] Follow all three verification sections under postgresql-backups-wale-walg.md.
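If the version check above needs to be scripted across the fleet, it can be sketched as a small shell helper. This is a sketch only: the exact `wal-g --version` output format varies by build, so we merely assume the version tag (e.g. `v1.1`) appears somewhere in the output, and `assert_walg_version` is our illustrative name, not a wal-g or chef-repo command.

```shell
#!/bin/sh
# Hypothetical helper (not part of wal-g): check that the output of
# `wal-g --version` contains the expected version tag.
assert_walg_version() {
  output="$1"
  expected="$2"
  case "$output" in
    *"$expected"*) echo "OK: wal-g is $expected" ;;
    *) echo "MISMATCH: expected $expected, got: $output" >&2; return 1 ;;
  esac
}

# On a patroni/wal-g node this would be fed from the real binary, e.g.:
# assert_walg_version "$(sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env \
#     /opt/wal-g/bin/wal-g --version)" "v1.1"
```

A non-zero exit status makes the check easy to wire into a `knife ssh` sweep or a CI job.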
### Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - indeterminate (as long as the backup process takes)

- [ ] Check that the WAL-G binary works and shows the expected version: `cd /tmp`, then run `sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g --version`.
- [ ] Check the logs with `sudo tail /var/log/wal-g/wal-g.log`, and on all replicas (or the one where the backup is running) `sudo tail /var/log/wal-g/wal-g_backup_push.log.1`.
- [ ] Once a new full backup has been created, list the available full backups: `sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env /opt/wal-g/bin/wal-g backup-list`.
- [ ] Enable chef-client: `knife ssh -C1 "roles:gprd-base-db-patroni OR roles:gprd-walg" "chef-client-enable"`.
- [ ] After 1-2 days, check that the verification jobs (the "gitlab-restore" project) are not failing: Runbook.
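To automate the `backup-list` check above, a minimal awk filter can pull out the newest entry. This assumes `wal-g backup-list` prints a header row followed by one backup per line with the backup name in the first column (the shape of its tabular output); `latest_backup` is an illustrative name, not a wal-g command.

```shell
#!/bin/sh
# Hypothetical filter: print the name of the last backup row from
# `wal-g backup-list` output (header line skipped), failing if none exist.
latest_backup() {
  awk 'NR > 1 && NF > 0 { name = $1 }
       END {
         if (name == "") { print "no backups found" > "/dev/stderr"; exit 1 }
         print name
       }'
}

# On a node:
# sudo -u gitlab-psql /usr/bin/envdir /etc/wal-g.d/env \
#     /opt/wal-g/bin/wal-g backup-list | latest_backup
```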
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 30 minutes

- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/603.
- [ ] Ensure `chef-client` ran successfully on all affected instances.
## Monitoring

### Key metrics to observe

- Metric: Chef Clients
  - Location: https://dashboards.gitlab.net/d/000000231/chef-client?orgId=1&refresh=1m
  - What changes to this metric should prompt a rollback: chef-client runs are failing
- Metric: PostgreSQL backup
  - Location: https://dashboards.gitlab.net/d/000000172/postgresql-backups?orgId=1&refresh=5m&from=now-5m&to=now
  - What changes to this metric should prompt a rollback: wal-g metrics drop to zero
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above
## Changes checklist

- [ ] This issue has a criticality label (e.g. ~C1, ~C2, ~C3, ~C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] The change has been tested in staging and the results noted in a comment on this issue.
- [ ] A dry-run has been conducted and the results noted in a comment on this issue.
- [ ] The SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.