Increase the disks allocated to the Patroni Cluster
Change Summary
At the time of this writing, the disks from the Patroni cluster are over 88.5% full.
We've had to execute pg_repack several times to reduce bloat and release free space.
This issue covers the production DB disk size increase from 10 TB to 16 TB.
This maintenance has to be done in 2 steps:
- Increase the disk size in Terraform. Be careful to execute this step in LOW PEAK TIME, since it will impact the performance of the hosts. I recommend resizing the disks in GCP node by node, and then updating Terraform to be consistent with the new values.
- Afterwards, we need to execute a `resize2fs` on each host whose disk we increased. It is recommended to execute this in off-peak time too. The command is: `resize2fs /dev/sdb`
Change Details
- Services Impacted - Database
- Change Technician - Craig Barrett, as EOC
- Change Criticality - C1
- Change Type - changescheduled
- Change Reviewer - Henri Philips
- Due Date - Sunday 17th January 2021, 00:00 UTC
- Time tracking - 1 hour
- Downtime Component - this change will not require downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
- Collect a list of affected volumes in this issue for subsequent command/validation output
- Record disk device name, size, and filesystem utilization on all hosts for post-implementation comparison/validation (ref: #2648 (comment 485754664))
- Identify dashboards/metrics required to monitor performance and functionality during the change
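The device/size/utilization recording step can be scripted with a small loop; a sketch assuming the hostname pattern used elsewhere in this issue, and that the data device is `/dev/sdb` mounted at `/var/opt/gitlab` (note `postgres-dr-archive-01` uses `/dev/sdc`, so check that host separately):

```shell
# Record pre-change device size and filesystem utilization on every affected
# host for post-implementation comparison. Hostnames follow this issue's
# naming; postgres-dr-archive-01 uses /dev/sdc and needs a manual check.
hosts=({patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd.c.gitlab-production.internal)
for h in "${hosts[@]}"; do
  echo "=== ${h} ==="
  ssh -o ConnectTimeout=5 "$h" 'lsblk /dev/sdb; df -h /var/opt/gitlab' || echo "unreachable"
done
```

Paste the output into a comment on this issue so the post-change values can be diffed against it.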
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes.
- Participants join the persistent Incident Zoom call ahead of Sunday 17th January 2021, 00:00 UTC
- Create snapshots for all data volumes before merging/applying the terraform MR. For optimal time, create snapshots in parallel for each availability zone.

  ```shell
  # Parallel by zone
  snaps=$(echo {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-c
  snaps=$(echo patroni-0{2,5,8}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot patroni-0{2,5,8}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-d
  snaps=$(echo patroni-0{3,6}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot patroni-0{3,6}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-b

  # Individually
  disk=patroni-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-02-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=patroni-03-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
  disk=patroni-04-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-05-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=patroni-06-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
  disk=patroni-07-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-08-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=postgres-dr-archive-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=postgres-dr-delayed-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  ```
- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2277
- Log in to any patroni node and run `sudo gitlab-patronictl list` to determine which node is primary; adjust the steps as described if the primary has changed since this issue was last updated
  - Adjustment is not necessary as `patroni-06` is still primary
  - Adjustment is necessary as `patroni-06` is no longer the primary
- Ensure the local terraform repository is up to date

  ```shell
  cd LOCAL_GIT_REPO_DIR/gitlab-com-infrastructure
  git checkout master
  git fetch origin
  git reset --hard origin/master
  cd environments/gprd
  ```
Replicas
patroni-01
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
- Validate database health
- Check monitoring for anomalies
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo tail -fn 50 /var/log/postgresql/postgresql-11-main.log'` to check the database log for errors
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'for ((i=0; i<6; i++)); do echo "$(date +%H:%M:%S): $(netstat -an|grep :5432|wc -l)"; sleep 10; done'` to track the number of client connections
patroni-02
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[1]'` to resize the disk
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-04
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[3]'` to resize the disk
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-05
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[4]'` to resize the disk
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-06
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[5]'` to resize the disk
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-07
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[6]'` to resize the disk
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-08
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[7]'` to resize the disk
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
Primary
patroni-03
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[2]'` to resize the disk
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
Archive and Delayed Replica
postgres-dr-archive-01
- Run `tf apply -target='module.postgres-dr-archive.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdc'` to validate the new volume size
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdc'`
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdc'` to verify new filesystem size

postgres-dr-delayed-01
- Run `tf apply -target='module.postgres-dr-delayed.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
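The per-node sequence repeated above can be parameterized; a sketch, reusing the node names, terraform indices, and device names from this issue (the `tf` wrapper and module targets are carried over from the steps above; verify the index-to-node mapping before use):

```shell
# Resize one node's data disk and grow its filesystem, mirroring the manual
# steps above. index maps to module.patroni.google_compute_disk.data_disk[N];
# dev is sdb for patroni nodes, sdc for postgres-dr-archive-01.
resize_node() {
  local node=$1 index=$2 dev=${3:-sdb}
  local host="${node}-db-gprd.c.gitlab-production.internal"
  tf apply -target="module.patroni.google_compute_disk.data_disk[${index}]"
  ssh "$host" "lsblk /dev/${dev}"           # validate the new volume size
  ssh "$host" "sudo resize2fs /dev/${dev}"  # grow the ext4 filesystem online
  ssh "$host" "df -h /dev/${dev}"           # verify the new filesystem size
}
# Usage: resize_node patroni-02 1
```

Keeping the nodes one at a time (rather than looping over all of them) preserves the pause for health checks between hosts.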
Post-Change Steps
Verification
N/A - validation steps included as each device is resized above
Clean up
Remove snapshots after final disk is resized. We do not need to retain the snapshots for any length of time, as the data will age out too quickly for them to be effective long-term.
```shell
gcloud compute snapshots delete {patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648
```
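A quick local check of the brace expansion confirms the cleanup targets exactly the ten snapshots created earlier (note the snapshots were created with a `-data-snap-2648` suffix):

```shell
# Expand the snapshot names the cleanup should target; the snapshots were
# created with the -db-gprd-data-snap-2648 suffix. Pure bash, no gcloud calls.
snaps=({patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648)
printf '%s\n' "${snaps[@]}"
echo "count: ${#snaps[@]}"   # expect 10
# then, to confirm against what actually exists:
#   gcloud compute snapshots list --filter="name~snap-2648"
```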
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete per node (mins) - 15 min
- Revert the production terraform MR to use the old data volume size -- DO NOT APPLY!!!

For each node:
- Open two terminals, one remote, one local
- If the node is primary, initiate a switchover process

Remote commands
- Drain connections to the node
  - `sudo chef-client-disable "Rollback for production#2648"` - disable chef-client
  - Add a `tags` section to `/var/opt/gitlab/patroni/patroni.yml` on the node:

    ```yaml
    tags:
      nofailover: true
      noloadbalance: true
    ```
  - `sudo systemctl reload patroni`
  - Test the efficacy of that reload by checking for the node name in the list of replicas: `dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV`
  - Wait until all client connections are drained from the replica (it depends on the interval value set for the clients); use this command to track the number of client connections:

    ```shell
    while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
    ```
- `sudo systemctl stop patroni` - stop postgresql/patroni
- `sudo systemctl stop pgbouncer && sudo systemctl stop pgbouncer-1 && sudo systemctl stop pgbouncer-2` - stop pgbouncer
- `ls -l /dev/disk/by-id | grep DEVICE`, where `DEVICE` == `sdc` on `postgres-dr-archive-01-db-gprd` and `sdb` on all other nodes; note the `/dev/disk/by-id/google-DEVICE-NAME`
- `sudo lsof +f -- /var/opt/gitlab` - verify there are no open file handles for the volume
- `sudo umount /var/opt/gitlab`
Local commands
- `gcloud compute instances detach-disk INSTANCE_NAME --disk=DISK --zone=ZONE`
- `gcloud compute disks delete DISK --zone ZONE`
- `gcloud compute disks create DISK --source-snapshot SNAPSHOT --size 16TB --type pd-ssd --labels do_snapshots='true',environment=gprd,pet_name=patroni --zone ZONE`
- `gcloud compute instances attach-disk INSTANCE_NAME --disk DISK --device-name DEVICE-NAME --zone ZONE`
- `tf plan` - perform a terraform plan to validate there are no outstanding changes after the revert and disk resize
Remote commands
- `sudo lsblk /dev/DEVICE` to verify the new size, where `DEVICE` == `sdc` on `postgres-dr-archive-01-db-gprd` and `sdb` on all other nodes
- `sudo resize2fs /dev/DEVICE` - expand the filesystem back to fill the now reduced volume
- `sudo mount /var/opt/gitlab`
- `df -h /var/opt/gitlab` - verify filesystem size
- Remove the tags from `/var/opt/gitlab/patroni/patroni.yml`: `tags: {}`
- `sudo systemctl start patroni` - restart patroni
- `sudo systemctl start pgbouncer && sudo systemctl start pgbouncer-1 && sudo systemctl start pgbouncer-2` - restart pgbouncer
- `sudo gitlab-patronictl list` to verify the node re-joined the cluster and check replication lag
- Verify client connections ramping up:

  ```shell
  while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
  ```
- `sudo chef-client-enable` - re-enable chef-client
Fallback
If the above process to gracefully resize the volume back down fails or is otherwise not possible, use the snapshots created before starting maintenance to restore the device to its pre-maintenance state. This will require draining/stopping the instance, unmounting the filesystem, removing the current disk, and deleting it via GCP so that we can recreate a new disk of the same name from the snapshot. If the names change, this will have unintended effects with our current terraform code.
Reference #2115 (closed) for a similar exercise conducted in the past
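The detach/delete/recreate/attach cycle described above could look like the following for one node; a hypothetical sketch only (instance, disk, and snapshot names follow this issue's conventions but must be verified per node, as must the zone):

```shell
# Hypothetical fallback for one node: replace the current disk with one
# recreated from its pre-maintenance snapshot, keeping the disk name unchanged
# so terraform state stays consistent. Verify node, zone, and names first.
fallback_node() {
  local node=$1 zone=$2
  local disk="${node}-db-gprd-data"
  gcloud compute instances detach-disk "${node}-db-gprd" --disk="${disk}" --zone="${zone}"
  gcloud compute disks delete "${disk}" --zone="${zone}" --quiet
  gcloud compute disks create "${disk}" --source-snapshot="${disk}-snap-2648" --type=pd-ssd --zone="${zone}"
  gcloud compute instances attach-disk "${node}-db-gprd" --disk="${disk}" --device-name="${disk}" --zone="${zone}"
}
# Usage: fallback_node patroni-01 us-east1-c
```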
Monitoring
We should monitor the following dashboards:
- Disk Usage: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=6&orgId=1&from=now-3h&to=now
- IO usage: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=12&orgId=1&from=now-3h&to=now
- TPS: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=2&orgId=1&from=now-3h&to=now
- Load average: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=4&orgId=1&from=now-3h&to=now
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.