Increase the disks allocated to the Patroni Cluster
Change Summary
At the time of this writing, the disks from the Patroni cluster are over 88.5% full.
We've had to execute pg_repack several times to reduce bloat and release free space.
This issue covers the production DB disk size increase from 10 TB to 16 TB.
This maintenance has to be done in 2 steps:
- Increase the disk size in Terraform. Be careful to execute this step in LOW PEAK TIME, since it will impact the performance of the hosts. I recommend resizing the disks in GCP node by node, and then updating Terraform to be consistent with the new values.
- Afterwards, we need to execute a `resize2fs` on each host whose disk we increased. It is recommended to execute this in off-peak time too. The command is: `resize2fs /dev/sdb`
Change Details
- Services Impacted - Database
- Change Technician - Craig Barrett, as EOC
- Change Criticality - C1
- Change Type - changescheduled
- Change Reviewer - Henri Philips
- Due Date - Sunday 17th January 2021, 00:00 UTC
- Time tracking - 1 hour
- Downtime Component - this change will not require downtime
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
- Collect a list of affected volumes in this issue for subsequent command/validation output
- Record disk device name, size, and filesystem utilization on all hosts for post-implementation comparison/validation (ref: #2648 (comment 485754664))
- Identify dashboards/metrics required to monitor performance and functionality during the change
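The device/size/utilization recording step can be scripted with a small loop; a sketch assuming the hostname pattern used elsewhere in this issue, and that the data device is `/dev/sdb` mounted at `/var/opt/gitlab` (note `postgres-dr-archive-01` uses `/dev/sdc`, so check that host separately):

```shell
# Record pre-change device size and filesystem utilization on every affected
# host for post-implementation comparison. Hostnames follow this issue's
# naming; postgres-dr-archive-01 uses /dev/sdc and needs a manual check.
hosts=({patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd.c.gitlab-production.internal)
for h in "${hosts[@]}"; do
  echo "=== ${h} ==="
  ssh -o ConnectTimeout=5 "$h" 'lsblk /dev/sdb; df -h /var/opt/gitlab' || echo "unreachable"
done
```

Paste the output into a comment on this issue so the post-change values can be diffed against it.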
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes.
- Participants join the persistent Incident Zoom call ahead of Sunday 17th January 2021, 00:00 UTC
- Create snapshots for all data volumes before merging/applying the terraform MR. For optimal time, create snapshots in parallel for each availability zone.

  ```shell
  # Parallel by zone
  snaps=$(echo {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-c
  snaps=$(echo patroni-0{2,5,8}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot patroni-0{2,5,8}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-d
  snaps=$(echo patroni-0{3,6}-db-gprd-data-snap-2648 | tr ' ' ',')
  gcloud compute disks snapshot patroni-0{3,6}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-b

  # Individually
  disk=patroni-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-02-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=patroni-03-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
  disk=patroni-04-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-05-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=patroni-06-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
  disk=patroni-07-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=patroni-08-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
  disk=postgres-dr-archive-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  disk=postgres-dr-delayed-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  ```
- Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2277
- Log in to any patroni node and run `sudo gitlab-patronictl list` to determine which node is primary; adjust the steps as described if the primary has changed since this issue was last updated
  - Adjustment is not necessary as `patroni-06` is still primary
  - Adjustment is necessary as `patroni-06` is no longer the primary
- Ensure the local terraform repository is up to date

  ```shell
  cd LOCAL_GIT_REPO_DIR/gitlab-com-infrastructure
  git checkout master
  git fetch origin
  git reset --hard origin/master
  cd environments/gprd
  ```
Replicas
patroni-01
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
- Validate database health
- Check monitoring for anomalies
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo tail -fn 50 /var/log/postgresql/postgresql-11-main.log'` to check the database log for errors
- Run `host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'for ((i=0; i<6; i++)); do echo "$(date +%H:%M:%S): $(netstat -an|grep :5432|wc -l)"; sleep 10; done'` to track the number of client connections
patroni-02
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[1]'` to resize the disk
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-04
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[3]'` to resize the disk
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-05
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[4]'` to resize the disk
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-06
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[5]'` to resize the disk
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-07
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[6]'` to resize the disk
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size

patroni-08
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[7]'` to resize the disk
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
Primary
patroni-03
- Run `tf apply -target='module.patroni.google_compute_disk.data_disk[2]'` to resize the disk
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
Archive and Delayed Replica
postgres-dr-archive-01
- Run `tf apply -target='module.postgres-dr-archive.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdc'` to validate the new volume size
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdc'`
- Run `host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdc'` to verify new filesystem size

postgres-dr-delayed-01
- Run `tf apply -target='module.postgres-dr-delayed.google_compute_disk.data_disk[0]'` to resize the disk
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb'` to validate the new volume size
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'`
- Run `host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb'` to verify new filesystem size
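The per-node sequence repeated above can be parameterized; a sketch, reusing the node names, terraform indices, and device names from this issue (the `tf` wrapper and module targets are carried over from the steps above; verify the index-to-node mapping before use):

```shell
# Resize one node's data disk and grow its filesystem, mirroring the manual
# steps above. index maps to module.patroni.google_compute_disk.data_disk[N];
# dev is sdb for patroni nodes, sdc for postgres-dr-archive-01.
resize_node() {
  local node=$1 index=$2 dev=${3:-sdb}
  local host="${node}-db-gprd.c.gitlab-production.internal"
  tf apply -target="module.patroni.google_compute_disk.data_disk[${index}]"
  ssh "$host" "lsblk /dev/${dev}"           # validate the new volume size
  ssh "$host" "sudo resize2fs /dev/${dev}"  # grow the ext4 filesystem online
  ssh "$host" "df -h /dev/${dev}"           # verify the new filesystem size
}
# Usage: resize_node patroni-02 1
```

Keeping the nodes one at a time (rather than looping over all of them) preserves the pause for health checks between hosts.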
Post-Change Steps
Verification
N/A - validation steps included as each device is resized above
Clean up
Remove snapshots after final disk is resized. We do not need to retain the snapshots for any length of time, as the data will age out too quickly for them to be effective long-term.
```shell
gcloud compute snapshots delete {patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648
```
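A quick local check of the brace expansion confirms the cleanup targets exactly the ten snapshots created earlier (note the snapshots were created with a `-data-snap-2648` suffix):

```shell
# Expand the snapshot names the cleanup should target; the snapshots were
# created with the -db-gprd-data-snap-2648 suffix. Pure bash, no gcloud calls.
snaps=({patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648)
printf '%s\n' "${snaps[@]}"
echo "count: ${#snaps[@]}"   # expect 10
# then, to confirm against what actually exists:
#   gcloud compute snapshots list --filter="name~snap-2648"
```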
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete per node (mins) - 15 min
- Revert the production terraform MR to use the old data volume size -- DO NOT APPLY!!!

For each node:
- Open two terminals, one remote, one local
- If the node is primary, initiate a switchover process

Remote commands
- Drain connections to the node
  - `sudo chef-client-disable "Rollback for production#2648"` - disable chef-client
  - Add a `tags` section to `/var/opt/gitlab/patroni/patroni.yml` on the node:

    ```yaml
    tags:
      nofailover: true
      noloadbalance: true
    ```
  - `sudo systemctl reload patroni`
  - Test the efficacy of that reload by checking for the node name in the list of replicas: `dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV`
  - Wait until all client connections are drained from the replica (it depends on the interval value set for the clients); use this command to track the number of client connections:

    ```shell
    while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
    ```
- `sudo systemctl stop patroni` - stop postgresql/patroni
- `sudo systemctl stop pgbouncer && sudo systemctl stop pgbouncer-1 && sudo systemctl stop pgbouncer-2` - stop pgbouncer
- `ls -l /dev/disk/by-id | grep DEVICE`, where `DEVICE` == `sdc` on `postgres-dr-archive-01-db-gprd` and `sdb` on all other nodes; note the `/dev/disk/by-id/google-DEVICE-NAME`
- `sudo lsof +f -- /var/opt/gitlab` - verify there are no open file handles for the volume
- `sudo umount /var/opt/gitlab`
Local commands
- `gcloud compute instances detach-disk INSTANCE_NAME --disk=DISK --zone=ZONE`
- `gcloud compute disks delete DISK --zone ZONE`
- `gcloud compute disks create DISK --source-snapshot SNAPSHOT --size 16TB --type pd-ssd --labels do_snapshots='true',environment=gprd,pet_name=patroni --zone ZONE`
- `gcloud compute instances attach-disk INSTANCE_NAME --disk DISK --device-name DEVICE-NAME --zone ZONE`
- `tf plan` - perform a terraform plan to validate there are no outstanding changes after the revert and disk resize
Remote commands
- `sudo lsblk /dev/DEVICE` to verify the new size, where `DEVICE` == `sdc` on `postgres-dr-archive-01-db-gprd` and `sdb` on all other nodes
- `sudo resize2fs /dev/DEVICE` - expand the filesystem back to fill the now reduced volume
- `sudo mount /var/opt/gitlab`
- `df -h /var/opt/gitlab` - verify filesystem size
- Remove the tags from `/var/opt/gitlab/patroni/patroni.yml`: `tags: {}`
- `sudo systemctl start patroni` - restart patroni
- `sudo systemctl start pgbouncer && sudo systemctl start pgbouncer-1 && sudo systemctl start pgbouncer-2` - restart pgbouncer
- `sudo gitlab-patronictl list` to verify the node re-joined the cluster and check replication lag
- Verify client connections ramping up:

  ```shell
  while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;'; done | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
  ```
- `sudo chef-client-enable` - re-enable chef-client
Fallback
If the above process to gracefully resize the volume back down fails or is otherwise not possible, use the snapshots created before starting maintenance to restore the device to its pre-maintenance state. This will require draining/stopping the instance, unmounting the filesystem, removing the current disk, and deleting it via GCP so that we can recreate a new disk of the same name from the snapshot. If the names change, this will have unintended effects with our current terraform code.
Reference #2115 (closed) for a similar exercise conducted in the past
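The detach/delete/recreate/attach cycle described above could look like the following for one node; a hypothetical sketch only (instance, disk, and snapshot names follow this issue's conventions but must be verified per node, as must the zone):

```shell
# Hypothetical fallback for one node: replace the current disk with one
# recreated from its pre-maintenance snapshot, keeping the disk name unchanged
# so terraform state stays consistent. Verify node, zone, and names first.
fallback_node() {
  local node=$1 zone=$2
  local disk="${node}-db-gprd-data"
  gcloud compute instances detach-disk "${node}-db-gprd" --disk="${disk}" --zone="${zone}"
  gcloud compute disks delete "${disk}" --zone="${zone}" --quiet
  gcloud compute disks create "${disk}" --source-snapshot="${disk}-snap-2648" --type=pd-ssd --zone="${zone}"
  gcloud compute instances attach-disk "${node}-db-gprd" --disk="${disk}" --device-name="${disk}" --zone="${zone}"
}
# Usage: fallback_node patroni-01 us-east1-c
```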
Monitoring
We should monitor the following dashboards:
- Disk Usage: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=6&orgId=1&from=now-3h&to=now
- IO usage: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=12&orgId=1&from=now-3h&to=now
- TPS: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=2&orgId=1&from=now-3h&to=now
- Load average: https://dashboards.gitlab.net/d/dxvAF8ZGk/database-capacity-and-saturation-analysis-7-days?viewPanel=4&orgId=1&from=now-3h&to=now
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Summary of the above
Changes checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- There are currently no active incidents.