
Increase the disks allocated to the Patroni Cluster

Change Summary

As of this writing, the data disks in the Patroni cluster are over 88.5% full.

We have had to run pg_repack several times to reduce bloat and release free space.

This issue covers the production DB disk size increase from 10 TB to 16 TB.

This maintenance has to be done in 2 steps:

  1. Increase the disk size in Terraform. Be careful to execute this step during LOW PEAK TIME, since it will impact the performance of the hosts. We recommend resizing in GCP node by node, and then updating Terraform so its values are consistent with what was applied.
  2. Afterwards, run resize2fs on each host whose disk was increased. This is also best executed during off-peak time.

The command is: resize2fs /dev/sdb

Change Details

  1. Services Impacted - Database
  2. Change Technician - Craig Barrett, as EOC
  3. Change Criticality - C1
  4. Change Type - changescheduled
  5. Change Reviewer - Henri Philips
  6. Due Date - Sunday 17th January 2021, 00:00 UTC
  7. Time tracking - 1 hour
  8. Downtime Component - this change will not require downtime

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

  1. Collect a list of affected volumes in this issue for subsequent command/validation output
  2. Record disk device name, size, and filesystem utilization on all hosts for post-implementation comparison/validation (ref: #2648 (comment 485754664))
  3. Identify dashboards/metrics required to monitor performance and functionality during the change
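
The host inventory for step 2 can be generated with the same brace expansion used elsewhere in this issue. A minimal sketch, printing the collection commands for review rather than executing them (note that postgres-dr-archive-01 uses /dev/sdc, not /dev/sdb, per the steps below):

```shell
# Hedged sketch: enumerate the affected hosts and print the commands that
# record device name, size, and filesystem utilization for later comparison.
hosts=$(echo patroni-0{1..8} postgres-dr-{archive,delayed}-01)
for h in ${hosts}; do
  fqdn="${h}-db-gprd.c.gitlab-production.internal"
  # Printed rather than executed so the full list can be reviewed first.
  echo "ssh ${fqdn} 'lsblk /dev/sdb; df -h /var/opt/gitlab'"
done
```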

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes.

  1. Participants join the persistent Incident Zoom call ahead of Sunday 17th January 2021, 00:00 UTC
  2. Create snapshots for all data volumes before merging/applying the terraform MR. To save time, create the snapshots in parallel, one batch per availability zone.
    # Parallel by zone
    snaps=$(echo {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648|tr ' ' ',')
    gcloud compute disks snapshot {patroni-0{1,4,7},postgres-dr-{archive,delayed}-01}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-c
    
    snaps=$(echo patroni-0{2,5,8}-db-gprd-data-snap-2648|tr ' ' ',')
    gcloud compute disks snapshot patroni-0{2,5,8}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-d
    
    snaps=$(echo patroni-0{3,6}-db-gprd-data-snap-2648|tr ' ' ',')
    gcloud compute disks snapshot patroni-0{3,6}-db-gprd-data --description "production#2648 patroni data disk resize $(date -u +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${snaps} --zone=us-east1-b
    
    # Individually
    disk=patroni-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
    disk=patroni-02-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
    disk=patroni-03-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
    disk=patroni-04-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
    disk=patroni-05-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
    disk=patroni-06-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-b
    disk=patroni-07-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
    disk=patroni-08-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-d
    disk=postgres-dr-archive-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
    disk=postgres-dr-delayed-01-db-gprd-data; gcloud compute disks snapshot ${disk} --description "production#2648 patroni data disk resize $(date +%Y-%m-%dT%H:%M:%S%z)" --snapshot-names=${disk}-snap-2648 --zone=us-east1-c
  3. Merge https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2277
  4. Log in to any patroni node and run sudo gitlab-patronictl list to determine which node is primary; adjust the steps as described if the primary has changed since this issue was last updated
    1. Adjustment is not necessary as patroni-06 is still primary
    2. Adjustment is necessary as patroni-06 is no longer the primary
  5. Ensure local terraform repository is up to date
    cd LOCAL_GIT_REPO_DIR/gitlab-com-infrastructure
    git checkout master
    git fetch origin
    git reset --hard origin/master
    cd environments/gprd
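
For step 4, the primary can be read straight from the table that `sudo gitlab-patronictl list` prints. A minimal sketch of the parsing; the sample row below is illustrative of patronictl's pipe-delimited output format, not captured from production:

```shell
# Extract the member name from the row whose role is "Leader".
# The member name and IP in this sample row are illustrative values.
sample_row='| patroni-06-db-gprd | 10.0.0.6 | Leader | running |'
primary=$(echo "${sample_row}" | awk -F'|' '/Leader/ {gsub(/ /, "", $2); print $2}')
echo "${primary}"   # patroni-06-db-gprd
```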

Replicas

patroni-01
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[0]' to resize the disk
  2. Run host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
  5. Validate database health
    1. Check monitoring for anomalies
    2. Run host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo tail -fn 50 /var/log/postgresql/postgresql-11-main.log' to check the database log for errors
    3. Run host=patroni-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'for ((i=0; i<6; i++)); do echo "$(date +%H:%M:%S): $(netstat -an|grep :5432|wc -l)"; sleep 10; done'
patroni-02
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[1]' to resize the disk
  2. Run host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-02-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
patroni-04
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[3]' to resize the disk
  2. Run host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-04-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
patroni-05
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[4]' to resize the disk
  2. Run host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-05-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
patroni-06
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[5]' to resize the disk
  2. Run host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-06-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
patroni-07
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[6]' to resize the disk
  2. Run host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-07-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
patroni-08
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[7]' to resize the disk
  2. Run host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-08-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size

Primary

patroni-03
  1. Run tf apply -target='module.patroni.google_compute_disk.data_disk[2]' to resize the disk
  2. Run host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=patroni-03-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size

Archive and Delayed Replica

postgres-dr-archive-01
  1. Run tf apply -target='module.postgres-dr-archive.google_compute_disk.data_disk[0]' to resize the disk
  2. Run host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdc' to validate the new volume size.
  3. Run host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdc'
  4. Run host=postgres-dr-archive-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdc' to verify new filesystem size
postgres-dr-delayed-01
  1. Run tf apply -target='module.postgres-dr-delayed.google_compute_disk.data_disk[0]' to resize the disk
  2. Run host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'lsblk /dev/sdb' to validate the new volume size.
  3. Run host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'sudo resize2fs /dev/sdb'
  4. Run host=postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal; ssh ${host} 'df -h /dev/sdb' to verify new filesystem size
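
The per-node sequence above is identical for every host except the terraform target and, on postgres-dr-archive-01, the device name, so it can be expressed as one helper. A minimal sketch that prints the commands instead of executing them; the `tf` alias and module targets are the ones used in the steps above:

```shell
# Hedged sketch of the repeated resize-and-validate sequence per node.
resize_node() {
  local name="$1" target="$2" dev="${3:-sdb}"
  local fqdn="${name}-db-gprd.c.gitlab-production.internal"
  echo "tf apply -target='${target}'"              # grow the GCP disk
  echo "ssh ${fqdn} 'lsblk /dev/${dev}'"           # confirm new volume size
  echo "ssh ${fqdn} 'sudo resize2fs /dev/${dev}'"  # grow the filesystem
  echo "ssh ${fqdn} 'df -h /dev/${dev}'"           # confirm new filesystem size
}

# Example: a replica, and the archive replica (which uses sdc).
resize_node patroni-01 'module.patroni.google_compute_disk.data_disk[0]'
resize_node postgres-dr-archive-01 'module.postgres-dr-archive.google_compute_disk.data_disk[0]' sdc
```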

Post-Change Steps

Verification

N/A - validation steps included as each device is resized above

Clean up

Remove snapshots after final disk is resized. We do not need to retain the snapshots for any length of time, as the data will age out too quickly for them to be effective long-term.

gcloud compute snapshots delete {patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648
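
Since the delete relies on the same brace expansion used at snapshot-creation time, it is worth expanding the names locally and counting them before running the command. A minimal sketch:

```shell
# Expand the snapshot names into an array and sanity-check the list:
# 8 patroni nodes + 2 DR replicas = 10 snapshots.
snaps=({patroni-0{1..8},postgres-dr-{archive,delayed}-01}-db-gprd-data-snap-2648)
printf '%s\n' "${snaps[@]}"
echo "count: ${#snaps[@]}"   # count: 10
```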

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete per node (mins) - 15 min

  1. Revert the production terraform MR to use old data volume size -- DO NOT APPLY!!!

For each node

  1. Open two terminals, one remote, one local
  2. If node is primary, initiate a switchover process
Remote commands
  1. Drain connections to the node
    1. sudo chef-client-disable "Rollback for production#2648" - disable chef-client
    2. Add a tags section to /var/opt/gitlab/patroni/patroni.yml on the node:
      tags:
        nofailover: true
        noloadbalance: true
    3. sudo systemctl reload patroni
    4. Test the efficacy of that reload by checking for the node name in the list of replicas:
      dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV
      If the name is absent, then the reload worked.
    5. Wait until all client connections are drained from the replica (it depends on the interval value set for the clients), use this command to track number of client connections:
      while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;';  done  | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
  2. sudo systemctl stop patroni - Stop postgresql/patroni
  3. sudo systemctl stop pgbouncer && sudo systemctl stop pgbouncer-1 && sudo systemctl stop pgbouncer-2 - Stop pgbouncer
  4. ls -l /dev/disk/by-id|grep DEVICE where DEVICE==sdc on postgres-dr-archive-01-db-gprd and sdb on all other nodes; note the /dev/disk/by-id/google-DEVICE-NAME
  5. sudo lsof +f -- /var/opt/gitlab - Verify there are no open file handles for the volume
  6. sudo umount /var/opt/gitlab
Local commands
  1. gcloud compute instances detach-disk INSTANCE_NAME --disk=DISK --zone=ZONE
  2. gcloud compute disks delete DISK --zone ZONE
  3. gcloud compute disks create DISK --source-snapshot SNAPSHOT --size 16TB --type pd-ssd --labels do_snapshots='true',environment=gprd,pet_name=patroni --zone ZONE
  4. gcloud compute instances attach-disk INSTANCE_NAME --disk DISK --device-name DEVICE-NAME --zone ZONE
  5. tf plan - Perform a terraform plan to validate there are no outstanding changes after the revert and disk resize
Remote commands
  1. sudo lsblk /dev/DEVICE to verify new size, where DEVICE==sdc on postgres-dr-archive-01-db-gprd and sdb on all other nodes
  2. sudo resize2fs /dev/sdb - Resize the filesystem to fill the restored volume
  3. sudo mount /var/opt/gitlab
  4. df -h /var/opt/gitlab - Verify filesystem size
  5. Remove tags from /var/opt/gitlab/patroni/patroni.yml tags: {}
  6. sudo systemctl start patroni - Restart patroni
  7. sudo systemctl start pgbouncer && sudo systemctl start pgbouncer-1 && sudo systemctl start pgbouncer-2 - Restart pgbouncer
  8. sudo gitlab-patronictl list to verify node re-joined the cluster and check replication lag
  9. Verify client connections ramping up
    while true; do for c in /usr/local/bin/pgb-console*; do sudo $c -c 'SHOW CLIENTS;';  done  | grep gitlabhq_production | cut -d '|' -f 2 | awk '{$1=$1};1' | grep -v gitlab-monitor | wc -l; sleep 5; done
  10. sudo chef-client-enable - Re-enable chef-client
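
The four local gcloud steps in the rollback can be strung together per node. A minimal dry-run sketch that prints the commands for review; the instance, disk, snapshot, zone, and device-name arguments in the example call are placeholders following this issue's naming pattern, and the size/type mirror the steps above:

```shell
# Hedged sketch of the local detach/delete/recreate/attach sequence.
rollback_disk() {
  local instance="$1" disk="$2" snapshot="$3" zone="$4" device="$5"
  echo "gcloud compute instances detach-disk ${instance} --disk=${disk} --zone=${zone}"
  echo "gcloud compute disks delete ${disk} --zone=${zone}"
  echo "gcloud compute disks create ${disk} --source-snapshot=${snapshot} --size=16TB --type=pd-ssd --zone=${zone}"
  echo "gcloud compute instances attach-disk ${instance} --disk=${disk} --device-name=${device} --zone=${zone}"
}

rollback_disk patroni-01-db-gprd patroni-01-db-gprd-data \
  patroni-01-db-gprd-data-snap-2648 us-east1-c patroni-01-db-gprd-data
```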

Fallback

If the above process to gracefully resize the volume back down fails or is otherwise not possible, utilize the snapshots created before starting maintenance to restore the device to its pre-maintenance state. This will require draining/stopping the instance, unmounting the filesystem, detaching the current disk, and deleting it via GCP so that a new disk of the same name can be recreated from the snapshot. If the names change, this will have unintended effects with our current terraform code.

Reference #2115 (closed) for a similar exercise conducted in the past

Monitoring

We should monitor the following dashboards:

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

Summary of the above

Changes checklist

  • This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. changeunscheduled, changescheduled) based on the Change Management Criticalities.
  • This issue has the change technician as the assignee.
  • Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
  • Necessary approvals have been completed based on the Change Management Workflow.
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  • There are currently no active incidents.
Edited by Craig Barrett