CR - GPRD - Upgrade Ubuntu on PGBouncer nodes

Change Summary

Execution in GSTG: #7966 (closed)

This will upgrade the pgbouncer nodes from Ubuntu 16.04 to 20.04. It is not expected to require downtime: we will drain connections from each pgbouncer node in turn, shifting traffic to the remaining pgbouncer nodes during the upgrade.

The number of connections from a pgbouncer node to the underlying database is fixed, so removing a node for service reduces the total available connections by 1/n, where n is the number of pgbouncer nodes. To account for this, we will adjust the pool_size attribute in Chef to maintain the current overall pool size, and reset it when the maintenance is complete.
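As a worked example of that math (the `adjusted_pool_size` helper below is hypothetical, not part of the CR tooling): the adjusted per-node pool_size for an n-node cluster with one node out of service is ceil(pool_size * n / (n - 1)), which is where the 50->75 and 33->50 values used later come from.

```shell
# Hypothetical helper illustrating the pool_size arithmetic.
# total connections = pool_size * n; with one node out of service,
# each remaining node must carry total / (n - 1), rounded up.
adjusted_pool_size() {
  local pool_size=$1 n=$2
  # integer ceiling of (pool_size * n) / (n - 1)
  echo $(( (pool_size * n + n - 2) / (n - 1) ))
}

adjusted_pool_size 50 3   # core/registry clusters: 50 -> 75
adjusted_pool_size 33 3   # sidekiq clusters: 33 -> 50 (49.5 rounded up)
```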

Change Details

  1. Services Impacted - ServicePgbouncer
  2. Change Technician - @devin
  3. Change Reviewer - @ayeung @rhenchen.gitlab
  4. Time tracking - 90 minutes
  5. Downtime Component - n/a

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 15 minutes

  1. Set label change::in-progress on this issue

  2. Ensure that we are running a version of the bootstrap module newer than 4.6.7 in production: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4652

  3. Set up two terminal sessions in tmux or multiple windows.

    1. Change directories to your local config-mgmt/environments/gprd directory and make sure terraform commands are working. If your vault-proxy alias requires a separate window, open an additional one for that.

      vault-proxy # or your equivalent local alias
      vault-login # or execute "vault login -method oidc" outside of the recorded call
      git pull
      tf init -upgrade
    2. Set up the temporary working directory and environment for this CR on your workstation

      export chef_filter='roles:gprd-base-db-pgbouncer-pool'
      export gitlab_env=gprd
      export gitlab_project=gitlab-production
      export gitlab_region=us-east1
      export issue_id='7967'
      export workdir="/tmp/${USER}-cr${issue_id}"
      mkdir -p ${workdir}
      cd ${workdir}
  4. Validate list of target hosts

    1. Check current OS versions

      knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"
    2. Save a copy of the host list

      knife search node -i "${chef_filter}" | egrep -v 'pgbouncer-[a-zA-Z0-9_]+-test' | sort -u | tee hosts.${gitlab_env}.pgbouncer

    The sorted output should match the list below (note that we will be upgrading these in a different order):

    15 items found
    
    pgbouncer-01-db-gprd.c.gitlab-production.internal
    pgbouncer-02-db-gprd.c.gitlab-production.internal
    pgbouncer-03-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-03-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-01-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-02-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-03-db-gprd.c.gitlab-production.internal
  5. Take a snapshot of the boot disk for each node. These can be used for later reference, or to create images for rollback if necessary. Paste each of the following sections into a separate terminal to reduce the overall completion time.

    for host in pgbouncer-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-ci-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-registry-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-sidekiq-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-sidekiq-ci-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
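    For review, the five loops above can also be collapsed into a single loop over the cluster prefixes. This is a sketch: with the default DRY_RUN=1 it only prints what it would do, and it only issues the gcloud commands when DRY_RUN=0 (the one-loop-per-terminal layout above remains the faster option for execution).

    ```shell
    # Consolidated form of the five snapshot loops above (review sketch).
    snapshot_all_pgbouncer_disks() {
      local cluster host snap boot_disk
      for cluster in pgbouncer pgbouncer-ci pgbouncer-registry pgbouncer-sidekiq pgbouncer-sidekiq-ci; do
        for host in "${cluster}"-0{1,2,3}-db-gprd; do
          snap="${host}-${issue_id}-$(date '+%Y%m%d')"
          if [ "${DRY_RUN:-1}" = 1 ]; then
            # dry run: print the plan only
            echo "would snapshot ${host} as ${snap}"
            continue
          fi
          boot_disk=$(gcloud --project "$gitlab_project" compute disks list --format=json \
            --filter="name~^${host}\$" | jq -r '.[].selfLink')
          gcloud --project "$gitlab_project" compute disks snapshot "${boot_disk}" \
            --snapshot-names="${snap}" \
            --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
        done
      done
    }
    snapshot_all_pgbouncer_disks
    ```

    The dry-run output should list exactly the 15 hosts saved in hosts.${gitlab_env}.pgbouncer.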

    TO BE COMPLETED IMMEDIATELY PRIOR TO STARTING MAINTENANCE

  6. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to update the pool_size attribute so the overall connection count for each cluster does not decrease when we remove each node for service

  7. Ensure the above change has been applied to all nodes before removing any from service; manually trigger chef-client on target nodes if necessary

    # TODO: Look up and adjust the attributes to match the clusters being upgraded; this list may not be consistent across all environments
    
    # Core pgbouncer nodes (50->75)
    for node in $(egrep -v 'sidekiq|registry' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')"
    done
    
    # Registry pgbouncer nodes (50->75)
    for node in $(egrep 'registry' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_registry.pool_size')"
    done
    
    # Sidekiq pgbouncer nodes (33->50)
    for node in $(egrep 'sidekiq' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production_sidekiq.pool_size')"
    done
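    To make mismatches stand out in the loop output (for example, a node that has not yet picked up the merged change), a small helper can wrap each check. `check_pool_size` is hypothetical, not an existing command:

    ```shell
    # Hypothetical helper: flag nodes whose reported pool_size does not yet
    # match the expected post-merge value.
    check_pool_size() {
      local node=$1 got=$2 want=$3
      if [ "$got" = "$want" ]; then
        echo "OK    $node pool_size=$got"
      else
        echo "STALE $node pool_size=$got (expected $want)"
        return 1
      fi
    }

    # e.g. inside the core-cluster loop above:
    #   check_pool_size "$node" "$(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')" 75
    ```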
  8. Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window

    knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance (OS upgrades); see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
  9. Merge config-mgmt!4558 to update the boot image to 20.04

  10. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2620 chef role change for the systemd-resolved fix

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

  1. pgbouncer-sidekiq-ci-01-db-gprd

    1. Set/verify target hostname

      export i=01
      export j=1
      export zone='us-east1-c'
      export cluster='pgbouncer-sidekiq-ci'
      export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    2. Drain connections and stop pgbouncer

      1. Set pgbouncer to gracefully disconnect open sessions when they become idle

        ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
        ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
      2. Tell the load balancer to stop sending new connections to this node

        gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    3. Monitor the progress of the drain from another terminal session (outside of the primary tmux session), using either or both of the following on the pgbouncer node

      # show client connection details
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
      # or simply watch the total number of active clients
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep -c gitlabhq_production_sidekiq"'
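      To avoid watching manually, the drain can also be polled with a small helper. `wait_for_drain` is hypothetical; pass it a timeout and the counting pipeline above as the command:

      ```shell
      # Hypothetical helper: poll a client-count command every 5s until it
      # reports 0 (or its grep -c exits non-zero on zero matches), giving up
      # after timeout_s seconds.
      wait_for_drain() {
        local timeout_s=$1; shift
        local waited=0 count
        while count=$("$@") && [ "$count" -gt 0 ]; do
          [ "$waited" -ge "$timeout_s" ] && return 1
          sleep 5
          waited=$((waited + 5))
        done
        return 0
      }

      # e.g.:
      # wait_for_drain 600 ssh $remote_host \
      #   'sudo pgb-console -c "SHOW CLIENTS;" | grep -c gitlabhq_production_sidekiq' \
      #   && echo drained
      ```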
    4. Shut down the instance

      ssh $remote_host 'sudo shutdown'
    5. Remove the old node from Chef (terraform will destroy and recreate the instance itself in the steps below)

      knife client delete $remote_host ; knife node delete $remote_host
    6. Disable deletion protection

      gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    7. (from the terraform directory) Rebuild the instance via terraform

      tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${workdir}/${cluster}.plan -target="module.${cluster}"
      tf apply ${workdir}/${cluster}.plan
    8. Monitor progress of rebuild

      gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
    9. Remove the old host key from your known_hosts file so you can connect

      ssh-keygen -R $remote_host
    10. When the startup scripts complete, verify that Consul connectivity is working

      ssh $remote_host 'dig replica.patroni.service.consul'
    11. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production

    12. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c

    13. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now

    14. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main

    15. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
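    The "Repeat for ..." sections below follow this exact sequence with different exports. As a cross-check, this print-only sketch emits the expected command list for a given node so each section can be eyeballed against generated output; nothing here touches production. (The `$((10#$i - 1))` form is shell arithmetic equivalent to the `expr $i - 1` used below, made explicit about the leading-zero index.)

    ```shell
    # Print-only sketch of the per-node command sequence above.
    # Assumes gitlab_env, gitlab_project and gitlab_region are exported
    # as in the pre-change steps.
    print_node_runbook() {
      local cluster=$1 i=$2 j=$3 zone=$4
      local remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
      echo "ssh $remote_host 'sudo pgb-console -c \"SET client_idle_timeout = 30;\"'"
      echo "gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}"
      echo "ssh $remote_host 'sudo shutdown'"
      echo "knife client delete $remote_host ; knife node delete $remote_host"
      echo "gcloud --project $gitlab_project compute instances update ${cluster}-${i}-db-${gitlab_env} --no-deletion-protection --zone=${zone}"
      echo "tf plan -replace=\"module.${cluster}.google_compute_instance.default[$((10#$i - 1))]\" -replace=\"module.${cluster}.google_compute_instance_group.default[$j]\" -out=${cluster}.plan -target=\"module.${cluster}\""
      echo "tf apply ${cluster}.plan"
      echo "ssh-keygen -R $remote_host"
      echo "ssh $remote_host 'dig replica.patroni.service.consul'"
    }

    # e.g.: print_node_runbook pgbouncer-sidekiq 01 1 us-east1-c
    ```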

  2. Repeat for pgbouncer-sidekiq-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  3. Repeat for pgbouncer-registry-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone} 
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  4. Repeat for pgbouncer-ci-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  5. Repeat for pgbouncer-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  6. Repeat for pgbouncer-sidekiq-ci-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  7. Repeat for pgbouncer-sidekiq-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  8. Repeat for pgbouncer-registry-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  9. Repeat for pgbouncer-ci-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  10. Repeat for pgbouncer-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    # encourage existing clients to disconnect, and confirm the setting took effect
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    # remove the node from the load balancer so no new clients arrive
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    # watch until the remaining client connections have drained
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    # stop the node once drained
    ssh $remote_host 'sudo shutdown'
    
    # from workstation: deregister the node from chef
    knife client delete $remote_host ; knife node delete $remote_host
    
    # allow terraform to delete and re-create the instance
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    # re-provision the instance (and its instance group) on the new boot image
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    # follow the serial console output while the new node boots and runs chef
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
    
    # clear the stale host key, then confirm consul DNS resolution on the new node
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  11. Repeat for pgbouncer-sidekiq-ci-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-sidekiq-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  12. Repeat for pgbouncer-sidekiq-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-sidekiq%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  13. Repeat for pgbouncer-registry-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone} 
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  14. Repeat for pgbouncer-ci-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  15. Repeat for pgbouncer-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
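
The six per-node iterations above differ only in the exported variables. As a sanity check before touching a node, a small dry-run helper can print the full command sequence for that node without executing anything. This is an illustrative sketch, not part of the approved plan: the run and node_commands names are hypothetical, and the echoed commands mirror the ones documented in the steps above.

```shell
# Dry-run sketch: print the drain/rebuild sequence for one node without
# executing it. Variables mirror the ones exported in the steps above.
gitlab_env=gprd
gitlab_project=gitlab-production
gitlab_region=us-east1

run() { printf '+ %s\n' "$*"; }   # change to: run() { "$@"; } to execute for real

node_commands() {
  local cluster=$1 i=$2 j=$3 zone=$4
  local remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # drain
  run ssh "$remote_host" 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  run gcloud compute backend-services remove-backend "${gitlab_env}-${cluster}-regional" \
    --instance-group="${gitlab_env}-${cluster}-${zone}" --instance-group-zone="${zone}" \
    --region="${gitlab_region}" --project="${gitlab_project}"
  # stop and deregister
  run ssh "$remote_host" 'sudo shutdown'
  run knife client delete "$remote_host"
  run knife node delete "$remote_host"
  # rebuild on the new boot image
  run gcloud --project "$gitlab_project" compute instances update \
    "${cluster}-${i}-db-${gitlab_env}" --no-deletion-protection --zone="${zone}"
  run tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr "$i" - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out="${cluster}.plan" -target="module.${cluster}"
  run tf apply "${cluster}.plan"
}

node_commands pgbouncer 03 0 us-east1-b
```

Each printed line is prefixed with "+", so the generated sequence can be compared against this plan before any command is actually run.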

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 minutes

  1. Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to restore pre-maintenance pool_size values

  2. Check OS versions on all nodes: knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"

  3. Re-enable chef-client: knife ssh "${chef_filter}" "sudo chef-client-enable"

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 45 minutes

  1. Update pgbouncer pool_size again if https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 has been reverted
  2. Drain connections and stop pgbouncer
  3. Revert terraform MR config-mgmt!4558 to switch boot image back to 16.04
  4. Replace the terraform resource for each instance using tf plan -replace, as in the maintenance steps above (working serially, one node at a time)
  5. Perform a targeted apply for the tainted instance, to re-provision the instance with the old boot image
  6. Execute post-change/validation steps
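
Assuming config-mgmt!4558 has been reverted so the module again references the 16.04 boot image, steps 4 and 5 for a single node reduce to the same targeted replace used in the maintenance plan. The sketch below echoes the commands rather than executing them; the cluster/i/j values are illustrative and must match the node being rolled back.

```shell
# Rollback sketch for one node: echo the targeted terraform replace/apply
# rather than running it. Values here are examples only.
cluster='pgbouncer'; i=03; j=0
echo tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr "$i" - 1)]" \
  -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
  -out="${cluster}.plan" -target="module.${cluster}"
echo tf apply "${cluster}.plan"
```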

Monitoring

Key metrics to observe

  1. Metric: Server Connection Pool Active Connections per Node
    1. Location:
    2. Monitor this panel and related metrics under the pgbouncer Workload section of the dashboard for changes in connection counts as nodes are removed from/restored to service
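
The verification links in the steps above already graph pgbouncer_used_clients in Thanos; for reference, an unencoded form of that query (adapt the type matcher per cluster) is:

```promql
sum(pgbouncer_used_clients{env="gprd", type=~"pgbouncer.*"}) by (fqdn, job)
```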

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels "blocks deployments" and/or "blocks feature-flags" are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.