CR - GPRD - Upgrade Ubuntu on PGBouncer nodes

Change Summary

Execution in GSTG: #7966 (closed)

This will upgrade the pgbouncer nodes from Ubuntu 16.04 to 20.04. It is not expected to require downtime: we will drain connections from each pgbouncer node in turn, shifting traffic to the remaining pgbouncer nodes during the upgrade.

The number of connections from a pgbouncer node to the underlying database is fixed, so removing a node for service reduces the total available connections by 1/n, where n is the number of pgbouncer nodes. To account for this, we will adjust the pool_size attribute in Chef to maintain the current overall pool size, and reset it when the maintenance is complete.
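As a worked example of that math (the `adjusted_pool_size` helper below is hypothetical, not part of the CR tooling): the adjusted per-node pool_size for an n-node cluster with one node out of service is ceil(pool_size * n / (n - 1)), which is where the 50->75 and 33->50 values used later come from.

```shell
# Hypothetical helper illustrating the pool_size arithmetic.
# total connections = pool_size * n; with one node out of service,
# each remaining node must carry total / (n - 1), rounded up.
adjusted_pool_size() {
  local pool_size=$1 n=$2
  # integer ceiling of (pool_size * n) / (n - 1)
  echo $(( (pool_size * n + n - 2) / (n - 1) ))
}

adjusted_pool_size 50 3   # core/registry clusters: 50 -> 75
adjusted_pool_size 33 3   # sidekiq clusters: 33 -> 50 (49.5 rounded up)
```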

Change Details

  1. Services Impacted - ServicePgbouncer
  2. Change Technician - @devin
  3. Change Reviewer - @ayeung @rhenchen.gitlab
  4. Time tracking - 90 minutes
  5. Downtime Component - n/a

Detailed steps for the change

Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 15 minutes

  1. Set label change::in-progress on this issue

  2. Ensure that we are running a version of the bootstrap module newer than 4.6.7 in production: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4652

  3. Set up two terminal sessions in tmux or multiple windows.

    1. Change directories to your local config-mgmt/environments/gprd directory and make sure terraform commands are working. If your vault-proxy alias requires a separate window, open an additional one for that.

      vault-proxy # or your equivalent local alias
      vault-login # or execute "vault login -method oidc" outside of the recorded call
      git pull
      tf init -upgrade
    2. Set up the temporary working directory and environment for this CR on your workstation

      export chef_filter='roles:gprd-base-db-pgbouncer-pool'
      export gitlab_env=gprd
      export gitlab_project=gitlab-production
      export gitlab_region=us-east1
      export issue_id='7967'
      export workdir="/tmp/${USER}-cr${issue_id}"
      mkdir -p ${workdir}
      cd ${workdir}
  4. Validate list of target hosts

    1. Check current OS versions

      knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"
    2. Save a copy of the host list

      knife search node -i "${chef_filter}" | egrep -v 'pgbouncer-[a-zA-Z0-9_]+-test' | sort -u | tee hosts.${gitlab_env}.pgbouncer

    The sorted output should match the list below (note that we will be upgrading these in a different order):

    15 items found
    
    pgbouncer-01-db-gprd.c.gitlab-production.internal
    pgbouncer-02-db-gprd.c.gitlab-production.internal
    pgbouncer-03-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-03-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-01-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-02-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-03-db-gprd.c.gitlab-production.internal
  5. Take a snapshot of the boot disk for each node. These can be used for later reference, or to create images for rollback if necessary. Paste each of the following sections into a separate terminal to reduce the overall completion time.

    for host in pgbouncer-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-ci-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-registry-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-sidekiq-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
    for host in pgbouncer-sidekiq-ci-{01..03}-db-gprd; do
      echo -e "\nCreating snapshot for ${host}..."
      boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
      gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
    done
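    For review, the five loops above can also be collapsed into a single loop over the cluster prefixes. This is a sketch: with the default DRY_RUN=1 it only prints what it would do, and it only issues the gcloud commands when DRY_RUN=0 (the one-loop-per-terminal layout above remains the faster option for execution).

    ```shell
    # Consolidated form of the five snapshot loops above (review sketch).
    snapshot_all_pgbouncer_disks() {
      local cluster host snap boot_disk
      for cluster in pgbouncer pgbouncer-ci pgbouncer-registry pgbouncer-sidekiq pgbouncer-sidekiq-ci; do
        for host in "${cluster}"-0{1,2,3}-db-gprd; do
          snap="${host}-${issue_id}-$(date '+%Y%m%d')"
          if [ "${DRY_RUN:-1}" = 1 ]; then
            # dry run: print the plan only
            echo "would snapshot ${host} as ${snap}"
            continue
          fi
          boot_disk=$(gcloud --project "$gitlab_project" compute disks list --format=json \
            --filter="name~^${host}\$" | jq -r '.[].selfLink')
          gcloud --project "$gitlab_project" compute disks snapshot "${boot_disk}" \
            --snapshot-names="${snap}" \
            --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
        done
      done
    }
    snapshot_all_pgbouncer_disks
    ```

    The dry-run output should list exactly the 15 hosts saved in hosts.${gitlab_env}.pgbouncer.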

    TO BE COMPLETED IMMEDIATELY PRIOR TO STARTING MAINTENANCE

  6. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to update the pool_size attribute so the overall connection count for each cluster does not decrease when we remove each node for service

  7. Ensure the above change has been applied to all nodes before removing any from service; manually trigger chef-client on target nodes if necessary

    # TODO: Look up and adjust the attributes to match the clusters being upgraded; this list may not be consistent across all environments
    
    # Core pgbouncer nodes (50->75)
    for node in $(egrep -v 'sidekiq|registry' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')"
    done
    
    # Registry pgbouncer nodes (50->75)
    for node in $(egrep 'registry' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_registry.pool_size')"
    done
    
    # Sidekiq pgbouncer nodes (33->50)
    for node in $(egrep 'sidekiq' hosts.${gitlab_env}.pgbouncer); do
      echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production_sidekiq.pool_size')"
    done
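    To make mismatches stand out in the loop output (for example, a node that has not yet picked up the merged change), a small helper can wrap each check. `check_pool_size` is hypothetical, not an existing command:

    ```shell
    # Hypothetical helper: flag nodes whose reported pool_size does not yet
    # match the expected post-merge value.
    check_pool_size() {
      local node=$1 got=$2 want=$3
      if [ "$got" = "$want" ]; then
        echo "OK    $node pool_size=$got"
      else
        echo "STALE $node pool_size=$got (expected $want)"
        return 1
      fi
    }

    # e.g. inside the core-cluster loop above:
    #   check_pool_size "$node" "$(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')" 75
    ```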
  8. Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window

    knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance (OS upgrades); see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
  9. Merge config-mgmt!4558 to update the boot image to 20.04

  10. Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2620 chef role change for the systemd-resolved fix

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes

  1. pgbouncer-sidekiq-ci-01-db-gprd

    1. Set/verify target hostname

      export i=01
      export j=1
      export zone='us-east1-c'
      export cluster='pgbouncer-sidekiq-ci'
      export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    2. Drain connections and stop pgbouncer

      1. Set pgbouncer to gracefully disconnect open sessions when they become idle

        ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
        ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
      2. Tell the load balancer to stop sending new connections to this node

        gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    3. Monitor the progress of the drain from another terminal session (outside of the primary tmux session), using either or both of the following on the pgbouncer node

      # show client connection details
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
      # or simply watch the total number of active clients
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep -c gitlabhq_production_sidekiq"'
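      To avoid watching manually, the drain can also be polled with a small helper. `wait_for_drain` is hypothetical; pass it a timeout and the counting pipeline above as the command:

      ```shell
      # Hypothetical helper: poll a client-count command every 5s until it
      # reports 0 (or its grep -c exits non-zero on zero matches), giving up
      # after timeout_s seconds.
      wait_for_drain() {
        local timeout_s=$1; shift
        local waited=0 count
        while count=$("$@") && [ "$count" -gt 0 ]; do
          [ "$waited" -ge "$timeout_s" ] && return 1
          sleep 5
          waited=$((waited + 5))
        done
        return 0
      }

      # e.g.:
      # wait_for_drain 600 ssh $remote_host \
      #   'sudo pgb-console -c "SHOW CLIENTS;" | grep -c gitlabhq_production_sidekiq' \
      #   && echo drained
      ```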
    4. Shut down the instance

      ssh $remote_host 'sudo shutdown'
    5. Remove the old node from Chef (terraform will destroy and recreate the instance itself in the steps below)

      knife client delete $remote_host ; knife node delete $remote_host
    6. Disable deletion protection

      gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    7. (from the terraform directory) Rebuild the instance via terraform

      tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${workdir}/${cluster}.plan -target="module.${cluster}"
      tf apply ${workdir}/${cluster}.plan
    8. Monitor progress of rebuild

      gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
    9. Remove the old host key from your known_hosts file so you can connect

      ssh-keygen -R $remote_host
    10. When the startup scripts complete, verify that Consul connectivity is working

      ssh $remote_host 'dig replica.patroni.service.consul'
    11. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production

    12. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c

    13. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now

    14. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main

    15. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
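    The "Repeat for ..." sections below follow this exact sequence with different exports. As a cross-check, this print-only sketch emits the expected command list for a given node so each section can be eyeballed against generated output; nothing here touches production. (The `$((10#$i - 1))` form is shell arithmetic equivalent to the `expr $i - 1` used below, made explicit about the leading-zero index.)

    ```shell
    # Print-only sketch of the per-node command sequence above.
    # Assumes gitlab_env, gitlab_project and gitlab_region are exported
    # as in the pre-change steps.
    print_node_runbook() {
      local cluster=$1 i=$2 j=$3 zone=$4
      local remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
      echo "ssh $remote_host 'sudo pgb-console -c \"SET client_idle_timeout = 30;\"'"
      echo "gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}"
      echo "ssh $remote_host 'sudo shutdown'"
      echo "knife client delete $remote_host ; knife node delete $remote_host"
      echo "gcloud --project $gitlab_project compute instances update ${cluster}-${i}-db-${gitlab_env} --no-deletion-protection --zone=${zone}"
      echo "tf plan -replace=\"module.${cluster}.google_compute_instance.default[$((10#$i - 1))]\" -replace=\"module.${cluster}.google_compute_instance_group.default[$j]\" -out=${cluster}.plan -target=\"module.${cluster}\""
      echo "tf apply ${cluster}.plan"
      echo "ssh-keygen -R $remote_host"
      echo "ssh $remote_host 'dig replica.patroni.service.consul'"
    }

    # e.g.: print_node_runbook pgbouncer-sidekiq 01 1 us-east1-c
    ```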

  2. Repeat for pgbouncer-sidekiq-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  3. Repeat for pgbouncer-registry-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone} 
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  4. Repeat for pgbouncer-ci-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  5. Repeat for pgbouncer-01-db-gprd

    Details

    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  6. Repeat for pgbouncer-sidekiq-ci-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  7. Repeat for pgbouncer-sidekiq-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  8. Repeat for pgbouncer-registry-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  9. Repeat for pgbouncer-ci-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify that the new node has been added as a backend to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  10. Repeat for pgbouncer-02-db-gprd

    Details

    export i=02
    export j=2
    export zone='us-east1-d'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    # encourage existing clients to disconnect, and confirm the setting took effect
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    # remove the node from the load balancer so no new clients arrive
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    # watch until the remaining client connections have drained
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    # stop the node once drained
    ssh $remote_host 'sudo shutdown'
    
    # from workstation: deregister the node from chef
    knife client delete $remote_host ; knife node delete $remote_host
    
    # allow terraform to delete and re-create the instance
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    # re-provision the instance (and its instance group) on the new boot image
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    # follow the serial console output while the new node boots and runs chef
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
    
    # clear the stale host key, then confirm consul DNS resolution on the new node
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  11. Repeat for pgbouncer-sidekiq-ci-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-sidekiq-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  12. Repeat for pgbouncer-sidekiq-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-sidekiq%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  13. Repeat for pgbouncer-registry-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone} 
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  14. Repeat for pgbouncer-ci-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
  15. Repeat for pgbouncer-03-db-gprd

    Details

    export i=03
    export j=0
    export zone='us-east1-b'
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    
    ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
    
    ssh $remote_host 'sudo shutdown'
    
    # from workstation
    knife client delete $remote_host ; knife node delete $remote_host
    
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
    tf apply ${cluster}.plan
    
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 
    
    ssh-keygen -R $remote_host
    
    ssh $remote_host 'dig replica.patroni.service.consul'
    1. Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
    2. Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
    3. Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
    4. Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
    5. Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
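
The six per-node iterations above differ only in the exported variables. As a sanity check before touching a node, a small dry-run helper can print the full command sequence for that node without executing anything. This is an illustrative sketch, not part of the approved plan: the run and node_commands names are hypothetical, and the echoed commands mirror the ones documented in the steps above.

```shell
# Dry-run sketch: print the drain/rebuild sequence for one node without
# executing it. Variables mirror the ones exported in the steps above.
gitlab_env=gprd
gitlab_project=gitlab-production
gitlab_region=us-east1

run() { printf '+ %s\n' "$*"; }   # change to: run() { "$@"; } to execute for real

node_commands() {
  local cluster=$1 i=$2 j=$3 zone=$4
  local remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # drain
  run ssh "$remote_host" 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  run gcloud compute backend-services remove-backend "${gitlab_env}-${cluster}-regional" \
    --instance-group="${gitlab_env}-${cluster}-${zone}" --instance-group-zone="${zone}" \
    --region="${gitlab_region}" --project="${gitlab_project}"
  # stop and deregister
  run ssh "$remote_host" 'sudo shutdown'
  run knife client delete "$remote_host"
  run knife node delete "$remote_host"
  # rebuild on the new boot image
  run gcloud --project "$gitlab_project" compute instances update \
    "${cluster}-${i}-db-${gitlab_env}" --no-deletion-protection --zone="${zone}"
  run tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr "$i" - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out="${cluster}.plan" -target="module.${cluster}"
  run tf apply "${cluster}.plan"
}

node_commands pgbouncer 03 0 us-east1-b
```

Each printed line is prefixed with "+", so the generated sequence can be compared against this plan before any command is actually run.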

Post-Change Steps - steps to take to verify the change

Estimated Time to Complete (mins) - 15 minutes

  1. Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to restore pre-maintenance pool_size values

  2. Check OS versions on all nodes: knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"

  3. Re-enable chef-client: knife ssh "${chef_filter}" "sudo chef-client-enable"

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 45 minutes

  1. Update pgbouncer pool_size again if https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 has been reverted
  2. Drain connections and stop pgbouncer
  3. Revert terraform MR config-mgmt!4558 to switch boot image back to 16.04
  4. Replace the terraform resource for each instance using tf plan -replace, as in the maintenance steps above (working serially, one node at a time)
  5. Perform a targeted apply for the tainted instance, to re-provision the instance with the old boot image
  6. Execute post-change/validation steps
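
Assuming config-mgmt!4558 has been reverted so the module again references the 16.04 boot image, steps 4 and 5 for a single node reduce to the same targeted replace used in the maintenance plan. The sketch below echoes the commands rather than executing them; the cluster/i/j values are illustrative and must match the node being rolled back.

```shell
# Rollback sketch for one node: echo the targeted terraform replace/apply
# rather than running it. Values here are examples only.
cluster='pgbouncer'; i=03; j=0
echo tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr "$i" - 1)]" \
  -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
  -out="${cluster}.plan" -target="module.${cluster}"
echo tf apply "${cluster}.plan"
```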

Monitoring

Key metrics to observe

  1. Metric: Server Connection Pool Active Connections per Node
    1. Location:
    2. Monitor this panel and related metrics under the pgbouncer Workload section of the dashboard for changes in connection counts as nodes are removed from/restored to service
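
The verification links in the steps above already graph pgbouncer_used_clients in Thanos; for reference, an unencoded form of that query (adapt the type matcher per cluster) is:

```promql
sum(pgbouncer_used_clients{env="gprd", type=~"pgbouncer.*"}) by (fqdn, job)
```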

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels "blocks deployments" and/or "blocks feature-flags" are applied as necessary.

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • The change execution window respects the Production Change Lock periods.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity1 or severity2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.