# CR - GPRD - Upgrade Ubuntu on PGBouncer nodes

## Change Summary

Execution in GSTG: #7966 (closed)
This change upgrades the pgbouncer nodes from Ubuntu 16.04 to 20.04. It is not expected to require downtime: we will drain connections from each pgbouncer node in turn, shifting traffic to the remaining pgbouncer nodes while that node is upgraded.

The number of connections from a pgbouncer node to the underlying database is fixed, so removing a node for service reduces the total available connections by 1/n, where n is the number of pgbouncer nodes in the cluster. To account for this, we will raise the pool_size attribute in Chef to maintain the current overall pool size, and reset it back when maintenance is complete.
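To make the adjustment concrete, here is the arithmetic for the core pgbouncer cluster (3 nodes at pool_size 50, the 50->75 bump applied in the pre-change steps):

```shell
# Worked example: the core pgbouncer cluster has 3 nodes at pool_size 50.
n=3                              # nodes in the cluster
old_size=50                      # per-node pool_size before maintenance
total=$((n * old_size))          # 150 connections cluster-wide
new_size=$((total / (n - 1)))    # per-node pool_size while one node is out
echo "raise pool_size from ${old_size} to ${new_size} during maintenance"
```

The same formula gives the 33->50 bump for the sidekiq clusters (99 total across 3 nodes, ~50 each across 2).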
## Change Details
- Services Impacted - ServicePgbouncer
- Change Technician - @devin
- Change Reviewer - @ayeung @rhenchen.gitlab
- Time tracking - 90 minutes
- Downtime Component - n/a
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 15 minutes
- [ ] Set label change::in-progress on this issue
- [ ] Ensure that we are running version > 4.6.7 of the bootstrap module in production: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4652
- [ ] Set up two terminal sessions in tmux, or multiple windows.
- [ ] Change directories to your local config-mgmt/environments/gprd directory and make sure terraform commands are working. If your vault-proxy alias requires a separate window, open an additional one for that.

  ```
  vault-proxy   # or your equivalent local alias
  vault-login   # or execute "vault login -method oidc" outside of the recorded call
  git pull
  tf init -upgrade
  ```

- [ ] Set up the temporary working directory and environment for this CR on your workstation

  ```
  export chef_filter='roles:gprd-base-db-pgbouncer-pool'
  export gitlab_env=gprd
  export gitlab_project=gitlab-production
  export gitlab_region=us-east1
  export issue_id='7967'
  export workdir="/tmp/${USER}-cr${issue_id}"
  mkdir -p ${workdir}
  cd ${workdir}
  ```
- [ ] Validate the list of target hosts
  - [ ] Check current OS versions

    ```
    knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"
    ```

  - [ ] Save a copy of the host list

    ```
    knife search node -i "${chef_filter}" | egrep -v 'pgbouncer-[a-zA-Z0-9_]-test' | sort -u | tee hosts.${gitlab_env}.pgbouncer
    ```

    The sorted output should match the following (note that we will be upgrading these in a different order):

    ```
    15 items found
    pgbouncer-01-db-gprd.c.gitlab-production.internal
    pgbouncer-02-db-gprd.c.gitlab-production.internal
    pgbouncer-03-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-ci-03-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-01-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-02-db-gprd.c.gitlab-production.internal
    pgbouncer-registry-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-03-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-01-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-02-db-gprd.c.gitlab-production.internal
    pgbouncer-sidekiq-ci-03-db-gprd.c.gitlab-production.internal
    ```
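As an optional cross-check (a sketch; the `expected.hosts` filename is hypothetical), the expected list can be regenerated locally from the cluster name and index pattern and diffed against the knife output:

```shell
# Regenerate the expected 15 hostnames from the cluster-name/index pattern,
# then compare against the list saved by knife above.
for c in "" "-ci" "-registry" "-sidekiq" "-sidekiq-ci"; do
  for i in 01 02 03; do
    echo "pgbouncer${c}-${i}-db-gprd.c.gitlab-production.internal"
  done
done | sort -u > expected.hosts
wc -l < expected.hosts    # should report 15
# diff expected.hosts hosts.${gitlab_env}.pgbouncer   # no output means the lists match
```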
- [ ] Take a snapshot of the boot disk for each node. These can be used for later reference, or to create images for rollback if necessary. Paste each of the following sections into a separate terminal to reduce the time it takes to complete.

  ```
  for host in pgbouncer-{01..03}-db-gprd; do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

  ```
  for host in pgbouncer-ci-{01..03}-db-gprd; do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

  ```
  for host in pgbouncer-registry-{01..03}-db-gprd; do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

  ```
  for host in pgbouncer-sidekiq-{01..03}-db-gprd; do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

  ```
  for host in pgbouncer-sidekiq-ci-{01..03}-db-gprd; do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

**TO BE COMPLETED IMMEDIATELY PRIOR TO STARTING MAINTENANCE**
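One sanity check worth noting: GCE resource names are capped at 63 characters, and even the longest snapshot name the loops above generate stays well under that limit. A quick local check (a sketch, using the longest host prefix in this change):

```shell
# Mirror the --snapshot-names value built in the loops above, using the
# longest host prefix, and confirm it fits the 63-character GCE name limit.
issue_id=7967
name="pgbouncer-sidekiq-ci-01-db-gprd-${issue_id}-$(date '+%Y%m%d')"
echo "${name} (${#name} chars)"   # 45 chars, well under the 63-char limit
```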
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to update the pool_size attribute, so the overall connection count for each cluster does not decrease when we remove each node for service
- [ ] Ensure the above change has been applied to all nodes before removing any from service; manually trigger chef-client on target nodes if necessary

  ```
  # TODO: Need to look up and adjust the attributes to match the clusters being upgraded; this list may not be consistent across all environments
  # Core pgbouncer nodes (50->75)
  for node in $(egrep -v 'sidekiq|registry' hosts.${gitlab_env}.pgbouncer); do
    echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')"
  done
  # Registry pgbouncer nodes (50->75)
  for node in $(egrep 'registry' hosts.${gitlab_env}.pgbouncer); do
    echo "$node $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_registry.pool_size')"
  done
  # Sidekiq pgbouncer nodes (33->50)
  for node in $(egrep 'sidekiq' hosts.${gitlab_env}.pgbouncer); do
    echo "$node $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production_sidekiq.pool_size')"
  done
  ```

- [ ] Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window

  ```
  knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance (OS upgrades); see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
  ```

- [ ] Merge config-mgmt!4558 to update the boot image to 20.04
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2620 (chef role change for the systemd-resolved fix)
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes
#### pgbouncer-sidekiq-ci-01-db-gprd

- [ ] Set/verify target hostname

  ```
  export i=01
  export j=1
  export zone='us-east1-c'
  export cluster='pgbouncer-sidekiq-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
  ```

- [ ] Drain connections and stop pgbouncer
  - [ ] Set pgbouncer to gracefully disconnect open sessions when they become idle

    ```
    ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
    ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
    ```

  - [ ] Tell the load balancer to stop sending new connections to this node

    ```
    gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
    ```
- [ ] Monitor the progress of the drain from another terminal session (outside of the primary tmux session), with either/both of the following on the pgbouncer node

  ```
  # show client connection details
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  # or simply watch the total number of active clients
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep -c gitlabhq_production_sidekiq"'
  ```
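If you would rather block until the node is fully drained instead of eyeballing `watch`, a small polling helper could be used. This is a sketch: `wait_for_drain` and the invocation shown in the comment are hypothetical, not part of the runbook tooling.

```shell
# Poll a command that prints the remaining client count; return once it hits 0.
wait_for_drain() {
  while :; do
    count=$("$@")
    echo "clients remaining: ${count}"
    [ "${count}" -eq 0 ] && break
    sleep 5
  done
}

# A real invocation would wrap the SHOW CLIENTS check, e.g.:
# wait_for_drain ssh $remote_host 'sudo pgb-console -c "SHOW CLIENTS;" | grep -c gitlabhq_production_sidekiq'
```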
- [ ] Shut down the instance

  ```
  ssh $remote_host 'sudo shutdown'
  ```

- [ ] Delete the Chef client and node objects for the instance

  ```
  knife client delete $remote_host ; knife node delete $remote_host
  ```

- [ ] Disable deletion protection

  ```
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  ```

- [ ] (from the terraform directory) Rebuild the instance via terraform

  ```
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${workdir}/${cluster}.plan -target="module.${cluster}"
  tf apply ${workdir}/${cluster}.plan
  ```
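A note on the index arithmetic in the plan command, shown as a standalone sketch: terraform's instance list is zero-based while the hostnames are one-based, and `expr` conveniently parses the zero-padded `$i` as a plain integer.

```shell
i=01
j=1
# expr treats "01" as the integer 1, so the zero-based instance index is 0
echo "instance index: $(expr $i - 1), instance group index: $j"
```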
- [ ] Monitor progress of the rebuild

  ```
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ```

- [ ] Remove the old host key from your known_hosts file so you can connect

  ```
  ssh-keygen -R $remote_host
  ```

- [ ] When the startup scripts complete, verify consul connectivity is working

  ```
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-sidekiq-01-db-gprd

Details:

```
export i=01
export j=1
export zone='us-east1-c'
export cluster='pgbouncer-sidekiq'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-registry-01-db-gprd

Details:

```
export i=01
export j=1
export zone='us-east1-c'
export cluster='pgbouncer-registry'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-ci-01-db-gprd

Details:

```
export i=01
export j=1
export zone='us-east1-c'
export cluster='pgbouncer-ci'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-01-db-gprd

Details:

```
export i=01
export j=1
export zone='us-east1-c'
export cluster='pgbouncer'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-sidekiq-ci-02-db-gprd

Details:

```
export i=02
export j=2
export zone='us-east1-d'
export cluster='pgbouncer-sidekiq-ci'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-sidekiq-02-db-gprd

Details:

```
export i=02
export j=2
export zone='us-east1-d'
export cluster='pgbouncer-sidekiq'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-registry-02-db-gprd

Details:

```
export i=02
export j=2
export zone='us-east1-d'
export cluster='pgbouncer-registry'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
#### Repeat for pgbouncer-ci-02-db-gprd

Details:

```
export i=02
export j=2
export zone='us-east1-d'
export cluster='pgbouncer-ci'
export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'

ssh $remote_host 'sudo shutdown'

# from workstation
knife client delete $remote_host ; knife node delete $remote_host
gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}

tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
tf apply ${cluster}.plan

gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1

ssh-keygen -R $remote_host
ssh $remote_host 'dig replica.patroni.service.consul'
```

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
-
-
- [ ] Repeat for pgbouncer-02-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
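The manual `watch` step in each drain can also be scripted. A minimal sketch (the `wait_for_drain` helper and its arguments are illustrative, not an existing tool; in the real drain, the command argument would be the `ssh`/`pgb-console` invocation shown above):

```shell
# Poll a command until it produces no lines matching a pattern, then proceed.
# Real-world usage would look like:
#   wait_for_drain "ssh $remote_host 'sudo pgb-console -c \"SHOW CLIENTS;\"'" gitlabhq_production
wait_for_drain() {
  local cmd=$1 pattern=$2 interval=${3:-5}
  while [ "$(eval "$cmd" | grep -c "$pattern")" -gt 0 ]; do
    echo "clients still connected; sleeping ${interval}s"
    sleep "$interval"
  done
  echo "drained"
}

# Local demonstration with a stand-in command instead of pgb-console:
wait_for_drain "echo 'no clients here'" gitlabhq_production 1  # prints "drained" immediately
```

This mirrors the `watch -n 5` behavior, but returns instead of requiring a human to interrupt it once the client list is empty.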
- [ ] Repeat for pgbouncer-sidekiq-ci-03-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-sidekiq-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://log.gprd.gitlab.net/goto/8ad23b80-7451-11ed-85ed-e7557b0a598c
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] Repeat for pgbouncer-sidekiq-03-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-sidekiq'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-sidekiq-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] Repeat for pgbouncer-registry-03-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-registry'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-registry-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] Repeat for pgbouncer-ci-03-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-ci-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] Repeat for pgbouncer-03-db-gprd

  <details><summary>Details</summary>

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node from the load balancer backend service
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}

  # Wait for clients to drain, then shut the node down
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: remove the node from Chef, then re-provision it
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Watch the rebuild, then verify the new node
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

- [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production
- [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&from=now-3h&to=now
- [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main
- [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gprd%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
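Before restoring `pool_size` in the post-change steps, it may help to sanity-check the pool arithmetic described in the change summary: with `n` nodes each holding a per-node pool of `p` server connections, taking one node out keeps the total constant only if the per-node value is raised to `ceil(n * p / (n - 1))`. A small illustrative helper (the numbers are made up; real values come from the chef role):

```shell
# Compute the adjusted per-node pool_size needed to hold the total pool
# constant while one of n nodes is out of service.
adjusted_pool_size() {
  local n=$1 p=$2
  # ceiling of (n * p) / (n - 1) using integer arithmetic
  echo $(( (n * p + n - 2) / (n - 1) ))
}

adjusted_pool_size 3 100   # -> 150 (3 nodes at 100 each; 2 nodes need 150 each)
```

The revert step below undoes this override once all nodes are back in service.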
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 15 minutes
- [ ] Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 to restore pre-maintenance `pool_size` values
- [ ] Check OS versions on all nodes: `knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"`
- [ ] Re-enable chef-client: `knife ssh "${chef_filter}" "sudo chef-client-enable"`
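As an optional sanity check, the per-node output of the `knife ssh` version check above can be filtered mechanically rather than eyeballed. This is an illustrative sketch, not part of the runbook; it assumes `knife` output of the form `<node> VERSION="…"`, one line per node:

```shell
# Flag any node whose /etc/os-release VERSION line is not a 20.04 release.
# Exits non-zero if at least one node is still on an older release.
check_versions() {
  awk '
    /VERSION=/ {
      split($0, parts, "\"")
      if (parts[2] !~ /^20\.04/) { print $1 " still on " parts[2]; bad = 1 }
    }
    END { exit bad }
  '
}

# Sample input in the shape knife returns (one line per node):
printf '%s\n' \
  'pgbouncer-01-db-gprd VERSION="20.04.5 LTS (Focal Fossa)"' \
  'pgbouncer-02-db-gprd VERSION="16.04.7 LTS (Xenial Xerus)"' \
  | check_versions || echo "at least one node still needs upgrading"
```

In practice this would be fed directly from `knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release" | check_versions`.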
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 45 minutes
- [ ] Update pgbouncer `pool_size` again if https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2506 has been reverted
- [ ] Drain connections and stop pgbouncer
- [ ] Revert terraform MR config-mgmt!4558 to switch the boot image back to 16.04
- [ ] Taint the terraform resource for each instance (working serially, following the maintenance plan above)
- [ ] Perform a targeted apply for each tainted instance, to re-provision it with the old boot image
- [ ] Execute post-change/validation steps
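Note that recent Terraform releases deprecate `terraform taint` in favor of plan-time `-replace` addresses, which is what the forward steps in this issue already use. As a sketch (assuming the same module layout as the `tf plan` commands earlier in this issue, and a hypothetical `replace_args` helper), the addresses for a given cluster and instance can be derived like this:

```shell
# Build the terraform -replace arguments used in the per-node steps above,
# assuming the module layout from this CR:
#   module.<cluster>.google_compute_instance.default[i-1]
#   module.<cluster>.google_compute_instance_group.default[j]
replace_args() {
  local cluster=$1 i=$2 j=$3
  # strip a leading zero so 03 -> 3, matching the "expr $i - 1" arithmetic
  local idx=$(( ${i#0} - 1 ))
  printf -- '-replace=module.%s.google_compute_instance.default[%d] ' "$cluster" "$idx"
  printf -- '-replace=module.%s.google_compute_instance_group.default[%d]\n' "$cluster" "$j"
}

replace_args pgbouncer 03 0
# would then be used as: tf plan $(replace_args pgbouncer 03 0) -out=pgbouncer.plan -target=module.pgbouncer
```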
Monitoring
Key metrics to observe
- Metric: Server Connection Pool Active Connections per Node
- Location:
- Monitor this panel and related metrics under the `pgbouncer Workload` section of the dashboard for changes in connection counts as nodes are removed from and restored to service
Change Reviewer checklist
- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary.
Change Technician checklist
- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed (if needed; cases include DB changes) prior to change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity 1 or severity 2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.