# CR - GSTG - Upgrade Ubuntu on PGBouncer nodes

## Change Summary
This CR tests the upgrade of the pgbouncer nodes from Ubuntu 16.04 to 20.04, which we will then perform in production. The staging environment is already running 20.04, so it does not actually need to be upgraded, but since the process and pitfalls will be the same as in production, we will run through the process anyway. This is not expected to require downtime: we will drain connections from each pgbouncer node in turn, and shift traffic to the remaining pgbouncer nodes during the upgrade process.
The number of connections from a pgbouncer node to the underlying database is fixed, so removing a node for service reduces the total number of available connections by 1/n, where n is the number of pgbouncer nodes. To account for this, we will adjust the `pool_size` attribute in chef to maintain the current overall pool size, and reset it when maintenance is complete.
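The adjustment works out as follows. This is a small illustrative sketch (the `adjusted_pool_size` helper is not part of the runbook) of the per-node `pool_size` needed so that the remaining n-1 nodes keep the original total capacity:

```shell
# Illustrative helper (not part of the runbook): scale per-node pool_size
# so that (n - 1) remaining nodes provide the same total capacity as n
# nodes at the original size, rounding up.
adjusted_pool_size() {
  local current=$1 n=$2
  # ceil(current * n / (n - 1)) using integer arithmetic
  echo $(( (current * n + n - 2) / (n - 1) ))
}

adjusted_pool_size 50 3   # core/registry pools: 50 -> 75
adjusted_pool_size 33 3   # sidekiq pools: 33 -> 50
```

These match the 50->75 and 33->50 changes applied by the chef-repo merge request in the pre-change steps.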
## Change Details
- Services Impacted - ~"Service::Pgbouncer"
- Change Technician - @devin
- Change Reviewer - @ayeung
- Time tracking - 90 minutes
- Downtime Component - n/a
## Detailed steps for the change

### Pre-Change Steps - steps to be completed before execution of the change

Estimated Time to Complete (mins) - 15 minutes
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Ensure that we are running version > 4.6.7 of the bootstrap module in staging: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4618
- [ ] Set up two terminal sessions in `tmux`, or multiple windows.
- [ ] Change directories to your local `config-mgmt/environments/gstg` directory and make sure terraform commands are working. If your vault-proxy alias requires a separate window, open an additional one for that.

  ```shell
  vault-proxy # or your equivalent local alias
  vault-login # or execute "vault login -method oidc" outside of the recorded call
  git pull
  tf init -upgrade
  ```

- [ ] Set up the temporary working directory and environment for this CR on your workstation

  ```shell
  export chef_filter='roles:gstg-base-db-pgbouncer-pool'
  export gitlab_env=gstg
  export gitlab_project=gitlab-staging-1
  export gitlab_region=us-east1
  export issue_id='7966'
  export workdir="/tmp/${USER}-cr${issue_id}"
  mkdir -p ${workdir}
  cd ${workdir}
  ```
- [ ] Validate the list of target hosts
  - [ ] Check current OS versions

    ```shell
    knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"
    ```

  - [ ] Save a copy of the host list

    ```shell
    knife search node -i "${chef_filter}" | egrep -v 'pgbouncer-[a-zA-Z0-9_]+-test' | sort -u | tee hosts.${gitlab_env}.pgbouncer
    ```

    The sorted output should match the following (note that we will be upgrading these in a different order):

    ```
    15 items found

    pgbouncer-01-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-02-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-03-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-ci-01-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-ci-02-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-ci-03-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-registry-01-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-registry-02-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-registry-03-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-01-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-02-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-03-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-ci-01-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-ci-02-db-gstg.c.gitlab-staging-1.internal
    pgbouncer-sidekiq-ci-03-db-gstg.c.gitlab-staging-1.internal
    ```
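Before proceeding, it can be worth asserting that the saved list really has the expected 15 entries. `check_host_count` below is a hypothetical helper, not part of the runbook:

```shell
# Hypothetical sanity check: count fully-qualified hosts in the saved
# list and compare against the expected total (15 in gstg).
check_host_count() {
  local file=$1 expected=$2
  local count
  count=$(grep -c '\.internal$' "$file")
  [ "$count" -eq "$expected" ]
}

# Example usage:
# check_host_count "hosts.${gitlab_env}.pgbouncer" 15 || echo "host list mismatch"
```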
- [ ] Take a snapshot of the boot disk for each node. These can be used for later reference, or to create images for rollback if necessary.

  ```shell
  for host in $(sed "s/\.c\.$gitlab_project\.internal//" hosts.${gitlab_env}.pgbouncer); do
    echo -e "\nCreating snapshot for ${host}..."
    boot_disk=$(gcloud --project $gitlab_project compute disks list --format=json --filter="name~^${host}\$" | jq -r '.[].selfLink')
    gcloud --project $gitlab_project compute disks snapshot ${boot_disk} \
      --snapshot-names="${host}-${issue_id}-$(date '+%Y%m%d')" \
      --description="Backup snapshot taken prior to executing https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}"
  done
  ```

**TO BE COMPLETED IMMEDIATELY PRIOR TO STARTING MAINTENANCE**
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2505 to update the `pool_size` attribute so the overall connection count for each cluster does not decrease when we remove each node for service (already done in staging, but this will be important for production)
- [ ] Ensure the above change has been applied to all nodes before removing any from service; manually trigger `chef-client` on the target nodes if necessary

  ```shell
  # Core pgbouncer nodes (50->75)
  for node in $(egrep -v 'sidekiq|registry' hosts.${gitlab_env}.pgbouncer); do
    echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production.pool_size')"
  done
  # Registry pgbouncer nodes (50->75)
  for node in $(egrep 'registry' hosts.${gitlab_env}.pgbouncer); do
    echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_registry.pool_size')"
  done
  # Sidekiq pgbouncer nodes (33->50)
  for node in $(egrep 'sidekiq' hosts.${gitlab_env}.pgbouncer); do
    echo "$node: $(knife node attribute get $node 'gitlab-pgbouncer.databases.gitlabhq_production_sidekiq.pool_size')"
  done
  ```

- [ ] Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window

  ```shell
  knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance (OS upgrades); see https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
  ```

- [ ] Merge config-mgmt!3332 to update the boot image to 20.04 (already done in staging, but this will be important for production)
- [ ] Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2619 - chef role change for the `systemd-resolvd` fix (the staging change is just a cleanup; it will be important for production)
### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 30 minutes
- [ ] **pgbouncer-sidekiq-ci-01-db-gstg**
  - [ ] Set/verify target hostname

    ```shell
    export i=01
    export j=1
    export zone='us-east1-c'
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ```

  - [ ] Drain connections and stop pgbouncer
    - [ ] Set pgbouncer to gracefully disconnect open sessions when they become idle

      ```shell
      ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
      ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
      ```

    - [ ] Tell the load balancer to stop sending new connections to this node

      ```shell
      gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional \
        --instance-group=${gitlab_env}-${cluster}-${zone} \
        --instance-group-zone=${zone} \
        --region=${gitlab_region} \
        --project=${gitlab_project}
      ```
    - [ ] Monitor the progress of this command from another terminal session (outside of the primary tmux session), with either or both of the following on the pgbouncer node

      ```shell
      # show client connection details
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'

      # or simply watch the total number of active clients
      ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep -c gitlabhq_production_sidekiq"'
      ```

  - [ ] Shut down the instance

    ```shell
    ssh $remote_host 'sudo shutdown'
    ```

  - [ ] Remove the node from chef

    ```shell
    knife client delete $remote_host ; knife node delete $remote_host
    ```

  - [ ] Disable deletion protection

    ```shell
    gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
    ```

  - [ ] (from the terraform directory) Rebuild the instance via terraform

    ```shell
    tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" \
      -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
      -out=${workdir}/${cluster}.plan -target="module.${cluster}"
    tf apply ${workdir}/${cluster}.plan
    ```

  - [ ] Monitor progress of the rebuild

    ```shell
    gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
    ```

  - [ ] Remove the old hostkey from your `known_hosts` file so you can connect

    ```shell
    ssh-keygen -R $remote_host
    ```

  - [ ] When startup scripts complete, verify that consul connectivity is working

    ```shell
    ssh $remote_host 'dig replica.patroni.service.consul'
    ```
  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-ci-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
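The index variables set at the top of each node's steps follow a fixed convention, which the sketch below (illustrative only, not part of the runbook) summarizes: instance NN uses terraform instance index NN-1, while the instance-group index `j` and zone follow the gstg layout.

```shell
# Illustrative summary of the per-node index conventions used in this CR:
# instance NN -> terraform instance index NN-1; the instance-group index
# j and zone follow the gstg layout (01->c/j=1, 02->d/j=2, 03->b/j=0).
node_indices() {
  case $1 in
    01) echo "tf_index=0 j=1 zone=us-east1-c" ;;
    02) echo "tf_index=1 j=2 zone=us-east1-d" ;;
    03) echo "tf_index=2 j=0 zone=us-east1-b" ;;
  esac
}

node_indices 02   # -> tf_index=1 j=2 zone=us-east1-d
```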
- [ ] **Repeat for pgbouncer-sidekiq-01-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=01
  export j=1
  export zone='us-east1-c'
  export cluster='pgbouncer-sidekiq'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host ; knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-registry-01-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=01
  export j=1
  export zone='us-east1-c'
  export cluster='pgbouncer-registry'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-registry-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-ci-01-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=01
  export j=1
  export zone='us-east1-c'
  export cluster='pgbouncer-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-ci-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-01-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=01
  export j=1
  export zone='us-east1-c'
  export cluster='pgbouncer'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-sidekiq-ci-02-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer-sidekiq-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-ci-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-sidekiq-02-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer-sidekiq'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-registry-02-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer-registry'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-registry-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-ci-02-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-ci-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] **Repeat for pgbouncer-02-db-gstg**

  <details>
  <summary>Details</summary>

  ```shell
  export i=02
  export j=2
  export zone='us-east1-d'
  export cluster='pgbouncer'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project}
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # from workstation
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

  </details>

  - [ ] Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-regional?project=gitlab-staging-1
  - [ ] Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1
  - [ ] Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
  - [ ] Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
Repeat for pgbouncer-sidekiq-ci-03-db-gstg Details
export i=03 export j=0 export zone='us-east1-b' export cluster='pgbouncer-sidekiq-ci' export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal" ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"' ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout' gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional --instance-group=${gitlab_env}-${cluster}-${zone} --instance-group-zone=${zone} --region=${gitlab_region} --project=${gitlab_project} ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"' ssh $remote_host 'sudo shutdown' # from workstation knife client delete $remote_host && knife node delete $remote_host gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} --no-deletion-protection --zone=${zone} tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" -replace="module.${cluster}.google_compute_instance_group.default[$j]" -out=${cluster}.plan -target="module.${cluster}" tf apply ${cluster}.plan gcloud compute --project=${gitlab_project} instances tail-serial-port-output ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1 ssh-keygen -R $remote_host ssh $remote_host 'dig replica.patroni.service.consul'-
Verify new node and backend is added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-ci-regional?project=gitlab-staging-1 -
Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6 -
Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now -
Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main -
Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
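The drain check in the `watch` loop above is just a grep over raw `SHOW CLIENTS;` output. The condition it is watching for can be sketched as a plain pipeline; the `remaining_clients` helper name and the sample rows below are fabricated for illustration, not real pgbouncer output:

```shell
# Count remaining client connections for a given database in SHOW CLIENTS
# output; the node is safe to shut down once this reaches zero.
remaining_clients() {
  grep -c "$1" || true  # grep -c prints 0 but exits 1 on no match
}

# Fabricated sample of pgb-console 'SHOW CLIENTS;' rows:
sample_clients='C | gitlab | gitlabhq_production_sidekiq | active
C | gitlab | gitlabhq_production_sidekiq | idle
C | gitlab | pgbouncer | active'

printf '%s\n' "$sample_clients" | remaining_clients gitlabhq_production_sidekiq
```

Once the count holds at zero across a few `watch` intervals, the `sudo shutdown` step can proceed.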
- Repeat for `pgbouncer-sidekiq-03-db-gstg`:

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-sidekiq'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain off
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node's instance group from the load balancer
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional \
    --instance-group=${gitlab_env}-${cluster}-${zone} \
    --instance-group-zone=${zone} \
    --region=${gitlab_region} \
    --project=${gitlab_project}

  # Wait for remaining client connections to drain, then stop the node
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production_sidekiq"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: delete the chef objects and rebuild the instance
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} \
    --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Follow the rebuild, then clear the stale host key and confirm DNS resolution
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output \
    ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

- Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-sidekiq-regional?project=gitlab-staging-1
- Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- Repeat for `pgbouncer-registry-03-db-gstg`:

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-registry'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain off
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node's instance group from the load balancer
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional \
    --instance-group=${gitlab_env}-${cluster}-${zone} \
    --instance-group-zone=${zone} \
    --region=${gitlab_region} \
    --project=${gitlab_project}

  # Wait for remaining client connections to drain, then stop the node
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_registry"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: delete the chef objects and rebuild the instance
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} \
    --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Follow the rebuild, then clear the stale host key and confirm DNS resolution
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output \
    ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

- Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-registry-regional?project=gitlab-staging-1
- Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-registry%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- Repeat for `pgbouncer-ci-03-db-gstg`:

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer-ci'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain off
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node's instance group from the load balancer
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional \
    --instance-group=${gitlab_env}-${cluster}-${zone} \
    --instance-group-zone=${zone} \
    --region=${gitlab_region} \
    --project=${gitlab_project}

  # Wait for remaining client connections to drain, then stop the node
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: delete the chef objects and rebuild the instance
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} \
    --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Follow the rebuild, then clear the stale host key and confirm DNS resolution
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output \
    ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

- Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-ci-regional?project=gitlab-staging-1
- Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-3h&to=now
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer-ci%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- Repeat for `pgbouncer-03-db-gstg`:

  ```shell
  export i=03
  export j=0
  export zone='us-east1-b'
  export cluster='pgbouncer'
  export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"

  # Set a short idle timeout so existing client connections drain off
  ssh $remote_host 'sudo pgb-console -c "SET client_idle_timeout = 30;"'
  ssh $remote_host 'sudo pgb-console -c "SHOW CONFIG;" | grep client_idle_timeout'

  # Remove the node's instance group from the load balancer
  gcloud compute backend-services remove-backend ${gitlab_env}-${cluster}-regional \
    --instance-group=${gitlab_env}-${cluster}-${zone} \
    --instance-group-zone=${zone} \
    --region=${gitlab_region} \
    --project=${gitlab_project}

  # Wait for remaining client connections to drain, then stop the node
  ssh -t $remote_host 'watch -n 5 "sudo pgb-console -c \"SHOW CLIENTS;\" | grep gitlabhq_production"'
  ssh $remote_host 'sudo shutdown'

  # From the workstation: delete the chef objects and rebuild the instance
  knife client delete $remote_host && knife node delete $remote_host
  gcloud --project $gitlab_project compute instances update ${cluster}-$i-db-${gitlab_env} \
    --no-deletion-protection --zone=${zone}
  tf plan -replace="module.${cluster}.google_compute_instance.default[$(expr $i - 1)]" \
    -replace="module.${cluster}.google_compute_instance_group.default[$j]" \
    -out=${cluster}.plan -target="module.${cluster}"
  tf apply ${cluster}.plan

  # Follow the rebuild, then clear the stale host key and confirm DNS resolution
  gcloud compute --project=${gitlab_project} instances tail-serial-port-output \
    ${cluster}-${i}-db-${gitlab_env} --zone=${zone} --port=1
  ssh-keygen -R $remote_host
  ssh $remote_host 'dig replica.patroni.service.consul'
  ```

- Verify the new node and backend are added to the load balancer in the GCP console: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gstg-pgbouncer-regional?project=gitlab-staging-1
- Verify no increase in error rate here: https://nonprod-log.gitlab.net/goto/a461dbc0-6a09-11ed-9af2-6131f0ee4ce6
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1
- Verify no increase in error rate here: https://dashboards.gitlab.net/d/api-main/api-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&var-stage=main
- Verify pgbouncer client distribution among pgbouncers here: https://thanos.gitlab.net/graph?g0.expr=sum(pgbouncer_used_clients%7Benv%3D%22gstg%22%2Ctype%3D~%22pgbouncer%22%7D)%20by%20(fqdn%2Cjob)&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
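The per-cluster steps above differ only in the `cluster` name, the database grepped for, and the dashboard links. A hypothetical helper (not part of the existing runbook tooling; `pgb_node_vars` and `tf_index` are names introduced here for illustration) could derive the per-node values in one place instead of editing them by hand:

```shell
# Derive the per-node exports used by each "Repeat for" step above.
gitlab_env=gstg
gitlab_project=gitlab-staging-1

pgb_node_vars() {
  local cluster=$1 i=$2
  echo "export cluster='${cluster}'"
  echo "export i=${i}"
  echo "export remote_host=${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
  # terraform instance indices are zero-based while node names are one-based;
  # 10# forces base-10 so a leading zero ("03") is not parsed as octal
  echo "export tf_index=$((10#$i - 1))"
}

pgb_node_vars pgbouncer-sidekiq 03
```

The printed `export` lines could then be pasted (or `eval`ed) into the working session before running the drain and rebuild commands.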
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 15 minutes
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2505 to restore pre-maintenance `pool_size` values
- Check OS versions on all nodes:

  ```shell
  knife ssh "$chef_filter" "grep '^VERSION=' /etc/os-release"
  ```

- Re-enable chef-client:

  ```shell
  knife ssh "${chef_filter}" "sudo chef-client-enable"
  ```
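The `VERSION=` lines from the knife run above can be summarized locally so that any straggler node stands out immediately. A sketch, where the `summarize_versions` helper is illustrative and the sample input is fabricated:

```shell
# Reduce "<fqdn> VERSION=\"...\"" lines to a count per OS release.
summarize_versions() {
  sed -n 's/.*VERSION="\([0-9.]*\).*/\1/p' | sort | uniq -c | sort -rn
}

# Fabricated sample of knife ssh output:
sample='pgbouncer-01-db-gstg VERSION="20.04.5 LTS (Focal Fossa)"
pgbouncer-02-db-gstg VERSION="20.04.5 LTS (Focal Fossa)"
pgbouncer-03-db-gstg VERSION="16.04.7 LTS (Xenial Xerus)"'

printf '%s\n' "$sample" | summarize_versions
```

After the change, every node should report a 20.04 release; any 16.04 line means a node was missed.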
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 45 minutes
- Update pgbouncer `pool_size` again if https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2505 has been reverted
- Drain connections and stop pgbouncer
- Revert terraform MR config-mgmt!3332 to switch the boot image back to 16.04
- Taint the terraform resource for each instance (working serially, following the maintenance plan above)
- Perform a targeted apply for each tainted instance to re-provision it with the old boot image
- Execute the post-change/validation steps
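The `pool_size` adjustment referenced above (and reverted in the post-change steps) follows the 1/n arithmetic in the change summary: with one of n nodes out of service, the remaining n-1 nodes must each carry a proportionally larger per-node pool to keep the total constant. A quick sanity check with illustrative numbers, not the actual gstg values:

```shell
# With n pgbouncer nodes sharing a fixed total server-connection budget,
# draining one node means the remaining n-1 nodes need a larger per-node
# pool_size to keep the overall pool constant.
total_pool=300   # illustrative total server connections for the cluster
nodes=3          # illustrative number of pgbouncer nodes

normal_pool_size=$((total_pool / nodes))             # all nodes in service
maintenance_pool_size=$((total_pool / (nodes - 1)))  # one node drained

echo "normal=${normal_pool_size} maintenance=${maintenance_pool_size}"
```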
Monitoring
Key metrics to observe
- Metric: Server Connection Pool Active Connections per Node
- Location:
- Monitor this panel and related metrics under the `pgbouncer Workload` section of the dashboard for changes in connection counts as nodes are removed from/restored to service
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed (if needed! Cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.