[GSTG] - Upgrade RW PgBouncers to 1.19.1 and increase listen_backlog
Production Change
Change Summary
This change was proposed at https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23988
This change upgrades all Read-Write (Primary) PgBouncers to version 1.19.1 and tunes PgBouncer and kernel settings so that PgBouncer can hold a larger number of connections while PAUSED, as discussed here.
The changes are the following:
- PgBouncer
- version : 1.19.1
- listen_backlog = 8192
- Linux kernel:
- net.core.somaxconn=65535
- net.ipv4.tcp_max_syn_backlog=8192
- net.ipv4.ip_local_port_range=10000 65535
- net.ipv4.tcp_tw_reuse=1
- net.netfilter.nf_conntrack_max=1048576
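A minimal sketch of how the kernel values listed above could be verified on a host. It reads `sysctl`-style `key = value` lines from standard input and reports any key that is missing or differs from the targets in this change. The function name `check_kernel_settings` is hypothetical and not part of the change tooling; the PgBouncer `listen_backlog` value is not covered here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare `key = value` lines on stdin with the kernel
# targets listed in this change. Prints one line per mismatch or missing key
# and returns non-zero if any was found.
check_kernel_settings() {
  awk -F' *= *' '
    BEGIN {
      want["net.core.somaxconn"] = "65535"
      want["net.ipv4.tcp_max_syn_backlog"] = "8192"
      want["net.ipv4.ip_local_port_range"] = "10000 65535"
      want["net.ipv4.tcp_tw_reuse"] = "1"
      want["net.netfilter.nf_conntrack_max"] = "1048576"
    }
    ($1 in want) {
      gsub(/[ \t]+/, " ", $2)   # sysctl separates the two range values with a tab
      seen[$1] = 1
      if ($2 != want[$1]) {
        printf "MISMATCH %s: got %s, want %s\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      for (k in want) if (!(k in seen)) { printf "MISSING %s\n", k; bad = 1 }
      exit bad + 0
    }'
}
```

One possible use, assuming the same SSH access as the rest of this plan: `ssh $remote_host "sudo sysctl -a" | check_kernel_settings`.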
No downtime is required, even though the pgbouncer processes must be restarted to apply the listen_backlog setting: the change is rolled out one VM at a time, and the load balancer should route connections to the remaining available pgbouncer instances, so we do not expect any impact on the application.
We don't expect any performance regression from this change, as we benchmarked both the current/old and the new pgbouncer setups in db-benchmarking:
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/23988#note_1503782653
Change Details
- Services Impacted - ServicepgbouncerCI ServicepgbouncerCI ServicePgbouncerRegistry
- Change Technician - @rhenchen.gitlab
- Change Reviewer - @alexander-sosna @bshah11
- Time tracking - 4 hours
- Downtime Component - no downtime
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 120 minutes
-
Set label change::in-progress: /label ~change::in-progress
-
Silence alerts in pgbouncer fqdn for 120 minutes
-
Establish tmux session on bastion and note details in a comment on this issue
    ssh -A bastion-01-inf-db-gprd.c.gitlab-db-production.internal
    tmux -L 'CR16148'
-
Set up the temporary working directory and environment for this CR on both bastion and workstation
    export chef_filter='roles:gstg-base-db-pgbouncer-pool'
    export gitlab_env=gstg
    export gitlab_project=gitlab-staging-1
    export issue_id='16148'
    export workdir="/tmp/cr${issue_id}"
    # bastion only
    mkdir -p ${workdir}
    cd ${workdir}
-
Edit a file named check.sh and add the following lines:
    echo "--------> $remote_host"
    ssh $remote_host "sudo /usr/local/bin/pgbouncer --version"
    ssh $remote_host "sudo pgb-console -c 'show version;'"
    ssh $remote_host "sudo pgb-console -c 'SHOW CONFIG;' | grep listen_backlog"
    ssh $remote_host "sudo sysctl -a | grep -E 'somaxconn|tcp_max_syn_backlog|ip_local_port_rang|tcp_tw_reuse|nf_conntrack_max|netdev_max_backlog'"
    ssh $remote_host "sudo systemctl status consul | grep opt"
Note: all subsequent steps are executed from $workdir in tmux on the bastion host, unless otherwise specified
-
(from workstation) Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window (from bastion)
    knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance (Upgrade and Tune RW PgBouncers), see: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
-
Apply MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3736
-
pgbouncer-01-db-gstg.c.gitlab-staging-1.internal
-
Set/verify target hostname
    export i=01
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
-
Enable Chef and run chef-client
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
-
Check if new settings were applied
    . ./check.sh
-
Restart pgbouncer and consul services
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
-
Check new pgbouncer settings
    . ./check.sh
-
(from workstation) Confirm that pgbouncer is healthy and receiving connections
- Check Load Balancer Healthy state and PgBouncer Connection Metric:
    export i=01
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
-
Repeat for pgbouncer-ci-01-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=01
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=01
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-embedding-01-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=01
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=01
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-registry-01-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=01
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=01
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-01-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=01
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=01
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-ci-01-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=01
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=01
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-ci-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-embedding-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-registry-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-ci-02-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=02
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=02
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-ci-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-embedding-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer-embedding'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-registry-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer-registry'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer-sidekiq'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-sidekiq-ci-03-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=03
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=03
    export cluster='pgbouncer-sidekiq-ci'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-04-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=04
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=04
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-05-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=05
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=05
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Repeat for pgbouncer-06-db-gstg.c.gitlab-staging-1.internal
(From Tmux) Run chef-client and restart pgbouncer
    export i=06
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    ssh $remote_host "sudo chef-client-enable && sudo chef-client"
    . ./check.sh
    ssh $remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
    . ./check.sh
(From workstation) Check Load Balancer Healthy state and PgBouncer Connection Metric
    export i=06
    export cluster='pgbouncer'
    export remote_host="${cluster}-${i}-db-${gitlab_env}.c.${gitlab_project}.internal"
    open "https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/$gitlab_env-$cluster-regional?project=$gitlab_project"
    open "https://thanos.gitlab.net/graph?g0.expr=sum(rate(pgbouncer_stats_queries_pooled_total%7Btype%3D%22$cluster%22%2C%20environment%3D%22$gitlab_env%22%2Cfqdn%3D%22$remote_host%22%7D%5B30s%5D))%20by%20(fqdn)&g0.tab=0"
-
Set label change::complete: /label ~change::complete
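The per-host steps above follow a fixed order: node 01 of every cluster, then 02, then 03, and finally pgbouncer 04 to 06. A minimal sketch that prints that order, which may help when ticking off hosts or reviewing the plan. The function names `pgb_host` and `rollout_hosts` are hypothetical and not part of the change tooling; the default environment and project are the values exported earlier in this plan.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative only.

# Build the FQDN of a PgBouncer node the same way the change steps do.
pgb_host() {
  local cluster=$1 i=$2
  local environment=${3:-gstg} project=${4:-gitlab-staging-1}
  printf '%s-%s-db-%s.c.%s.internal\n' "$cluster" "$i" "$environment" "$project"
}

# Print every target host in the order used by the change steps:
# node 01 of each cluster, then 02, then 03, then pgbouncer 04 to 06.
rollout_hosts() {
  local i cluster
  for i in 01 02 03; do
    for cluster in pgbouncer pgbouncer-ci pgbouncer-embedding \
                   pgbouncer-registry pgbouncer-sidekiq pgbouncer-sidekiq-ci; do
      pgb_host "$cluster" "$i"
    done
  done
  for i in 04 05 06; do
    pgb_host pgbouncer "$i"
  done
}
```

Running `rollout_hosts` only prints host names; it does not connect to anything, so it is safe to run on a workstation.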
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 120 minutes
-
(from workstation) Disable chef-client on the target nodes to prevent unwanted changes during the maintenance window (from bastion)
    knife ssh "${chef_filter}" "sudo chef-client-disable 'Suppressing chef-client execution during maintenance - Rolling Back CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/${issue_id}'"
-
Revert MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/3736
-
Repeat steps for pgbouncer-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-03-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-04-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-05-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-06-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-ci-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-ci-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-ci-03-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-embedding-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-embedding-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-embedding-03-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-registry-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-registry-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-registry-03-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-03-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-ci-01-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-ci-02-db-gstg.c.gitlab-staging-1.internal
-
Repeat steps for pgbouncer-sidekiq-ci-03-db-gstg.c.gitlab-staging-1.internal
-
Set label change::aborted: /label ~change::aborted
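The rollback list says "Repeat steps" for each host without restating them. A minimal sketch, assuming the per-host rollback sequence mirrors the rollout sequence (enable and run chef-client, check, restart, check) once the MR has been reverted; that assumption is not confirmed by this plan. The function `print_host_steps` is hypothetical and only prints the commands so they can be reviewed before anything is run.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print, without executing, the per-host command
# sequence used in the change steps for the given host.
# Assumption: rollback repeats the same sequence after the MR is reverted.
print_host_steps() {
  local remote_host=$1
  cat <<EOF
export remote_host="$remote_host"
ssh \$remote_host "sudo chef-client-enable && sudo chef-client"
. ./check.sh
ssh \$remote_host "sudo service pgbouncer restart && sudo systemctl restart consul"
. ./check.sh
EOF
}
```

For example, `print_host_steps pgbouncer-ci-01-db-gstg.c.gitlab-staging-1.internal` prints five lines that can be pasted into the tmux session on the bastion.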
Monitoring
Key metrics to observe
MAIN Cluster
-
Metric: rails_primary_sql SLI Apdex - MAIN cluster
- Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&from=now-1h&to=now&viewPanel=2409561530
- What changes to this metric should prompt a rollback: continuously less than 98% for more than 10 minutes
-
Metric: pgbouncer SLI Error Ratio
- Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&from=now-1h&to=now&var-PROMETHEUS_DS=Global&var-environment=gstg&viewPanel=2040732717
- What changes to this metric should prompt a rollback: continuously above 1.5% for more than 10 minutes
CI Cluster
-
Metric: rails_primary_sql SLI Apdex - CI cluster
- Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=2120868752&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: continuously less than 98% for more than 10 minutes
-
Metric: pgbouncer-ci SLI Error Ratio
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&viewPanel=2409561530
- What changes to this metric should prompt a rollback: continuously above 1.5% for more than 10 minutes
Registry Cluster
-
Metric: transactions_primary SLI Error Ratio - Registry cluster
- Location: https://dashboards.gitlab.net/d/patroni-registry-main/patroni-registry-overview?orgId=1&viewPanel=2278903563&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: continuously less than 98% for more than 10 minutes
-
Metric: pgbouncer-registry Service Error Ratio
- Location: https://dashboards.gitlab.net/d/pgbouncer-registry-main/pgbouncer-registry-overview?orgId=1&viewPanel=1127477889&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: continuously above 1.5% for more than 10 minutes
Embedding Cluster
-
Metric: transactions_primary SLI Error Ratio - Embedding cluster
- Location: https://dashboards.gitlab.net/d/patroni-embedding-main/patroni-embedding-overview?orgId=1&viewPanel=2278903563&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: continuously less than 98% for more than 10 minutes
-
Metric: pgbouncer-embedding Service Error Ratio
- Location: https://dashboards.gitlab.net/d/pgbouncer-embedding-main/pgbouncer-embedding-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg&viewPanel=3121727356
- What changes to this metric should prompt a rollback: continuously above 1.5% for more than 10 minutes
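The rollback thresholds above share one shape: a value that stays on the wrong side of a limit for more than 10 minutes. A minimal sketch of that rule, assuming one sample per minute is supplied on standard input. The sampling interval, the input format and the function name `should_rollback` are assumptions for illustration and are not provided by the dashboards.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read one numeric sample per line (assumed one per minute)
# and succeed (exit 0) if the value breached the limit for more than `minutes`
# consecutive samples.
#   $1: "below" (e.g. Apdex under 98) or "above" (e.g. error ratio over 1.5)
#   $2: limit in percent
#   $3: number of consecutive minutes that must be exceeded (default 10)
should_rollback() {
  awk -v mode="$1" -v limit="$2" -v minutes="${3:-10}" '
    {
      value = $1 + 0
      breach = (mode == "below") ? (value < limit + 0) : (value > limit + 0)
      run = breach ? run + 1 : 0      # length of the current breaching streak
      if (run > minutes + 0) found = 1
    }
    END { exit found ? 0 : 1 }'
}
```

For example, feeding eleven consecutive Apdex samples of 97.5 to `should_rollback below 98` succeeds, while a streak interrupted by a healthy sample does not. The decision to roll back still rests with the Change Technician; this only restates the written thresholds.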
Change Reviewer checklist
-
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
-
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.
Change Technician checklist
-
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed prior to any C1, C2, or blocks deployments change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.