[GPRD] Rollout decomposed CI "Phase 4" pgbouncer migration
Production Change
Change Summary
Equivalent staging change at #6370 (closed)
In Phase 4 of our CI decomposition we change the Rails application to start using a new connection for read-write queries. This new read-write connection will point to a new set of PGBouncer hosts (which we call PGBouncer CI). These PGBouncer hosts, however, will still point to the main Patroni cluster, as we are not yet ready to fully decompose the CI database.
This step effectively gets us to the point where the application behaves as if it is reading and writing two independent databases. They just happen to still be the same database, which reduces risk considerably: there is no possibility of split-brain, and we can easily revert if the application runs into bugs with two separate connections.
Prior to this we are in Phase 3, where GitLab only uses the new CI Patroni cluster for "read-only" queries, which go through a separate codepath in GitLab that handles known delayed replicas.
One additional complexity in this step of the rollout is migrating connections from the old PGBouncer hosts to the new PGBouncer hosts without exceeding the total limit of connections to the primary. You can read more in gitlab-org/gitlab#347203 (closed) and the comments in this issue, but it was determined that we can safely do this during low-usage hours by migrating only small percentages of connections at a time. That is why this change request only goes up to 15% of traffic and 5 connections per PGBouncer host. After that we plan to use the data from this smaller increment to figure out the next safe increment. On staging it will likely be easy to jump straight to 100% because usage is low, but on production we will need to be more careful with increment size.
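The arithmetic behind that increment sizing, as a rough sketch (the numbers come from this plan: 6 CI PGBouncer hosts, 5 connections per host, and the 400-connection threshold on the primary; the current-connections figure below is a placeholder to be replaced with the live value from the PostgreSQL overview dashboard):

```shell
# Back-of-the-envelope headroom check for one increment (sketch, not a tool we run).
ci_hosts=6                        # CI PGBouncer hosts in the new fleet
per_host_pool=5                   # pool_size per host at the end of this CR
primary_limit=400                 # connection threshold we must stay under
current_primary_connections=320   # placeholder: read the real value off the dashboard

new_ci_connections=$((ci_hosts * per_host_pool))
headroom=$((primary_limit - current_primary_connections))
echo "This increment can open up to ${new_ci_connections} new primary connections"
echo "Headroom before the ${primary_limit}-connection threshold: ${headroom}"
```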
Change Details
- Services Impacted - Service::Pgbouncer, Service::API, Service::Web, Service::Postgres, Database
- Change Technician - @rhenchen.gitlab @DylanGriffith
- Change Reviewer -
- Time tracking - 5 days
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - ~~10 minutes~~ at least 2 hours (due to rails config upgrade, step 5)
- Do all the below steps on staging or some other environment with load to verify PGBouncer behaves sensibly (i.e. it closes connections in a reasonably timely manner)
- Set label ~"change::in-progress" on this issue
- Remove the `_ci` suffix from the Sidekiq CI PGBouncer pool name
- Set pool size limit of new PGBouncer hosts to `1` for Web/API and Sidekiq
  - MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1601 (same MR as previous step)
- [optional] clone https://gitlab.com/rhenchen.gitlab/rhenchen/-/tree/main/scripts and get familiar with the `ssh_cluster_regex.sh` script (a rough sketch of its likely behaviour follows this list)
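For anyone reviewing this plan without cloning the scripts repository, `ssh_cluster_regex.sh` is assumed to behave roughly like the sketch below: resolve the hosts whose names match a regex and run the given command on each of them over SSH. The `knife node list` inventory call and the per-host header format are assumptions; refer to the actual script for the authoritative behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of: ssh_cluster_regex.sh <host-regex> "<command>"
# Assumption: the Chef inventory (knife node list) returns one FQDN per line.
set -euo pipefail

regex="$1"
cmd="$2"

for host in $(knife node list | grep -E "$regex"); do
  echo "=== ${host} ==="
  ssh "$host" -- "$cmd"
done
```

Every usage in this plan passes a host regex such as `"(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd"` plus a quoted command to run on each node.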
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
- During quiet working hours (~08:00 UTC)
- Set the primary host for `ci` in `config/database.yml` to the new CI PGBouncer as `host` for the `ci` connection, override the value on the console node so that it still points directly at the Patroni primary, and set `database_tasks: false` for the `ci` configuration
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1604
    - Corresponds to staging MRs https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1548 and https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1581 and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (canary): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1666 (merged)
    - Corresponds to staging MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1604 (merged) and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (gprd): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1667 (merged)
    - Corresponds to staging MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1605 (merged) and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1604
- Confirm which host is the Patroni Writer
  - Primary Host:
- Confirm the number of connections is below the threshold of 400 connections
  - https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&viewPanel=17&var-prometheus=Global&var-environment=gprd&var-type=patroni
  - https://thanos-query.ops.gitlab.net/graph?g0.expr=pg_stat_database_numbackends%7Benv%3D%22gprd%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- Enable 1% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --random`
- Set back to 0%: we discovered a TCP load balancer misconfiguration (#6440 (comment 912153473)), and fixing it requires removing and re-adding the TCP load balancer, which would cause errors for 1% of traffic while it is ongoing, so we disable the feature flag while doing this
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 0 --random --ignore-production-check`
- Fix the TCP load balancer configuration via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3658, as discovered in #6440 (comment 912153473)
- Set pool size limit of new PGBouncer hosts to `2` for Web/API and Sidekiq
- Enable 1% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --random`
- Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 6 connections across all 6 hosts (see the counting sketch after this list)
  - Dashboard: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - Execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
- WAIT 24hrs to be sure nothing is wrong
- At non-peak hours
- Set pool size limit of new PGBouncer hosts to `5` for Web/API and Sidekiq
- Re-run chef on all PGBouncer CI hosts
  - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
  - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Enable 15% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 15 --random`
- Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 30 connections across all 6 hosts (5 connections per node)
- Set `reserve_pool=0` for Main PGBouncer hosts if they have a reserve pool higher than 0 (zero)
  - Execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` to check the `reserve_pool` size on all nodes
  - If `reserve_pool` > 0 then add `reserve_pool=0` to the pools setup (same Chef roles as in the next step)
- Decrease total connections on Main PGBouncer hosts by 5 (assuming the data suggests this is fine)
- Re-run chef on all Main PGBouncer hosts
  - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
  - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
- Use metrics from the above steps to figure out the next increment sizes (how much % to enable and how many connections to move)
- Push to 50% CI Workload
  - At non-peak hours
  - Contact @sre-oncall
  - Set label ~"change::in-progress" on this issue
  - Increase pool size limit of CI PGBouncer hosts
  - Re-run chef on all PGBouncer CI hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Enable 50% of CI related queries to go to new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 50 --random --ignore-production-check`
  - Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres
  - Decrease total connections on Main PGBouncer hosts
  - Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
- Rebalance to 50% CI Workload
  - Increase pool size limit of CI PGBouncer hosts, decrease Main Web/API pool
  - Re-run chef on all PGBouncer hosts (CI and Main)
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-|pgbouncer-sidekiq-).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-|pgbouncer-sidekiq-).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Push to 100% CI Workload
  - Increase the `gitlab` user connection limit in Postgres so we have a buffer in case we need to increase a pool size during some peak saturation
  - At non-peak hours
  - Contact @sre-oncall
  - Set label ~"change::in-progress" on this issue
  - Increase pool size limit of CI PGBouncer hosts
  - Re-run chef on all PGBouncer CI hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Enable 100% of CI related queries to go to new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 100 --random --ignore-production-check`
  - Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres
  - Decrease total connections on Main PGBouncer hosts
  - Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
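For the "should not exceed N connections" observations above, a rough count can be taken straight from `SHOW SERVERS` output instead of eyeballing it. This is only a sketch: it assumes `pgb-console` prints a psql-style table in which each server-connection row contains the `gitlab` user name surrounded by spaces.

```shell
# Count backend (server-side) connections opened as the `gitlab` user across
# the CI PGBouncer fleet; compare the total against the limit for the current
# increment (6 at 1%, 30 at 15%, per the steps above).
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -c ' gitlab '
```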
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 0 minutes
- After bumping the workload you can validate the client and server connections into the PGBouncers:
  - To validate connections from PGBouncer into the database:
    - For CI execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - For MAIN execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - Mind only `gitlab` user connections (should not be more than `pool_size`); a per-host tally sketch follows this list
  - To validate connections coming from the application into PGBouncer:
    - For CI execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW CLIENTS;\""`
    - Mind only `gitlab` user connections
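To compare the `gitlab` user connections against `pool_size` per host rather than fleet-wide, something like the following awk tally can help. It is a sketch that assumes the wrapper prints a per-host header line (as in the `ssh_cluster_regex.sh` sketch under the pre-change steps); adjust the parsing to however your output is actually delimited.

```shell
# Tally `gitlab` server connections per PGBouncer host and print host + count,
# so each host can be checked against its configured pool_size.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" \
  | awk '/^=== /{host=$2} / gitlab /{count[host]++} END{for (h in count) print h, count[h]}'
```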
Rollback
(1) Consider just increasing the pool size of the saturated PGBouncer
If only a single PGBouncer pool is saturated, then consider first just increasing the pool size for the saturated PGBouncer configuration. We have increased the user limit to 360 across all PGBouncer pools, so just ensure that the sum across all pools does not exceed this (a summing sketch follows the list below).
To do this:
- Consider the following metrics:
  - PGBouncer CI Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
  - PGBouncer Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
- Observe where there is higher saturation and compare `Active Backend Server Connections` to the total allowed per pool. Since there are 3 hosts for each pool you need to multiply the following configuration values by 3 to see the total allowed:
- Make a merge request to increase the pools across the hosts to reduce saturation
- Merge the merge request
- Run `chef-client` to force updating on all the affected PGBouncer hosts
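To sanity-check the 360-connection user limit mentioned above before merging a pool-size increase, the configured `pool_size` values can be summed across every pool on every PGBouncer host. This is a sketch: the awk field number assumes `pool_size` is the 6th `|`-separated column of `SHOW DATABASES` output on our PgBouncer version, so confirm against the header row before trusting the total.

```shell
# Sum configured pool_size across all pools on all PGBouncer hosts and compare
# the total against the gitlab user connection limit (360).
ssh_cluster_regex.sh "pgbouncer.*gprd" \
  "sudo pgb-console -c \"SHOW DATABASES;\"" \
  | awk -F'|' '$6 ~ /^[ 0-9]+$/ {sum += $6} END {print "sum of pool_size:", sum}'
```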
(2) Consider rebalancing connection pools first
If the problem prompting a rollback is connection pool saturation on one pool of PGBouncer hosts, then consider instead rebalancing some connections from underutilized pools. The biggest challenge with this rollout has been predicting ahead of time how to re-allocate connections from one pool to another while migrating traffic across the pools, so consider rebalancing if we got some of our estimates wrong.
To do this:
- Consider the following metrics:
  - PGBouncer CI Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
  - PGBouncer Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
- Observe where there is lower saturation and compare `Active Backend Server Connections` to the total allowed per pool. Since there are 3 hosts for each pool you need to multiply the following configuration values by 3 to see the total allowed:
- Make a merge request to re-balance the pools across the hosts to reduce saturation
- Merge the merge request
- Run `chef-client` to force updating on all the affected PGBouncer hosts
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 20 minutes
If none of the pool size adjustments above resolve the problem, then you can roll back using the procedure below.
This rollback process should be appropriate if there is some bug with our new CI PGBouncer hosts, and it brings us back to the state we were in before starting this change request. It may not be the best option if we somehow find ourselves saturating connections to the primary Postgres instance, because the process below increases total connection limits before decreasing them. Our analysis suggests such an event is not possible, so the process below should usually work. If for some reason we did saturate the primary Postgres server connection limit, we could instead reorder the same steps: first disable the feature flag, then decrease connection limits on the CI PGBouncer pool, and finally increase connection limits on the Main PGBouncer pool.
- Increase total connections on old PGBouncer hosts by reverting https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1617
- Force Chef client to apply the changes
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo chef-client"`
- Check pool sizing for all pgbouncer nodes in gprd
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Disable the feature flag: `/chatops run feature set force_no_sharing_primary_model 0 --random` (a pool drain-check sketch follows these steps)
- Decrease the connection pool sizes on PGBouncer CI by reverting https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1654
- Force Chef client to apply the changes
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo chef-client"`
- Check pool sizing for all pgbouncer nodes in gprd
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
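After the rollback, the CI PGBouncer pools should drain on their own once the feature flag is back at 0. A quick way to watch this is via `SHOW POOLS` (a standard PgBouncer admin command); this sketch just narrows the output to the `gitlab` user's pools, whose `cl_active`/`sv_active` columns should trend towards zero within a few minutes.

```shell
# Watch the CI PGBouncer pools drain after disabling the feature flag.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW POOLS;\"" | grep ' gitlab '
```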
Monitoring
Key metrics to observe
- Metric: Primary Total Connections & Activity Total
  - Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-prometheus=Global&var-environment=gprd&var-type=patroni
  - What changes to this metric should prompt a rollback: Exceeding 400 Primary Total Connections (an ad-hoc query sketch for this metric follows this list)
- Metric: Sentry Errors
  - Location: https://sentry.gitlab.net/gitlab/gitlabcom/
  - What changes to this metric should prompt a rollback: New errors likely related to this change (timing and related to database connections)
- PGBouncer Main
  - Metric: PGBouncer Main Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=2040732717&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer Main Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=55&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
- PGBouncer CI
  - Metric: PGBouncer CI Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=2120868752&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer CI Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=541&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
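If Grafana is unavailable, the primary-connections metric can also be queried ad hoc against the same Thanos endpoint linked in the pre-change steps, using the standard Prometheus HTTP API (a sketch; it assumes you are authenticated to thanos-query.ops.gitlab.net):

```shell
# Same expression as the thanos-query link in the pre-change steps.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=pg_stat_database_numbackends{env="gprd"}'
```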
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No, but we re-size the PGBouncer database pool sizes.
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! cases include DB changes) prior to change being rolled out. (In #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- There are currently no active incidents.