[GSTG] Rollout decomposed CI "Phase 4" pgbouncer migration
**Environment:** Staging | **Type:** Change

## Change Summary
In Phase 4 of our CI decomposition we change the Rails application to start using a new connection for read-write queries. This new read-write connection will point to a new set of PGBouncer hosts (which we call PGBouncer CI). These PGBouncer hosts, however, will still point to the main Patroni cluster, as we are not yet ready to fully decompose the CI database.

This step gets us to the point where the application behaves as if it is reading and writing 2 independent databases. They just happen to still be the same database, which reduces risk considerably: there is no possibility of split-brain, and we can easily revert if the application runs into bugs with 2 separate connections. A rough sketch of the resulting connection layout follows.
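To make the "two logical databases, one physical database" state concrete, this is roughly the shape `config/database.yml` takes after this phase. The `ci` host is the one from this CR; the `main` host name and database names are illustrative assumptions, not values copied from the real gstg config.

```shell
# Sketch only: the decomposed Rails database config after Phase 4.
# Both logical connections terminate at the SAME Patroni primary, just via
# different PgBouncer fleets. The "main" host and database names are assumed.
cat <<'YAML'
production:
  main:
    host: pgbouncer.int.gstg.gitlab.net      # existing main PgBouncers (assumed name)
    database: gitlabhq_production
  ci:
    host: pgbouncer-ci.int.gstg.gitlab.net   # new CI PgBouncers (from this CR)
    database: gitlabhq_production            # same physical database for now
YAML
```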
Prior to this we are in Phase 3, where GitLab only uses the new CI Patroni cluster for "read-only" queries, via the separate codepath GitLab already has for handling known-delayed replicas.

One additional complexity in this step of the rollout is migrating connections from the old PGBouncer hosts to the new PGBouncer hosts without exceeding the total limit of connections to the primary. You can read more in gitlab-org/gitlab#347203 (closed) and the comments on that issue, but it was determined that we can do this safely during low-usage hours by migrating only small percentages of connections at a time. That is why this change request only goes up to 15% of traffic and 5 connections per PGBouncer host. After that we plan to use the data from this smaller increment to work out the next safe increment. On staging it will likely be easy to jump straight to 100% because usage is low, but on production we will need to be more careful with increment sizes. The back-of-envelope budget below shows the arithmetic.
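A minimal sketch of that arithmetic, assuming the 400-connection threshold from the Monitoring section, the 6-host CI fleet, and the 5-connection pool size used at the 15% step below; the current main-side connection count is a placeholder you would read off the dashboards:

```shell
#!/usr/bin/env bash
# Rough connection-budget arithmetic for the 15% increment. Values marked
# "assumed" are placeholders; read the real numbers from the dashboards.
MAX_SAFE_PRIMARY_CONN=400   # rollback threshold from the Monitoring section
CI_HOSTS=6                  # new PgBouncer CI fleet size (from this CR)
CI_POOL_SIZE=5              # per-host pool size at the 15% step
CURRENT_MAIN_CONN=350       # assumed: read from the PostgreSQL overview dashboard

NEW_CI_CONN=$((CI_HOSTS * CI_POOL_SIZE))    # worst case: 30 extra connections
TOTAL=$((CURRENT_MAIN_CONN + NEW_CI_CONN))
echo "worst-case primary connections: ${TOTAL} (limit ${MAX_SAFE_PRIMARY_CONN})"
```

This is also why the plan decreases the main PGBouncer pools as the CI pools grow: the two fleets share one budget on the primary.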
## Change Details
- **Services Impacted** - PgBouncer, API, Web, Postgres Database
- **Change Technician** - @rhenchen.gitlab @DylanGriffith
- **Change Reviewer** - @Finotto
- **Time tracking** - 5 days
- **Downtime Component** - None
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

**Estimated Time to Complete (mins)** - ~~10 minutes~~ at least 2 hours (due to Rails config upgrade, step 5)
- [ ] Set pool size limit of new PGBouncer hosts to `1` for Web/API and Sidekiq
- [ ] Test our tool for generating load and update change steps below with the command we want to run #6370 (comment 879738859)
- [ ] Check that gitlab-org/gitlab!83162 (merged) is deployed to staging
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Set primary host for `ci` in `config/database.yml` to the new CI PGBouncer as `host` for the `ci` connection
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1548
    - Should correspond to https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/f06fb3b7ab8d50a4b991ed41779df011598d9d4d/roles/gstg-base.json#L385; based on the chef code it should live under `ci`: `db_host` and should take the value `pgbouncer-ci.int.gstg.gitlab.net`
  - MR (canary): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1604 (merged)
    - Should be similar to the `psql: ci` section of gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (diffs) except we need to set `host` to `pgbouncer-ci.int.gstg.gitlab.net`
  - MR (gprd): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1605 (merged)
    - Should be similar to the `psql: ci` sections of gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged) and gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged) except we need to set `global.psql.ci.host: pgbouncer-ci.int.gstg.gitlab.net` and `sidekiq.psql.ci.host: pgbouncer-sidekiq-ci.int.gstg.gitlab.net`
- [ ] Confirm which host is the Patroni Writer - Primary Host:
- [ ] Confirm the number of connections is below the 400-connection threshold (see the query sketch after this list)
  - https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&viewPanel=17&var-prometheus=Global&var-environment=gstg&var-type=patroni
  - https://thanos-query.ops.gitlab.net/graph?g0.expr=pg_stat_database_numbackends%7Benv%3D%22gstg%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] [optional] Clone https://gitlab.com/rhenchen.gitlab/rhenchen/-/tree/main/scripts and get familiar with the `ssh_cluster_regex.sh` script (a rough sketch of what it does follows this list)
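If you prefer the CLI to the Thanos UI for the connection-count check above, something like the following should work, assuming the standard Prometheus/Thanos HTTP query API is reachable from your workstation and `jq` is installed:

```shell
# Same query as the Thanos link above, from the command line.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=sum(pg_stat_database_numbackends{env="gstg"})' \
  | jq -r '.data.result[].value[1]'
```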
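We have not reproduced the real `ssh_cluster_regex.sh` here; conceptually it is just "run one command on every host whose name matches a regex". A minimal sketch, assuming the host list can come from the Chef inventory via `knife` (the real script in the repository above may differ):

```shell
#!/usr/bin/env bash
# Minimal sketch of an ssh_cluster_regex.sh-style helper: run one command on
# every node whose name matches a regex. Illustrative only; the real script
# lives in the linked repository.
set -euo pipefail
regex="$1"; shift
for host in $(knife node list | grep -E "$regex"); do
  echo "=== ${host} ==="
  ssh "$host" "$@"
done
```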
### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 60 minutes
1. [ ] Trigger significant write workload in `gstg` before starting the change to ensure that we have some active pgbouncer connections to the primary (TODO BY QUALITY: #6370 (comment 879738859))
2. [ ] Do all the below steps on staging or some other environment with load to verify PGBouncer behaves sensibly (i.e. it closes connections in a reasonably timely manner). (This is staging.)
3. [ ] During quiet working hours (~08:00 UTC)
4. [ ] Enable 1% of CI related queries to go to the new PGBouncer hosts
   - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --staging --random`
5. [ ] Observe the new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 6 connections across all 6 hosts. (Slightly exceeded on staging, but that wasn't really representative as our load testing was heavily CI traffic.)
   - Dashboard: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
   - Execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes (see the counting sketch after this list)
6. [ ] WAIT 24hrs to be sure nothing is wrong. (Skipped for staging.)
7. [ ] At non-peak hours
8. [ ] Set pool size limit of new PGBouncer hosts to `5` for Web/API and Sidekiq
9. [ ] Re-run chef on all PGBouncer CI hosts
   - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo chef-client"`
   - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
10. [ ] Enable 15% of CI related queries to go to the new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 15 --staging --random`
11. [ ] Observe the new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 30 connections across all 6 hosts (5 connections per node)
12. [ ] Set `reserve_pool=0` for Main PGBouncer hosts if they have a reserve pool higher than 0 (zero); an illustrative pool entry follows this list
    - Execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` to check the `reserve_pool` size on all nodes
    - If `reserve_pool` > 0, add `reserve_pool=0` to the pools setup (same Chef roles as in the next step)
13. [ ] Decrease total connections on Main PGBouncer hosts by 5 (assuming the data suggests this is fine)
14. [ ] Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
15. [ ] Observe that some connections are now closed on the main PGBouncer hosts after a few minutes (or that the number open was already below the new limit)
16. [ ] Use metrics from the above steps to figure out the next increment sizes (how much % to enable and how many connections to move)
17. [ ] Increment to 50% - perform steps 7 to 15 again, but using:
    - Step 8 - Merge MR to increase `pgbouncer-ci` pool size to (13 and 10): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1583
    - Step 10 - Slack Command: `/chatops run feature set force_no_sharing_primary_model 50 --staging --random`
    - Step 13 - Merge MR to decrease `pgbouncer-main` pool size (38 and 30): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1584
18. [ ] Check the amount of workload in `patroni-ci` and compare with `patroni` to decide the pool sizes, considering `max_connections` for the Postgres primary = `500`
19. [ ] Increment to 100% - perform steps 7 to 15 again, but using:
    - Step 8 - Resize pgbouncer-ci and pgbouncer-main pools to a 40-60% ratio: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1587
    - Step 10 - Slack Command: `/chatops run feature set force_no_sharing_primary_model 100 --staging --random`
    - Step 13 - Merged with Step 8
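For the "observe connections" checks in steps 5 and 11, one way to turn the `SHOW SERVERS` output into a number (illustrative; the exact output format depends on the PgBouncer and psql versions in use):

```shell
# Count gitlab-user server connections reported across the PgBouncer CI fleet.
# pgb-console prints pipe-separated columns, so a word-match on the user is a
# reasonable approximation here.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -cw gitlab
```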
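For step 12, a hedged illustration of what a `[databases]` pool entry with the reserve pool disabled might look like; the real entries are rendered by the Chef roles referenced above, and the host and sizes here are placeholders:

```shell
# Illustrative PgBouncer [databases] entry after step 12; real values come
# from Chef, and <patroni-primary> is a placeholder.
cat <<'INI'
[databases]
gitlabhq_production = host=<patroni-primary> pool_size=5 reserve_pool=0
INI
```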
### Post-Change Steps - steps to take to verify the change

**Estimated Time to Complete (mins)** - 0 minutes
- After bumping the workload you can validate the client and server connections into the pgbouncers (see the filtering sketch after this list):
  - To validate connections from pgbouncer into the database:
    - For CI, execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - For MAIN, execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - Mind only `gitlab` user connections (should not be more than `pool_size`)
  - To validate connections coming from the application into pgbouncer:
    - For CI, execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW CLIENTS;\""`
    - Mind only `gitlab` user connections
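A sketch of how you might filter the `SHOW CLIENTS` output down to just the `gitlab` user; the column layout may vary with the psql/PgBouncer versions in use:

```shell
# Keep only rows whose second pipe-separated column is the gitlab user,
# then count them. Illustrative; adjust the field index to your output.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW CLIENTS;\"" \
  | awk -F'|' '$2 ~ /^ *gitlab *$/' | wc -l
```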
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

**Estimated Time to Complete (mins)** - 10 minutes
- [ ] Increase total connections on the old PGBouncer hosts
  - MR: revert all executed MRs to the last stable state
- [ ] Force a Chef client run to apply the changes:
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gstg" "sudo chef-client"`
- [ ] Check pool sizing for all pgbouncer nodes in gstg
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- [ ] Disable the feature flag: `/chatops run feature set force_no_sharing_primary_model 0 --staging --random` (a quick drain check follows this list)
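After the rollback, a quick way to confirm the CI PgBouncers have stopped holding server connections to the primary; counts should trend towards zero as idle connections close (illustrative, same caveats as the earlier sketches):

```shell
# Server connections still open from the PgBouncer CI fleet; expect ~0 once
# the feature flag is back at 0. grep -c exits non-zero on zero matches,
# hence the trailing "|| true".
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -cw gitlab || true
```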
## Monitoring

### Key metrics to observe
- Metric: Primary Total Connections & Activity Total
  - Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-prometheus=Global&var-environment=gstg&var-type=patroni
  - What changes to this metric should prompt a rollback: Exceeding 400 Primary Total Connections
- Metric: Sentry Errors
  - Location: https://sentry.gitlab.net/gitlab/gitlabcom/
  - What changes to this metric should prompt a rollback: New errors likely related to this change (timing, and related to database connections)
- PGBouncer Main
  - Metric: PGBouncer Main Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=2040732717&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer Main Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=55&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
- PGBouncer CI
  - Metric: PGBouncer CI Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=2120868752&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer CI Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=541&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No, but it re-sizes the pgbouncer database pool sizes.
## Change Reviewer checklist

- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
## Change Technician checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.