[GPRD] Rollout decomposed CI "Phase 4" pgbouncer migration
Production Change
Change Summary
Equivalent staging change at #6370 (closed)
In Phase 4 of our CI decomposition we change the Rails application to start using a new connection for read-write queries. This new read-write connection will point to a new set of PGBouncer hosts (which we call PGBouncer CI). These PGBouncer hosts, however, will still point to the main Patroni cluster, as we are not yet ready to fully decompose the CI database.
This step effectively gets us to the point where the application behaves as if it is reading and writing two independent databases. They just happen to still be the same database, which reduces risk considerably: there is no possibility of split-brain, and we can easily revert if the application runs into bugs with two separate connections.
Prior to this we are in Phase 3, where GitLab only uses the new CI Patroni cluster for "read-only" queries, which go through a separate codepath in GitLab that handles known delayed replicas.
One additional complexity in this step of the rollout is migrating connections from the old PGBouncer hosts to the new PGBouncer hosts without exceeding the total limit of connections to the primary. You can read more in gitlab-org/gitlab#347203 (closed) and the comments in this issue, but it was determined that we can safely do this during low-usage hours by migrating only small percentages of connections at a time. That is why this change request only goes up to 15% of traffic and 5 connections per PGBouncer host. After that we plan to use the data from this smaller increment to figure out the next safe increment. On staging it will likely be easy to jump straight to 100% because usage is low, but on production we will need to be more careful with increment size.
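The arithmetic behind that increment sizing, as a rough sketch (the numbers come from this plan: 6 CI PGBouncer hosts, 5 connections per host, and the 400-connection threshold on the primary; the current-connections figure below is a placeholder to be replaced with the live value from the PostgreSQL overview dashboard):

```shell
# Back-of-the-envelope headroom check for one increment (sketch, not a tool we run).
ci_hosts=6                        # CI PGBouncer hosts in the new fleet
per_host_pool=5                   # pool_size per host at the end of this CR
primary_limit=400                 # connection threshold we must stay under
current_primary_connections=320   # placeholder: read the real value off the dashboard

new_ci_connections=$((ci_hosts * per_host_pool))
headroom=$((primary_limit - current_primary_connections))
echo "This increment can open up to ${new_ci_connections} new primary connections"
echo "Headroom before the ${primary_limit}-connection threshold: ${headroom}"
```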
Change Details
- Services Impacted - Service::Pgbouncer, Service::API, Service::Web, Service::Postgres, Database
- Change Technician - @rhenchen.gitlab @DylanGriffith
- Change Reviewer -
- Time tracking - 5 days
- Downtime Component - None
Detailed steps for the change
Pre-Change Steps - steps to be completed before execution of the change
Estimated Time to Complete (mins) - ~~10 minutes~~ at least 2 hours (due to rails config upgrade, step 5)
- Do all the below steps on staging or some other environment with load to verify PGBouncer behaves sensibly (i.e. it closes connections in a reasonably timely manner)
- Set label ~"change::in-progress" on this issue
- Remove the `_ci` suffix from the Sidekiq CI PGBouncer pool name
- Set pool size limit of new PGBouncer hosts to `1` for Web/API and Sidekiq
  - MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1601 (same MR as previous step)
- [optional] clone https://gitlab.com/rhenchen.gitlab/rhenchen/-/tree/main/scripts and get familiar with the `ssh_cluster_regex.sh` script (a rough sketch of its likely behaviour follows this list)
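For anyone reviewing this plan without cloning the scripts repository, `ssh_cluster_regex.sh` is assumed to behave roughly like the sketch below: resolve the hosts whose names match a regex and run the given command on each of them over SSH. The `knife node list` inventory call and the per-host header format are assumptions; refer to the actual script for the authoritative behaviour.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of: ssh_cluster_regex.sh <host-regex> "<command>"
# Assumption: the Chef inventory (knife node list) returns one FQDN per line.
set -euo pipefail

regex="$1"
cmd="$2"

for host in $(knife node list | grep -E "$regex"); do
  echo "=== ${host} ==="
  ssh "$host" -- "$cmd"
done
```

Every usage in this plan passes a host regex such as `"(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd"` plus a quoted command to run on each node.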
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 60 minutes
- During quiet working hours (~08:00 UTC)
- Set the primary host for `ci` in `config/database.yml` to the new CI PGBouncer as `host` for the `ci` connection, override the value on the console node so that it still points directly at the Patroni primary, and set `database_tasks: false` for the `ci` configuration
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1604
    - Corresponds to staging MRs https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1548 and https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1581 and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (canary): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1666 (merged)
    - Corresponds to staging MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1604 (merged) and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (gprd): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1667 (merged)
    - Corresponds to staging MR gitlab-com/gl-infra/k8s-workloads/gitlab-com!1605 (merged) and the staging changes in #6730 (closed); these can/should be combined into a single MR for production
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1604
- Confirm which host is the Patroni Writer
  - Primary Host:
- Confirm the number of connections is below the threshold of 400 connections
  - https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&viewPanel=17&var-prometheus=Global&var-environment=gprd&var-type=patroni
  - https://thanos-query.ops.gitlab.net/graph?g0.expr=pg_stat_database_numbackends%7Benv%3D%22gprd%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- Enable 1% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --random`
- Set back to 0%: we discovered a TCP load balancer misconfiguration (#6440 (comment 912153473)), and fixing it requires removing and re-adding the TCP load balancer, which would cause errors for 1% of traffic while it is ongoing, so we disable the feature flag while doing this
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 0 --random --ignore-production-check`
- Fix the TCP load balancer configuration via https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/3658, as discovered in #6440 (comment 912153473)
- Set pool size limit of new PGBouncer hosts to `2` for Web/API and Sidekiq
- Enable 1% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --random`
- Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 6 connections across all 6 hosts (see the counting sketch after this list)
  - Dashboard: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd
  - Execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
- WAIT 24hrs to be sure nothing is wrong
- At non-peak hours
- Set pool size limit of new PGBouncer hosts to `5` for Web/API and Sidekiq
- Re-run chef on all PGBouncer CI hosts
  - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
  - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Enable 15% of CI related queries to go to new PGBouncer hosts
  - Slack Command: `/chatops run feature set force_no_sharing_primary_model 15 --random`
- Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 30 connections across all 6 hosts (5 connections per node)
- Set `reserve_pool=0` for Main PGBouncer hosts if they have a reserve pool higher than 0 (zero)
  - Execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` to check the `reserve_pool` size on all nodes
  - If `reserve_pool` > 0 then add `reserve_pool=0` to the pools setup (same Chef roles as in the next step)
- Decrease total connections on Main PGBouncer hosts by 5 (assuming the data suggests this is fine)
- Re-run chef on all Main PGBouncer hosts
  - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
  - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
- Use metrics from the above steps to figure out the next increment sizes (how much % to enable and how many connections to move)
- Push to 50% CI Workload
  - At non-peak hours
  - Contact @sre-oncall
  - Set label ~"change::in-progress" on this issue
  - Increase pool size limit of CI PGBouncer hosts
  - Re-run chef on all PGBouncer CI hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Enable 50% of CI related queries to go to new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 50 --random --ignore-production-check`
  - Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres
  - Decrease total connections on Main PGBouncer hosts
  - Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
- Rebalance to 50% CI Workload
  - Increase pool size limit of CI PGBouncer hosts, decrease Main Web/API pool
  - Re-run chef on all PGBouncer hosts (CI and Main)
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-|pgbouncer-sidekiq-).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-|pgbouncer-sidekiq-).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Push to 100% CI Workload
  - Increase the `gitlab` user connection limit in Postgres so we have a buffer in case we need to increase a pool size during some peak saturation
  - At non-peak hours
  - Contact @sre-oncall
  - Set label ~"change::in-progress" on this issue
  - Increase pool size limit of CI PGBouncer hosts
  - Re-run chef on all PGBouncer CI hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Enable 100% of CI related queries to go to new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 100 --random --ignore-production-check`
  - Observe new CI PGBouncer hosts start accepting new writes and opening connections to Postgres
  - Decrease total connections on Main PGBouncer hosts
  - Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
  - Observe some connections are now closed on Main PGBouncer hosts after a few minutes (or observe that the number open was already below the new limit)
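For the "should not exceed N connections" observations above, a rough count can be taken straight from `SHOW SERVERS` output instead of eyeballing it. This is only a sketch: it assumes `pgb-console` prints a psql-style table in which each server-connection row contains the `gitlab` user name surrounded by spaces.

```shell
# Count backend (server-side) connections opened as the `gitlab` user across
# the CI PGBouncer fleet; compare the total against the limit for the current
# increment (6 at 1%, 30 at 15%, per the steps above).
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -c ' gitlab '
```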
Post-Change Steps - steps to take to verify the change
Estimated Time to Complete (mins) - 0 minutes
- After bumping the workload you can validate the client and server connections into the PGBouncers:
  - To validate connections from PGBouncer into the database:
    - For CI execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - For MAIN execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gprd" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - Mind only `gitlab` user connections (should not be more than `pool_size`); a per-host tally sketch follows this list
  - To validate connections coming from the application into PGBouncer:
    - For CI execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" "sudo pgb-console -c \"SHOW CLIENTS;\""`
    - Mind only `gitlab` user connections
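To compare the `gitlab` user connections against `pool_size` per host rather than fleet-wide, something like the following awk tally can help. It is a sketch that assumes the wrapper prints a per-host header line (as in the `ssh_cluster_regex.sh` sketch under the pre-change steps); adjust the parsing to however your output is actually delimited.

```shell
# Tally `gitlab` server connections per PGBouncer host and print host + count,
# so each host can be checked against its configured pool_size.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" \
  | awk '/^=== /{host=$2} / gitlab /{count[host]++} END{for (h in count) print h, count[h]}'
```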
Rollback
(1) Consider just increasing the pool size of the saturated PGBouncer
If only a single PGBouncer pool is saturated, then consider first just increasing the pool size for the saturated PGBouncer configuration. We have increased the user limit to 360 across all PGBouncer pools, so just ensure that the sum across all pools does not exceed this (a summing sketch follows the list below).
To do this:
- Consider the following metrics:
  - PGBouncer CI Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
  - PGBouncer Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
- Observe where there is higher saturation and compare `Active Backend Server Connections` to the total allowed per pool. Since there are 3 hosts for each pool you need to multiply the following configuration values by 3 to see the total allowed:
- Make a merge request to increase the pools across the hosts to reduce saturation
- Merge the merge request
- Run `chef-client` to force updating on all the affected PGBouncer hosts
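To sanity-check the 360-connection user limit mentioned above before merging a pool-size increase, the configured `pool_size` values can be summed across every pool on every PGBouncer host. This is a sketch: the awk field number assumes `pool_size` is the 6th `|`-separated column of `SHOW DATABASES` output on our PgBouncer version, so confirm against the header row before trusting the total.

```shell
# Sum configured pool_size across all pools on all PGBouncer hosts and compare
# the total against the gitlab user connection limit (360).
ssh_cluster_regex.sh "pgbouncer.*gprd" \
  "sudo pgb-console -c \"SHOW DATABASES;\"" \
  | awk -F'|' '$6 ~ /^[ 0-9]+$/ {sum += $6} END {print "sum of pool_size:", sum}'
```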
(2) Consider rebalancing connection pools first
If the problem prompting a rollback is connection pool saturation on one pool of PGBouncer hosts, then consider instead rebalancing some connections from underutilized pools. The biggest challenge with this rollout has been predicting ahead of time how to re-allocate connections from one pool to another while migrating traffic across the pools, so consider rebalancing if we got some of our estimates wrong.
To do this:
- Consider the following metrics:
  - PGBouncer CI Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
  - PGBouncer Overview => In particular the `Active Backend Server Connections per Database` chart and the `Total Connection Wait Time` chart
- Observe where there is lower saturation and compare `Active Backend Server Connections` to the total allowed per pool. Since there are 3 hosts for each pool you need to multiply the following configuration values by 3 to see the total allowed:
- Make a merge request to re-balance the pools across the hosts to reduce saturation
- Merge the merge request
- Run `chef-client` to force updating on all the affected PGBouncer hosts
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 20 minutes
If none of the pool size adjustments above resolve the problem, then you can roll back using the procedure below.
This rollback process should be appropriate if there is some bug with our new CI PGBouncer hosts, and it brings us back to the state we were in before starting this change request. It may not be the best option if we somehow find ourselves saturating connections to the primary Postgres instance, because the process below increases total connection limits before decreasing them. Our analysis suggests such an event is not possible, so the process below should usually work. If for some reason we did saturate the primary Postgres server connection limit, we could instead reorder the same steps: first disable the feature flag, then decrease connection limits on the CI PGBouncer pool, and finally increase connection limits on the Main PGBouncer pool.
- Increase total connections on old PGBouncer hosts by reverting https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1617
- Force Chef client to apply the changes
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo chef-client"`
- Check pool sizing for all pgbouncer nodes in gprd
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- Disable the feature flag: `/chatops run feature set force_no_sharing_primary_model 0 --random` (a pool drain-check sketch follows these steps)
- Decrease the connection pool sizes on PGBouncer CI by reverting https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1654
- Force Chef client to apply the changes
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo chef-client"`
- Check pool sizing for all pgbouncer nodes in gprd
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gprd" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
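After the rollback, the CI PGBouncer pools should drain on their own once the feature flag is back at 0. A quick way to watch this is via `SHOW POOLS` (a standard PgBouncer admin command); this sketch just narrows the output to the `gitlab` user's pools, whose `cl_active`/`sv_active` columns should trend towards zero within a few minutes.

```shell
# Watch the CI PGBouncer pools drain after disabling the feature flag.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gprd" \
  "sudo pgb-console -c \"SHOW POOLS;\"" | grep ' gitlab '
```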
Monitoring
Key metrics to observe
- Metric: Primary Total Connections & Activity Total
  - Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-prometheus=Global&var-environment=gprd&var-type=patroni
  - What changes to this metric should prompt a rollback: Exceeding 400 Primary Total Connections (an ad-hoc query sketch for this metric follows this list)
- Metric: Sentry Errors
  - Location: https://sentry.gitlab.net/gitlab/gitlabcom/
  - What changes to this metric should prompt a rollback: New errors likely related to this change (timing and related to database connections)
- PGBouncer Main
  - Metric: PGBouncer Main Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=2040732717&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer Main Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=55&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
- PGBouncer CI
  - Metric: PGBouncer CI Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=2120868752&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer CI Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=541&var-PROMETHEUS_DS=Global&var-environment=gprd
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
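If Grafana is unavailable, the primary-connections metric can also be queried ad hoc against the same Thanos endpoint linked in the pre-change steps, using the standard Prometheus HTTP API (a sketch; it assumes you are authenticated to thanos-query.ops.gitlab.net):

```shell
# Same expression as the thanos-query link in the pre-change steps.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=pg_stat_database_numbackends{env="gprd"}'
```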
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No, but we re-size the PGBouncer database pool sizes.
Change Reviewer checklist
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
Change Technician checklist
- This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- This issue has the change technician as the assignee.
- Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- This Change Issue is linked to the appropriate Issue and/or Epic.
- Necessary approvals have been completed based on the Change Management Workflow.
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- SRE on-call has been informed prior to change being rolled out. (In #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- Release managers have been informed (if needed! cases include DB changes) prior to change being rolled out. (In #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- There are currently no active incidents.