[GSTG] Rollout decomposed CI "Phase 4" pgbouncer migration
**Environment:** Staging | **Type:** Change

## Change Summary
In Phase 4 of our CI decomposition we change the Rails application to start using a new connection for read-write queries. This new read-write connection will point to a new set of PGBouncer hosts (which we call PGBouncer CI). These PGBouncer hosts, however, will still point to the main Patroni cluster, as we are not yet ready to fully decompose the CI database.

This step gets us to the point where the application behaves as if it is reading and writing 2 independent databases. They just happen to still be the same database, which reduces risk considerably: there is no possibility of split-brain, and we can easily revert if the application runs into bugs with 2 separate connections. A rough sketch of the resulting connection layout follows.
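To make the "two logical databases, one physical database" state concrete, this is roughly the shape `config/database.yml` takes after this phase. The `ci` host is the one from this CR; the `main` host name and database names are illustrative assumptions, not values copied from the real gstg config.

```shell
# Sketch only: the decomposed Rails database config after Phase 4.
# Both logical connections terminate at the SAME Patroni primary, just via
# different PgBouncer fleets. The "main" host and database names are assumed.
cat <<'YAML'
production:
  main:
    host: pgbouncer.int.gstg.gitlab.net      # existing main PgBouncers (assumed name)
    database: gitlabhq_production
  ci:
    host: pgbouncer-ci.int.gstg.gitlab.net   # new CI PgBouncers (from this CR)
    database: gitlabhq_production            # same physical database for now
YAML
```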
Prior to this we are in Phase 3, where GitLab only uses the new CI Patroni cluster for "read-only" queries, via the separate codepath GitLab already has for handling known-delayed replicas.

One additional complexity in this step of the rollout is migrating connections from the old PGBouncer hosts to the new PGBouncer hosts without exceeding the total limit of connections to the primary. You can read more in gitlab-org/gitlab#347203 (closed) and the comments on that issue, but it was determined that we can do this safely during low-usage hours by migrating only small percentages of connections at a time. That is why this change request only goes up to 15% of traffic and 5 connections per PGBouncer host. After that we plan to use the data from this smaller increment to work out the next safe increment. On staging it will likely be easy to jump straight to 100% because usage is low, but on production we will need to be more careful with increment sizes. The back-of-envelope budget below shows the arithmetic.
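A minimal sketch of that arithmetic, assuming the 400-connection threshold from the Monitoring section, the 6-host CI fleet, and the 5-connection pool size used at the 15% step below; the current main-side connection count is a placeholder you would read off the dashboards:

```shell
#!/usr/bin/env bash
# Rough connection-budget arithmetic for the 15% increment. Values marked
# "assumed" are placeholders; read the real numbers from the dashboards.
MAX_SAFE_PRIMARY_CONN=400   # rollback threshold from the Monitoring section
CI_HOSTS=6                  # new PgBouncer CI fleet size (from this CR)
CI_POOL_SIZE=5              # per-host pool size at the 15% step
CURRENT_MAIN_CONN=350       # assumed: read from the PostgreSQL overview dashboard

NEW_CI_CONN=$((CI_HOSTS * CI_POOL_SIZE))    # worst case: 30 extra connections
TOTAL=$((CURRENT_MAIN_CONN + NEW_CI_CONN))
echo "worst-case primary connections: ${TOTAL} (limit ${MAX_SAFE_PRIMARY_CONN})"
```

This is also why the plan decreases the main PGBouncer pools as the CI pools grow: the two fleets share one budget on the primary.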
## Change Details
- **Services Impacted** - PgBouncer, API, Web, Postgres Database
- **Change Technician** - @rhenchen.gitlab @DylanGriffith
- **Change Reviewer** - @Finotto
- **Time tracking** - 5 days
- **Downtime Component** - None
## Detailed steps for the change
### Pre-Change Steps - steps to be completed before execution of the change

**Estimated Time to Complete (mins)** - ~~10 minutes~~ at least 2 hours (due to Rails config upgrade, step 5)
- [ ] Set pool size limit of new PGBouncer hosts to `1` for Web/API and Sidekiq
- [ ] Test our tool for generating load and update change steps below with the command we want to run #6370 (comment 879738859)
- [ ] Check that gitlab-org/gitlab!83162 (merged) is deployed to staging
- [ ] Set label ~"change::in-progress" on this issue
- [ ] Set primary host for `ci` in `config/database.yml` to the new CI PGBouncer as `host` for the `ci` connection
  - MR (chef): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1548
    - Should correspond to https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/blob/f06fb3b7ab8d50a4b991ed41779df011598d9d4d/roles/gstg-base.json#L385; based on the chef code it should live under `ci`: `db_host` and should take the value `pgbouncer-ci.int.gstg.gitlab.net`
  - MR (canary): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1604 (merged)
    - Should be similar to the `psql: ci` section of gitlab-com/gl-infra/k8s-workloads/gitlab-com!1377 (diffs) except we need to set `host` to `pgbouncer-ci.int.gstg.gitlab.net`
  - MR (gprd): gitlab-com/gl-infra/k8s-workloads/gitlab-com!1605 (merged)
    - Should be similar to the `psql: ci` sections of gitlab-com/gl-infra/k8s-workloads/gitlab-com!1378 (merged) and gitlab-com/gl-infra/k8s-workloads/gitlab-com!1381 (merged) except we need to set `global.psql.ci.host: pgbouncer-ci.int.gstg.gitlab.net` and `sidekiq.psql.ci.host: pgbouncer-sidekiq-ci.int.gstg.gitlab.net`
- [ ] Confirm which host is the Patroni Writer - Primary Host:
- [ ] Confirm the number of connections is below the 400-connection threshold (see the query sketch after this list)
  - https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&viewPanel=17&var-prometheus=Global&var-environment=gstg&var-type=patroni
  - https://thanos-query.ops.gitlab.net/graph?g0.expr=pg_stat_database_numbackends%7Benv%3D%22gstg%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=2h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- [ ] [optional] Clone https://gitlab.com/rhenchen.gitlab/rhenchen/-/tree/main/scripts and get familiar with the `ssh_cluster_regex.sh` script (a rough sketch of what it does follows this list)
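If you prefer the CLI to the Thanos UI for the connection-count check above, something like the following should work, assuming the standard Prometheus/Thanos HTTP query API is reachable from your workstation and `jq` is installed:

```shell
# Same query as the Thanos link above, from the command line.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=sum(pg_stat_database_numbackends{env="gstg"})' \
  | jq -r '.data.result[].value[1]'
```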
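We have not reproduced the real `ssh_cluster_regex.sh` here; conceptually it is just "run one command on every host whose name matches a regex". A minimal sketch, assuming the host list can come from the Chef inventory via `knife` (the real script in the repository above may differ):

```shell
#!/usr/bin/env bash
# Minimal sketch of an ssh_cluster_regex.sh-style helper: run one command on
# every node whose name matches a regex. Illustrative only; the real script
# lives in the linked repository.
set -euo pipefail
regex="$1"; shift
for host in $(knife node list | grep -E "$regex"); do
  echo "=== ${host} ==="
  ssh "$host" "$@"
done
```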
### Change Steps - steps to take to execute the change

**Estimated Time to Complete (mins)** - 60 minutes
1. [ ] Trigger significant write workload in `gstg` before starting the change to ensure that we have some active pgbouncer connections to the primary (TODO BY QUALITY: #6370 (comment 879738859))
2. [ ] Do all the below steps on staging or some other environment with load to verify PGBouncer behaves sensibly (i.e. it closes connections in a reasonably timely manner). (This is staging.)
3. [ ] During quiet working hours (~08:00 UTC)
4. [ ] Enable 1% of CI related queries to go to the new PGBouncer hosts
   - Slack Command: `/chatops run feature set force_no_sharing_primary_model 1 --staging --random`
5. [ ] Observe the new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 6 connections across all 6 hosts. (Slightly exceeded on staging, but that wasn't really representative as our load testing was heavily CI traffic.)
   - Dashboard: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
   - Execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes (see the counting sketch after this list)
6. [ ] WAIT 24hrs to be sure nothing is wrong. (Skipped for staging.)
7. [ ] At non-peak hours
8. [ ] Set pool size limit of new PGBouncer hosts to `5` for Web/API and Sidekiq
9. [ ] Re-run chef on all PGBouncer CI hosts
   - Execute: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo chef-client"`
   - Confirm: `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
10. [ ] Enable 15% of CI related queries to go to the new PGBouncer hosts
    - Slack Command: `/chatops run feature set force_no_sharing_primary_model 15 --staging --random`
11. [ ] Observe the new CI PGBouncer hosts start accepting new writes and opening connections to Postgres. They should not exceed a total of 30 connections across all 6 hosts (5 connections per node)
12. [ ] Set `reserve_pool=0` for Main PGBouncer hosts if they have a reserve pool higher than 0 (zero); an illustrative pool entry follows this list
    - Execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` to check the `reserve_pool` size on all nodes
    - If `reserve_pool` > 0, add `reserve_pool=0` to the pools setup (same Chef roles as in the next step)
13. [ ] Decrease total connections on Main PGBouncer hosts by 5 (assuming the data suggests this is fine)
14. [ ] Re-run chef on all Main PGBouncer hosts
    - Execute: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo chef-client"`
    - Confirm: `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
15. [ ] Observe that some connections are now closed on the main PGBouncer hosts after a few minutes (or that the number open was already below the new limit)
16. [ ] Use metrics from the above steps to figure out the next increment sizes (how much % to enable and how many connections to move)
17. [ ] Increment to 50% - perform steps 7 to 15 again, but using:
    - Step 8 - Merge MR to increase `pgbouncer-ci` pool size to (13 and 10): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1583
    - Step 10 - Slack Command: `/chatops run feature set force_no_sharing_primary_model 50 --staging --random`
    - Step 13 - Merge MR to decrease `pgbouncer-main` pool size (38 and 30): https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1584
18. [ ] Check the amount of workload in `patroni-ci` and compare with `patroni` to decide the pool sizes, considering `max_connections` for the Postgres primary = `500`
19. [ ] Increment to 100% - perform steps 7 to 15 again, but using:
    - Step 8 - Resize pgbouncer-ci and pgbouncer-main pools to a 40-60% ratio: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1587
    - Step 10 - Slack Command: `/chatops run feature set force_no_sharing_primary_model 100 --staging --random`
    - Step 13 - Merged with Step 8
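For the "observe connections" checks in steps 5 and 11, one way to turn the `SHOW SERVERS` output into a number (illustrative; the exact output format depends on the PgBouncer and psql versions in use):

```shell
# Count gitlab-user server connections reported across the PgBouncer CI fleet.
# pgb-console prints pipe-separated columns, so a word-match on the user is a
# reasonable approximation here.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -cw gitlab
```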
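For step 12, a hedged illustration of what a `[databases]` pool entry with the reserve pool disabled might look like; the real entries are rendered by the Chef roles referenced above, and the host and sizes here are placeholders:

```shell
# Illustrative PgBouncer [databases] entry after step 12; real values come
# from Chef, and <patroni-primary> is a placeholder.
cat <<'INI'
[databases]
gitlabhq_production = host=<patroni-primary> pool_size=5 reserve_pool=0
INI
```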
### Post-Change Steps - steps to take to verify the change

**Estimated Time to Complete (mins)** - 0 minutes
- After bumping the workload you can validate the client and server connections into the pgbouncers (see the filtering sketch after this list):
  - To validate connections from pgbouncer into the database:
    - For CI, execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - For MAIN, execute `ssh_cluster_regex.sh "(pgbouncer-0|pgbouncer-sidekiq-0).*gstg" "sudo pgb-console -c \"SHOW SERVERS;\""` for all nodes
    - Mind only `gitlab` user connections (should not be more than `pool_size`)
  - To validate connections coming from the application into pgbouncer:
    - For CI, execute `ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" "sudo pgb-console -c \"SHOW CLIENTS;\""`
    - Mind only `gitlab` user connections
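A sketch of how you might filter the `SHOW CLIENTS` output down to just the `gitlab` user; the column layout may vary with the psql/PgBouncer versions in use:

```shell
# Keep only rows whose second pipe-separated column is the gitlab user,
# then count them. Illustrative; adjust the field index to your output.
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW CLIENTS;\"" \
  | awk -F'|' '$2 ~ /^ *gitlab *$/' | wc -l
```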
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

**Estimated Time to Complete (mins)** - 10 minutes
- [ ] Increase total connections on the old PGBouncer hosts
  - MR: revert all executed MRs to the last stable state
- [ ] Force a Chef client run to apply the changes:
  - Execute: `ssh_cluster_regex.sh "pgbouncer.*gstg" "sudo chef-client"`
- [ ] Check pool sizing for all pgbouncer nodes in gstg
  - Execute `ssh_cluster_regex.sh "pgbouncer.*gstg" "sudo pgb-console -c \"SHOW DATABASES;\""` (check `pool_size`)
- [ ] Disable the feature flag: `/chatops run feature set force_no_sharing_primary_model 0 --staging --random` (a quick drain check follows this list)
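After the rollback, a quick way to confirm the CI PgBouncers have stopped holding server connections to the primary; counts should trend towards zero as idle connections close (illustrative, same caveats as the earlier sketches):

```shell
# Server connections still open from the PgBouncer CI fleet; expect ~0 once
# the feature flag is back at 0. grep -c exits non-zero on zero matches,
# hence the trailing "|| true".
ssh_cluster_regex.sh "(pgbouncer-ci|pgbouncer-sidekiq-ci).*gstg" \
  "sudo pgb-console -c \"SHOW SERVERS;\"" | grep -cw gitlab || true
```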
## Monitoring

### Key metrics to observe
- Metric: Primary Total Connections & Activity Total
  - Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-prometheus=Global&var-environment=gstg&var-type=patroni
  - What changes to this metric should prompt a rollback: Exceeding 400 Primary Total Connections
- Metric: Sentry Errors
  - Location: https://sentry.gitlab.net/gitlab/gitlabcom/
  - What changes to this metric should prompt a rollback: New errors likely related to this change (timing, and related to database connections)
- PGBouncer Main
  - Metric: PGBouncer Main Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=2040732717&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer Main Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1&viewPanel=55&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
- PGBouncer CI
  - Metric: PGBouncer CI Error Ratio
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=2120868752&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: High error ratio > `0.1%` (for more than 10 minutes)
  - Metric: PGBouncer CI Saturation
    - Location: https://dashboards.gitlab.net/d/pgbouncer-ci-main/pgbouncer-ci-overview?orgId=1&viewPanel=541&var-PROMETHEUS_DS=Global&var-environment=gstg
    - What changes to this metric should prompt a rollback: > 90% saturation of any resource (for more than 10 minutes)
## Summary of infrastructure changes

- [ ] Does this change introduce new compute instances?
- [ ] Does this change re-size any existing compute instances?
- [ ] Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?

No, but it re-sizes the pgbouncer database pool sizes.
## Change Reviewer checklist

- [ ] The scheduled day and time of execution of the change is appropriate.
- [ ] The change plan is technically accurate.
- [ ] The change plan includes estimated timing values based on previous testing.
- [ ] The change plan includes a viable rollback plan.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- [ ] The change plan includes success measures for all steps/milestones during the execution.
- [ ] The change adequately minimizes risk within the environment/service.
- [ ] The performance implications of executing the change are well-understood and documented.
- [ ] The specified metrics/monitoring dashboards provide sufficient visibility for the change. If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- [ ] The change has a primary and secondary SRE with knowledge of the details available during the change window.
## Change Technician checklist

- [ ] This issue has a criticality label (e.g. C1, C2, C3, C4) and a change-type label (e.g. ~"change::unscheduled", ~"change::scheduled") based on the Change Management Criticalities.
- [ ] This issue has the change technician as the assignee.
- [ ] Pre-Change, Change, Post-Change, and Rollback steps have been filled out and reviewed.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] Necessary approvals have been completed based on the Change Management Workflow.
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] Release managers have been informed (if needed! cases include DB changes) prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents.