2022-08-02: Increase max_client_conn for pgbouncer primary

Production Change

Change Summary

We are saturating max_client_conn for the Primary (read/write) hosts.

Source

Increase max_client_conn for pgbouncer[-ci]-{0,3} by 22% for the primary (read/write) connections.

Before: 24576 = (8192 * 3)
After: 30000 = (10000 * 3)
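
As a quick sanity check of the figures above, the totals and the quoted 22% can be recomputed as follows. This is only a sketch: the variable names are illustrative, and the factor of 3 is copied from the summary without assuming what it counts.

```python
# Values copied from the change summary; variable names are illustrative.
multiplier = 3              # the "* 3" factor from the summary (meaning not stated there)
before_per_unit = 8192      # current max_client_conn
after_per_unit = 10000      # proposed max_client_conn

before_total = before_per_unit * multiplier
after_total = after_per_unit * multiplier

# Relative increase, in percent, rounded to one decimal place.
increase_pct = round((after_total - before_total) / before_total * 100, 1)

print(before_total, after_total, increase_pct)  # 24576 30000 22.1
```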

Change Details

Services Impacted - ServicePgbouncer
Change Technician - @steveazz
Change Reviewer - @rhenchen.gitlab
Time tracking - 30 minutes
Downtime Component - none

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 25

Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 5

Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2159
Set the change::aborted label: /label ~change::aborted

Monitoring

Key metrics to observe

Metric: pgbouncer CPU usage
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(namedprocess_namegroup_cpu_seconds_total%7Benv%3D%22gprd%22%2Cfqdn%3D~%22pgbouncer.*%22%2Cgroupname%3D~%22pgbouncer.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: Hitting 100% CPU usage
Metric: Primary CPU usage main
- Location: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=gprd&var-node=patroni-v12-03-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: Increase in CPU usage
Metric: Primary CPU usage ci
- Location: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=gprd&var-node=patroni-ci-04-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: Increase in CPU usage
Metric: Total Time in Queries Per Node
- Location: https://dashboards.gitlab.net/explore?left=%7B%22datasource%22:%22Global%22,%22queries%22:%5B%7B%22expr%22:%22sum(rate(pgbouncer_stats_queries_duration_seconds_total%7Btype%3D~%5C%22pgbouncer.*%5C%22,%20environment%3D%5C%22gprd%5C%22%7D%5B$__interval%5D))%20by%20(fqdn)%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%221m%22,%22intervalFactor%22:2,%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D,%22hide%22:false%7D%5D,%22range%22:%7B%22from%22:%22now-6h%2Fm%22,%22to%22:%22now%2Fm%22%7D%7D&orgId=1
- What changes to this metric should prompt a rollback: Increase in query time
Metric: Total Connection Wait Time
- Location: https://dashboards.gitlab.net/explore?left=%7B%22datasource%22:%22Global%22,%22queries%22:%5B%7B%22expr%22:%22sum%20by%20(database,%20environment,%20type)%20(rate(pgbouncer_stats_client_wait_seconds_total%7Btype%3D~%5C%22pgbouncer.%5C%22,%20environment%3D%5C%22gprd%5C%22,%20database!%3D%5C%22pgbouncer%5C%22%7D%5B$__interval%5D)%20%2F%20on()%20group_left()%20(vector((time()%20%3C%20bool%201588233600)%20%201000000)%20%3D%3D%201000000%20or%20vector(1)))%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%221m%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-6h%2Fm%22,%22to%22:%22now%2Fm%22%7D%7D&orgId=1
- What changes to this metric should prompt a rollback: Increase in connection wait time
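
The Location links above carry their PromQL expressions URL-encoded in the query string, which makes them hard to read during a review. A minimal sketch for recovering the expression from the pgbouncer CPU usage link, using only the Python standard library, is shown here. The helper name `extract_promql` is hypothetical, and the URL is trimmed to its first two parameters for brevity.

```python
from urllib.parse import urlparse, parse_qs

# The pgbouncer CPU usage link from the Monitoring list, trimmed after g0.tab.
THANOS_URL = (
    "https://thanos-query.ops.gitlab.net/graph?g0.expr=rate("
    "namedprocess_namegroup_cpu_seconds_total%7Benv%3D%22gprd%22"
    "%2Cfqdn%3D~%22pgbouncer.*%22%2Cgroupname%3D~%22pgbouncer.%2B%22"
    "%7D%5B1m%5D)&g0.tab=0"
)


def extract_promql(url: str, param: str = "g0.expr") -> str:
    """Return the percent-decoded value of `param` from the URL's query string."""
    query = parse_qs(urlparse(url).query)
    return query[param][0]


print(extract_promql(THANOS_URL))
# rate(namedprocess_namegroup_cpu_seconds_total{env="gprd",fqdn=~"pgbouncer.*",groupname=~"pgbouncer.+"}[1m])
```

The Grafana Explore links nest the expression inside a JSON document in the `left` parameter, so they would need an extra `json.loads` step after decoding.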

Change Reviewer checklist

C4 C3 C2 C1:

Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
  - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

Edited Aug 04, 2022 by Steve Xuereb