2022-08-02: Increase max_client_conn for pgbouncer primary
Production Change
Change Summary
We are saturating max_client_conn for the Primary (read/write) hosts.
Increase max_client_conn for pgbouncer[-ci]-{0,3} by 22% for the primary connections (reads and writes).
Before: 24576 = (8192 * 3)
After: 30000 = (10000 * 3)
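The 22% figure can be checked against the before and after totals. This is a minimal sketch that uses only the numbers from the summary above; the variable names are illustrative.

```python
# Per-host limits and the multiplier of 3 are taken from the change summary.
HOSTS = 3
BEFORE_PER_HOST = 8192
AFTER_PER_HOST = 10000

before_total = BEFORE_PER_HOST * HOSTS
after_total = AFTER_PER_HOST * HOSTS

# Relative increase of the total client connection limit, in percent.
increase_pct = (after_total - before_total) / before_total * 100

print(f"before={before_total} after={after_total} increase={increase_pct:.1f}%")
# prints: before=24576 after=30000 increase=22.1%
```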
- Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16121
- Analysis: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16121#note_1047569957
Change Details
- Services Impacted - Service::Pgbouncer
- Change Technician - @steveazz
- Change Reviewer - @rhenchen.gitlab
- Time tracking - 30 minutes
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 25
- Set label change::in-progress: /label ~change::in-progress
- Disable chef-client on all nodes: knife ssh "roles:gprd-base-db-pgbouncer-pool" -- sudo chef-client-disable 'https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7536'
- Merge https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2159
- Run on sidekiq:
  - knife ssh "roles:gprd-base-db-pgbouncer-sidekiq" -- sudo chef-client-enable
  - knife ssh "roles:gprd-base-db-pgbouncer-sidekiq" -- sudo chef-client
  - knife ssh 'roles:gprd-base-db-pgbouncer-sidekiq' -- sudo grep 'max_client_conn' /var/opt/gitlab/pgbouncer/pgbouncer.ini (output should be 10000)
- Monitor for 10 minutes
- Run on ci:
  - knife ssh "roles:gprd-base-db-pgbouncer-ci" -- sudo chef-client-enable
  - knife ssh "roles:gprd-base-db-pgbouncer-ci" -- sudo chef-client
  - knife ssh 'roles:gprd-base-db-pgbouncer-ci' -- sudo grep 'max_client_conn' /var/opt/gitlab/pgbouncer/pgbouncer.ini (output should be 10000)
- Monitor for 10 minutes
- Slowly roll out on the rest of the fleet:
  - knife ssh "roles:gprd-base-db-pgbouncer-pool" -- sudo chef-client-enable
  - knife ssh -C 1 "roles:gprd-base-db-pgbouncer-pool" -- sudo chef-client
- Merge gitlab-com/runbooks!4871 (merged)
- Set label change::complete: /label ~change::complete
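The grep verification in the steps above could also be done by parsing the file, which avoids matching commented-out lines. This is a rough sketch, assuming max_client_conn sits in the [pgbouncer] section as in a standard PgBouncer configuration. The sample content and the function name read_max_client_conn are hypothetical, not taken from a production host.

```python
import configparser

# Hypothetical sample content; a real pgbouncer.ini would be read from
# /var/opt/gitlab/pgbouncer/pgbouncer.ini on the node.
SAMPLE_INI = """\
[databases]
example_db = host=127.0.0.1 port=5432

[pgbouncer]
max_client_conn = 10000
"""


def read_max_client_conn(ini_text: str) -> int:
    """Return max_client_conn from the [pgbouncer] section of INI text."""
    # interpolation=None keeps any '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return parser.getint("pgbouncer", "max_client_conn")


EXPECTED = 10000  # per-host value after the change
actual = read_max_client_conn(SAMPLE_INI)
print("ok" if actual == EXPECTED else f"unexpected value: {actual}")
```

To run this against a real node, the file contents would be passed in place of SAMPLE_INI. The grep in the change steps remains the simpler check during the rollout itself.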
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 5
- Revert https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/2159
- Set label change::aborted: /label ~change::aborted
Monitoring
Key metrics to observe
- Metric: pgbouncer CPU usage
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=rate(namedprocess_namegroup_cpu_seconds_total%7Benv%3D%22gprd%22%2Cfqdn%3D~%22pgbouncer.*%22%2Cgroupname%3D~%22pgbouncer.%2B%22%7D%5B1m%5D)&g0.tab=0&g0.stacked=0&g0.range_input=1d&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: Hitting 100% CPU usage
- Metric: Primary CPU usage (main)
- Location: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=gprd&var-node=patroni-v12-03-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: Increase in CPU usage
- Metric: Primary CPU usage (ci)
- Location: https://dashboards.gitlab.net/d/bd2Kl9Imk/host-stats?orgId=1&var-env=gprd&var-node=patroni-ci-04-db-gprd.c.gitlab-production.internal
- What changes to this metric should prompt a rollback: Increase in CPU usage
- Metric: Total Time in Queries Per Node
- Location: https://dashboards.gitlab.net/explore?left=%7B%22datasource%22:%22Global%22,%22queries%22:%5B%7B%22expr%22:%22sum(rate(pgbouncer_stats_queries_duration_seconds_total%7Btype%3D~%5C%22pgbouncer.*%5C%22,%20environment%3D%5C%22gprd%5C%22%7D%5B$__interval%5D))%20by%20(fqdn)%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%221m%22,%22intervalFactor%22:2,%22refId%22:%22B%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D,%22hide%22:false%7D%5D,%22range%22:%7B%22from%22:%22now-6h%2Fm%22,%22to%22:%22now%2Fm%22%7D%7D&orgId=1
- What changes to this metric should prompt a rollback: Increase in query time
- Metric: Total Connection Wait Time
- Location: https://dashboards.gitlab.net/explore?left=%7B%22datasource%22:%22Global%22,%22queries%22:%5B%7B%22expr%22:%22sum%20by%20(database,%20environment,%20type)%20(rate(pgbouncer_stats_client_wait_seconds_total%7Btype%3D~%5C%22pgbouncer.%5C%22,%20environment%3D%5C%22gprd%5C%22,%20database!%3D%5C%22pgbouncer%5C%22%7D%5B$__interval%5D)%20%2F%20on()%20group_left()%20(vector((time()%20%3C%20bool%201588233600)%20%201000000)%20%3D%3D%201000000%20or%20vector(1)))%5Cn%22,%22format%22:%22time_series%22,%22interval%22:%221m%22,%22intervalFactor%22:1,%22refId%22:%22A%22,%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22PA258B30F88C30650%22%7D%7D%5D,%22range%22:%7B%22from%22:%22now-6h%2Fm%22,%22to%22:%22now%2Fm%22%7D%7D&orgId=1
- What changes to this metric should prompt a rollback: Increase in connection wait time
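The Location links above embed their queries URL-encoded, which makes them hard to read at a glance. As a small sketch, the PromQL behind the pgbouncer CPU usage link can be recovered with the Python standard library. The URL below is the one from the first metric, shortened to its first two query parameters; the helper name extract_expr is illustrative.

```python
from urllib.parse import parse_qs, urlsplit

# pgbouncer CPU usage link from the list above, truncated after g0.tab.
LOCATION = (
    "https://thanos-query.ops.gitlab.net/graph?g0.expr=rate("
    "namedprocess_namegroup_cpu_seconds_total%7Benv%3D%22gprd%22%2C"
    "fqdn%3D~%22pgbouncer.*%22%2Cgroupname%3D~%22pgbouncer.%2B%22%7D"
    "%5B1m%5D)&g0.tab=0"
)


def extract_expr(url: str, param: str = "g0.expr") -> str:
    """Return the decoded value of one query parameter from a URL."""
    query = urlsplit(url).query
    return parse_qs(query)[param][0]


expr = extract_expr(LOCATION)
print(expr)
# prints: rate(namedprocess_namegroup_cpu_seconds_total{env="gprd",fqdn=~"pgbouncer.*",groupname=~"pgbouncer.+"}[1m])
```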
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The labels "blocks deployments" and/or "blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
  - Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are severity::1 or severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Steve Xuereb