Investigate saturation risk for PgBouncer client connections
Problem
Starting on 2021-12-14, the number of client connections to pgbouncer abruptly increased by 30% of its total capacity. The increase was unexpected, and capacity was not proactively added to compensate for it.
Mitigation
We can (and today did) increase the configured maximum client connections again, but to support better capacity planning and safety checks, we want to understand why such a large percentage of our capacity was consumed.
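For reference, the limit involved is pgbouncer's `max_client_conn` setting. A minimal illustrative `pgbouncer.ini` fragment (the values here are hypothetical, not our production settings):

```ini
; Illustrative pgbouncer.ini fragment; values are hypothetical.
[pgbouncer]
max_client_conn = 4096   ; hard cap on client (puma-side) connections
default_pool_size = 100  ; server-side pool size per database/user pair
```

Raising `max_client_conn` buys headroom but does not explain where the extra client connections came from, which is the point of this issue.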
Why this matters
For context, there are no user-facing effects until the saturation point (100% utilization) is reached. At that point, puma threads will intermittently fail to connect to pgbouncer because the limit has been reached. This connection error should be a fast failure, and it may (at least sometimes) be hidden from users by our client-side code falling back to either another replica db or the primary db. But those protections only go so far, and they mask the real problem. So before reaching that condition, we want to understand which configuration or workload pattern caused the abrupt increase in the number of client connections to pgbouncer.
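The headroom reasoning above can be sketched as a small utilization check. This is a rough sketch, not our monitoring code: it assumes rows in the shape of pgbouncer's `SHOW POOLS` admin command (both `cl_active` and `cl_waiting` clients count against `max_client_conn`), and the sample rows and limit are hypothetical.

```python
# Sketch: estimate client-connection utilization against max_client_conn.
# Sample data and the limit below are hypothetical, for illustration only.

MAX_CLIENT_CONN = 1000  # hypothetical configured pgbouncer limit

# Rows reduced to the relevant SHOW POOLS columns:
# (database, user, cl_active, cl_waiting)
sample_rows = [
    ("gitlabhq_production", "gitlab", 620, 5),
    ("gitlabhq_production", "gitlab_geo", 110, 0),
]

def utilization(rows, max_client_conn):
    """Fraction of max_client_conn consumed by client connections.

    Active and waiting clients both hold a client slot, so both count
    toward the saturation point.
    """
    clients = sum(cl_active + cl_waiting for _, _, cl_active, cl_waiting in rows)
    return clients / max_client_conn

print(f"{utilization(sample_rows, MAX_CLIENT_CONN):.1%}")
```

A check like this, run against each pgbouncer instance, is what would trip a capacity alert well before the 100% mark where puma connection failures begin.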
Prior work
We started to investigate this in December, reopening the issue where we had previously increased the capacity due to organic growth. However, the change on 2021-12-14 turned out not to be organic growth. The following notes walk through the methodology and results of the initial investigation:
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14677#note_791874834
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14677#note_791930144
Results summary from the initial investigation:
- The webservice-web fleet had the largest number of connections per pod (all pods had over 20 connections).
- The webservice-git fleet had the largest number of pods.
- Either of the above could potentially explain why pgbouncer sees more client connections at all times of day, even during the weekend when pod count is at its weekly low point.
Open questions from the initial investigation:
- web: Was there a configuration change that implicitly added more db connections per pod?
- git: Was there a configuration or workload change that added more of these pods?
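One way to pursue both questions is to attribute pgbouncer's client connections back to their source fleet, e.g. by joining `SHOW CLIENTS` source addresses against the kubernetes pod inventory and counting by pod-name prefix. A minimal sketch of that grouping step, with entirely hypothetical pod names:

```python
from collections import Counter

# Hypothetical pod names, one entry per client connection, as might be
# produced by joining pgbouncer's SHOW CLIENTS addresses to pod IPs.
client_pods = [
    "webservice-web-abc12",
    "webservice-web-def34",
    "webservice-git-ghi56",
    "webservice-git-jkl78",
    "webservice-git-mno90",
]

def connections_per_fleet(pods):
    """Count client connections per fleet, keyed by pod-name prefix.

    Assumes the fleet is encoded in the first two dash-separated
    segments of the pod name (e.g. "webservice-web").
    """
    return Counter("-".join(pod.split("-")[:2]) for pod in pods)

print(connections_per_fleet(client_pods))
```

Re-running a tally like this before and after 2021-12-14 would show whether the jump came from more connections per web pod, more git pods, or both.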
