Investigate saturation risk for PgBouncer client connections

Problem

Starting on 2021-12-14, the number of client connections to PgBouncer abruptly increased by 30% of total capacity. The increase was unexpected, and capacity was not proactively added to compensate for it.

(Screenshot attachment: Screenshot_from_2022-01-10_09-50-56)

Mitigation

We can raise the configured maximum client connections again (and did so today), but to support better capacity planning and safety checks, we want to understand why such a large fraction of our capacity was consumed.
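For reference, the limit we raised corresponds to the `max_client_conn` setting in `pgbouncer.ini`; the value below is illustrative, not our production setting:

```ini
[pgbouncer]
; Hard cap on simultaneous client connections across all pools.
; Connection attempts beyond this limit are rejected.
; Illustrative value only, not our production configuration.
max_client_conn = 4096
```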

Why this matters

For context, there are no user-facing effects until the saturation point (100% utilization) is reached. At that point, puma threads will intermittently fail to connect to PgBouncer. That connection error should be a fast failure, and it may (at least sometimes) be hidden from users by our client-side code falling back to another replica db or the primary db. But those protections only go so far, and they mask the real problem. So before reaching that condition, we want to understand what configuration or workload pattern caused the abrupt increase in client connections to PgBouncer.

Prior work

We started to investigate this in December, reopening the issue where we had previously increased capacity due to organic growth. However, the change on 2021-12-14 turned out not to be organic growth. The following notes walk through the methodology and results of the initial investigation:

Results summary from initial investigation:

  • The webservice-web fleet had the largest number of connections per pod (all pods had over 20 connections).
  • The webservice-git fleet had the largest number of pods.
  • Either of the above could potentially explain why pgbouncer sees more client connections at all times of day, even during weekends, when pod count is at its weekly low point.
    • web: Was there a configuration change that implicitly added more db connections per pod?
    • git: Was there a configuration or workload change that added more of these pods?
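To attribute client connections to pods, one approach is to group the output of PgBouncer's `SHOW CLIENTS` admin command by source address. A minimal sketch of the counting step, using made-up sample lines shaped like the user and `addr` columns (in production the input would come from `psql` against the pgbouncer admin console):

```shell
# Illustrative input: user and addr columns from pgbouncer's SHOW CLIENTS
# output (values below are made up). In production this would come from e.g.:
#   psql -h PGBOUNCER_HOST -p 6432 -U pgbouncer pgbouncer -c 'SHOW CLIENTS;'
cat > clients.txt <<'EOF'
gitlab | 10.0.1.5
gitlab | 10.0.1.5
gitlab | 10.0.2.9
EOF

# Count connections per source address; the busiest addresses map back to
# the pods holding the most pgbouncer client connections.
awk -F'|' '{gsub(/ /, "", $2); count[$2]++}
           END {for (a in count) print count[a], a}' clients.txt | sort -rn
```

Cross-referencing the top addresses against the webservice-web and webservice-git pod IP ranges would show which fleet is driving the per-pod connection count.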