Why do the PgBouncer hosts have a very uneven distribution of client connections?
Problem:
A large majority of database client connections are being handled by a single PgBouncer instance, even though 2 PgBouncer instances are active backends in our Google Internal TCP Load Balancer. This imbalance defeats the purpose of the change: relieving the bottleneck of PgBouncer's CPU saturation.
Goal:
Learn why traffic is not being roughly equally distributed among the active members of the Google Internal TCP Load Balancer (ILB). This is a research task that should result in a recommendation.
Alternatives:
If we cannot fix this imbalance, we could instead use HAProxy as a load balancer in front of several PgBouncer instances. This would also let us run multiple PgBouncer instances per host. However, unlike the Google ILB, HAProxy would be an additional TCP hop, adding network latency and another component to manage.
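For concreteness, a minimal sketch of what that HAProxy alternative could look like. The hostnames and ports here are assumptions (PgBouncer's default port is 6432); the key point is that HAProxy can balance by live connection count (`leastconn`) instead of a hash:

```
# haproxy.cfg sketch -- hostnames/ports are assumptions, not our actual config
listen pgbouncer_pool
    bind *:6432
    mode tcp
    balance leastconn    # route each new connection to the backend with fewest active connections
    server pgbouncer-02 pgbouncer-02:6432 check
    server pgbouncer-03 pgbouncer-03:6432 check
```

Because `leastconn` tracks actual connection counts, it would avoid the hash-driven skew entirely, at the cost of HAProxy itself becoming a proxy hop to operate and capacity-plan.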
Background:
- Recently (~4-8 weeks ago), in response to a performance problem, we changed how database clients connect to the primary Postgres instance in our Patroni cluster.
- Previously, all clients needing write-access to Postgres would connect to a single PgBouncer instance running on the same host as the primary Postgres instance.
- PgBouncer is a single-threaded process, so it can use at most 1 CPU's worth of compute cycles.
- That PgBouncer instance sometimes saturated its single-CPU ceiling under our peak daily workload.
- To spread that workload among more CPUs, we needed to add more PgBouncer instances and treat them as a pool.
- We reconfigured DB clients to connect to the primary Postgres database via a Google Internal TCP Load Balancer (ILB) that points to a pool of (currently) 2 active PgBouncer instances.
- The ILB is not a proxy; it is a set of network routing rules that lets clients talk directly to the backend service instances (PgBouncer hosts).
- The ILB's backend service actually contains 2 active hosts and 1 inactive host. Only the 2 active hosts matter for this analysis.
- Surprisingly, the number of established connections from clients to PgBouncer is very unevenly distributed between the 2 active PgBouncer hosts.
- This unevenness defeats our goal of spreading the workload among multiple PgBouncer instances.
- Clients whose hostnames start with "web" and "git" show the strongest bias, preferring host "pgbouncer-03" over "pgbouncer-02".
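One mechanism worth checking (an assumption to verify, not a confirmed diagnosis): if the ILB's backend service is configured with client-IP session affinity, the backend is chosen from a hash of the client's address alone, so every connection from a given client host lands on the same PgBouncer. With only a handful of heavily-connecting client hosts (as the "web"/"git" pattern suggests), the split can land far from 50/50. A minimal simulation of that behavior, with made-up client names and connection counts:

```python
import hashlib

# Hypothetical client hosts and their connection counts (illustrative only).
clients = {
    "web-01": 400, "web-02": 380, "web-03": 410,
    "git-01": 350, "git-02": 360,
    "sidekiq-01": 50, "api-01": 60,
}

backends = ["pgbouncer-02", "pgbouncer-03"]

def pick_backend(client_identity: str) -> str:
    """Stand-in for client-IP affinity: hash only the client identity,
    so all connections from one client map to the same backend."""
    digest = int(hashlib.md5(client_identity.encode()).hexdigest(), 16)
    return backends[digest % len(backends)]

# Tally connections per backend under client-identity hashing.
totals = {b: 0 for b in backends}
for client, conns in clients.items():
    totals[pick_backend(client)] += conns

print(totals)
```

With per-connection 5-tuple hashing (no session affinity), each new connection's ephemeral source port varies, so a large connection count tends toward an even split; with client-identity hashing, the split is decided by how the few client hosts happen to hash. Verifying the backend service's session affinity setting would be a natural first step of the research.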