More scalable database connection pooling (currently pgbouncer)
We've had significant degradation of database query latencies when we scaled up the web worker fleet and/or bumped pgbouncer's max_client_conn
. Incident ticket: production#895 (closed)
The current pgbouncer setup is that we run one pgbouncer process per database instance. pgbouncer connects to the database on the same instance it runs on.
We know those processes are reaching their limits in terms of CPU (pegging one core). Since pgbouncer is single-threaded, we know have to think about taking the architecture one step further to be able to add more clients to pgbouncer (e.g. web nodes, sidekiq).
Note that we may have to come up with a short-term solution to mitigate the ongoing issue and we're looking for both a short- and long-term solution (should those be different).