Multithreaded runtimes - Dynamic pool size scaling logic amendments
For Puma, which is multi-threaded, we added some logic to scale connection pools with thread count (see this MR as well as this follow-up issue).
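As an illustration of what "scaling connection pools with thread count" means in practice, here is a minimal sketch. It assumes a recent Rails (6.1+) and that the thread count can be read from Puma's or Sidekiq's configuration; the file name and placement are assumptions for this sketch, not the actual MR.

```ruby
# config/initializers/database_connection_pool.rb (hypothetical placement)
#
# Scale the ActiveRecord connection pool with the number of worker threads,
# so that every request/job thread can check out its own connection.
max_threads =
  if defined?(::Puma) && ::Puma.respond_to?(:cli_config) && ::Puma.cli_config
    ::Puma.cli_config.options[:max_threads]
  elsif defined?(::Sidekiq)
    ::Sidekiq.options[:concurrency] # Sidekiq < 7 API
  else
    1 # single-threaded fallback (rake tasks, console, etc.)
  end

# Rails 6.1+: copy the active DB config and override only the pool size.
db_config = ActiveRecord::Base.connection_db_config.configuration_hash.dup
db_config[:pool] = max_threads

ActiveRecord::Base.establish_connection(db_config)
```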
However, we ran into a number of issues with the initial logic, such as:
- Due to an ordering problem in the Rails initializers, Geo node connections would not get properly scaled, leading to connectivity issues (issue)
- Due to low concurrency and low default settings on some systems, we ran into similar connectivity issues (RCA)
Furthermore, these issues apply to all multi-threaded runtimes, such as Sidekiq, not just Puma.
A number of suggestions have been made to prevent this from happening again:
- Add a check that detects degenerate pool sizes after all initializers have run (Issue, MR). A sketch of this check, together with the boundaries below, follows the list.
- Install a lower boundary for connection pools. Historically we had used a value of `5`, although it's unclear where this recommendation came from. (Issue)
- Install an upper boundary for connection pools. As @msmiley pointed out, this could otherwise create contention at the pgbouncer level (the remote end) if too many threads attempt to transfer data over the wire simultaneously. For higher concurrencies, it would also be worth adding a few more connections to the pool than the concurrency requires, as insurance against deadlocks.
- We should report connection pool utilisation per process through Prometheus as a gauge metric, sampled on an interval (e.g. every 10 seconds?). We should also record the maximum pool size, so that we can add connection pool saturation as a saturation metric. (Issue) A rough sketch of such an exporter is included at the end of this section.
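Taken together, the boundary suggestions and the degenerate-pool check amount to clamping the derived pool size and then verifying the result once all initializers have run. A minimal sketch follows; the minimum of 5 is the historical value mentioned above, while the maximum and headroom constants are placeholders, not agreed numbers.

```ruby
# Placeholder boundaries for the pool size derived from the thread count.
MIN_POOL_SIZE = 5   # historical default
MAX_POOL_SIZE = 32  # placeholder cap to avoid piling up connections behind pgbouncer
POOL_HEADROOM = 3   # a few spare connections as insurance against deadlocks

def scaled_pool_size(max_threads)
  (max_threads + POOL_HEADROOM).clamp(MIN_POOL_SIZE, MAX_POOL_SIZE)
end

# Degenerate-pool check: warn when the effective pool ended up below the floor.
Rails.application.config.after_initialize do
  pool = ActiveRecord::Base.connection_pool

  if pool.size < MIN_POOL_SIZE
    Rails.logger.warn("Degenerate DB pool size: #{pool.size} (expected >= #{MIN_POOL_SIZE})")
  end
end
```

Running the check in `after_initialize` means it only fires once every other initializer has had a chance to reconfigure the connection, in line with the "after all initializers have run" suggestion above.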
We will focus on DB connections in this issue, not Redis.
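Regarding the pool utilisation metric, a rough sketch of what a per-process exporter loop could look like is shown below. It assumes the `prometheus-client` gem (v2+); the metric names, the background thread, and the 10-second interval are placeholders mirroring the suggestion above, not GitLab's actual metrics plumbing.

```ruby
require 'prometheus/client'

registry = Prometheus::Client.registry

pool_size = Prometheus::Client::Gauge.new(
  :db_connection_pool_size,
  docstring: 'Configured maximum size of the ActiveRecord connection pool'
)
pool_busy = Prometheus::Client::Gauge.new(
  :db_connection_pool_busy,
  docstring: 'Connections currently checked out of the pool'
)
registry.register(pool_size)
registry.register(pool_busy)

# Sample the pool on an interval and export gauges for size and utilisation.
Thread.new do
  loop do
    stat = ActiveRecord::Base.connection_pool.stat # { size:, busy:, idle:, waiting:, ... }
    pool_size.set(stat[:size])
    pool_busy.set(stat[:busy])
    sleep 10
  end
end
```

With both gauges exported, connection pool saturation can then be derived in PromQL as `db_connection_pool_busy / db_connection_pool_size`.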