GitLab Connection Pool 1/thread sizing causes eventual bottleneck for replicas

Problem

From the connection saturation dashboard (screenshot: Screen_Shot_2022-06-09_at_3.55.16_pm) we can see that we are reaching ~60% saturation of client connections during peak hours.

These client connections are from GitLab rails processes to PGBouncer processes running on the Patroni nodes.

This 60% saturation is despite the fact that there is plenty of headroom with the large number of Patroni replicas we have.

This problem cannot be fixed by adding more PGBouncer processes nor by adding more replica nodes (even though both of these should, in theory, be horizontally scalable). This problem (as I'll explain) is caused by an architectural decision we made in the Rails application that we can and should fix before we start saturating connections.

We also noticed this problem recently in incident #7139 (closed). We learned not to try that again, but it still showed that there is a bottleneck waiting for us in the future.

How the problem occurs in detail

There is a good summary of our architecture at https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/pgbouncer/patroni-consul-postgres-pgbouncer-interactions.md .

Given our current architecture, each Rails process starts up and queries the Consul DNS service for db-replica.service.consul. It gets back a result for every PGBouncer on every replica node. Each replica node runs 3 pgbouncers, so if there are 8 replicas this is 8x3=24 results. Rails then opens a connection pool to every one of these PGBouncers: 24 connection pools. For the pool size it uses the number of threads of the Rails process, so if we assume we run Puma with 50 threads then each Rails process is opening 50*24 = 1200 connections to pgbouncers.

When we see spikes in traffic our K8s cluster autoscales up the number of rails pods. If we have 1000 rails pods then this is 1200*1000 = 1.2M connections to PGBouncer.
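The fan-out above can be sketched with back-of-the-envelope arithmetic (the counts are the illustrative assumptions from this issue, not values read from production config):

```ruby
# Illustrative fan-out math; every count here is an assumption from the text.
replicas               = 8    # Patroni replica nodes returned by Consul DNS
pgbouncers_per_replica = 3    # pgbouncer processes per replica node
threads_per_process    = 50   # assumed Puma threads => pool size of each pool
rails_pods             = 1000 # K8s pods after autoscaling up

pools_per_process        = replicas * pgbouncers_per_replica        # 24 pools
connections_per_process  = pools_per_process * threads_per_process  # 1_200
total_client_connections = connections_per_process * rails_pods     # 1_200_000

puts "pools per Rails process:      #{pools_per_process}"
puts "connections per Rails proc:   #{connections_per_process}"
puts "total pgbouncer client conns: #{total_client_connections}"
```

The key point is that every term multiplies: adding replicas, pgbouncers, threads, or pods all grow the client connection count.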

Client connections to PGBouncer are generally cheap, except that PGBouncer runs single-threaded, so we have a hardcoded limit of 20k connections per pgbouncer.

And we can't fix this problem by just spinning up more pgbouncers, because each Rails process will see each new one as another server and just create more connections.

So ultimately we will reach saturation when GitLab.com traffic is high enough that the number of K8s pods needed to keep up with traffic exceeds a certain number. The limit on the number of pods we can run would be:

20000 / num_threads_per_pod
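With the assumed 50 Puma threads per pod, that formula gives a hypothetical ceiling of:

```ruby
# Pod ceiling implied by the single-pgbouncer connection limit above.
pgbouncer_client_limit = 20_000  # hardcoded per-pgbouncer limit (single-threaded)
threads_per_pod        = 50      # assumed Puma thread count per Rails pod

max_pods = pgbouncer_client_limit / threads_per_pod
puts "max Rails pods before saturating one pgbouncer: #{max_pods}"  # 400
```

Note the replica count drops out of this formula: because every pod connects to every pgbouncer, each individual pgbouncer sees `threads_per_pod` connections from each pod regardless of how many pgbouncers exist.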

Note on CI Decomposition

This problem is neither improved nor made worse by CI decomposition. CI decomposition means that each Rails process actually creates a set of connection pools for both main and ci, so it doubles the number of client connections from our Rails processes. But these new client connections go to a separate set of PGBouncers/Patroni hosts, so they don't bring us closer to the 20k limit.

Solution

We are being tremendously wasteful with how we open connections. However, we cannot simply reduce the pool sizes, because there is a risk that one pool's replica is the most up to date and the Rails process then needs 1 connection per thread in that single pool.

One solution might be to put a limit on the number of replicas each Rails process can connect to, such that if more than 10 replicas are available, the process randomly chooses 10 of them. This would mean each Rails process only ever creates at most 10*num_threads client connections, and we could then increase the number of replicas or the number of pgbouncers without increasing the number of client connections.
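A minimal sketch of that idea, assuming a hypothetical hook where the load balancer has just received the full host list from Consul DNS (`choose_hosts` and `MAX_REPLICA_HOSTS` are made-up names for illustration, not existing GitLab code):

```ruby
# Sketch: cap how many replica pgbouncers one Rails process connects to.
# Each process samples its subset once at boot, so adding more replicas
# or pgbouncers no longer multiplies the client connections it opens.
MAX_REPLICA_HOSTS = 10 # hypothetical cap, per the proposal above

def choose_hosts(all_hosts, limit: MAX_REPLICA_HOSTS)
  return all_hosts if all_hosts.size <= limit

  all_hosts.sample(limit) # random subset spreads load across processes
end

# Example: 8 replicas x 3 pgbouncers = 24 candidate hosts, but only 10 used.
hosts = (1..8).flat_map { |r| (1..3).map { |p| "replica#{r}-pgbouncer#{p}" } }
puts choose_hosts(hosts).size # 10
```

Because each process samples independently, the aggregate load should still spread across all replicas even though no single process talks to all of them.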

This solution is likely sufficient when all replicas are keeping up to date, but it might occasionally mean some Rails processes are unlucky and get a set of replicas that are quite far behind. This would be inefficient, because Rails will use those replicas less as it determines whether a replica is up to date enough for a given workload (user). Solving that problem may require a more sophisticated approach to deciding which replicas to connect to, or to sizing the connection pools for each replica.
