Investigate saturation of read-only replica PgBouncer (and lack of load distribution)
During production#1054 (closed), we observed the following behavior:
- At 8:30 AM UTC, patroni01 experienced a network issue that caused Patroni to fail over to patroni04
  - patroni01 ended up in a corrupted state and was offline
  - patroni04 became the new master
  - 1 master, 5 replicas
- patroni06 immediately became overloaded:
  - server connections maxed out at 100 (as configured)
  - active clients dropped
  - waiting connections increased
- however, all other replicas, while registering the failover blip, managed to stabilize nearly immediately:
The behavior of 06 wasn't expected. As an experiment, we took 06 out of the cluster temporarily. What we observed was 03 crater under the excess load:
When 01 was finally restored and added to the cluster as a read-only replica, we attempted the experiment of pulling 06 out of rotation. As 03 did, it cratered:
@stanhu checked the internal list of hosts (no screenshot or data saved) and it simply did not seem to have enough entropy. From memory, 05 was the first database replica in the list. After we added 01 to the cluster, 05 was still first in the list and 01 was last. With a sample of 1 observation, the combination we saw post rejoin is possible, but it seems unlikely. We expected not to see 05 in the same spot as before, and we didn't expect to see 01 last.
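To put a rough number on "possible, but unlikely", the sketch below computes the chance of the post-rejoin ordering under the assumption that the host list is shuffled uniformly at random. The list contents are an assumption reconstructed from the narrative (five read-only replicas, with 04 excluded as the master); the actual list was not saved, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations

# Assumed host list after 01 rejoined: the five read-only replicas.
# 04 is the master and is assumed not to appear in this list.
replicas = ["01", "02", "03", "05", "06"]


def probability_of_ordering(hosts, first, last):
    """Exact probability that a uniform random shuffle of `hosts`
    places `first` at the head and `last` at the tail."""
    orderings = list(permutations(hosts))
    matches = sum(1 for o in orderings if o[0] == first and o[-1] == last)
    return Fraction(matches, len(orderings))


p = probability_of_ordering(replicas, first="05", last="01")
print(p, float(p))  # prints "1/20 0.05"
```

Under these assumptions a uniform shuffle would produce the observed ordering about 5% of the time. That fits the reading above: a single observation cannot rule out chance, but it is consistent with the list not being reshuffled when a host is added.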
It is worth noting that once we were back to 5 replicas, we started seeing a slow recovery on 06:
Thus, there is clearly a capacity component to this riddle. As a precaution, we added another database replica, 07, in the hope that it will buy us some runway if another replica fails.




