
Investigate saturation of read-only replica PgBouncer (and lack of load distribution)

During production#1054 (closed), we observed the following behavior:

  • At 08:30 UTC, patroni01 experienced a network issue that caused Patroni to fail over to patroni04

    • patroni01 ended up in a corrupted state and was offline
    • patroni04 became the new master
    • 1 master, 5 replicas
  • patroni06 immediately became overloaded (see the pool-inspection sketch below):

    • server connections maxed out at 100 (as configured)
    • active clients dropped
    • waiting connections increased

Screen_Shot_2019-08-14_at_9.22.28_PM

  • however, all other replicas, while registering the failover blip, managed to stabilize nearly immediately:

Screen_Shot_2019-08-14_at_9.27.34_PM

The behavior of 06 wasn't expected. As an experiment, we took 06 out of the cluster temporarily, and observed 03 crater under the excess load:

Screen_Shot_2019-08-15_at_12.10.40_AM

When 01 was finally restored and added to the cluster as a read-only replica, we repeated the experiment of pulling 06 out of the rotation. Just as 03 had, it cratered:

Screen_Shot_2019-08-15_at_12.12.25_AM

@stanhu checked the internal list of hosts (no screenshot or data saved) and its ordering simply did not seem to have enough entropy. From memory, 05 was the first database replica in the list. After we added 01 to the cluster, 05 was still first in the list and 01 was last. With a sample of one observation, the ordering we saw post-rejoin is certainly possible, but it seems unlikely. We expected not to see 05 in the same spot as before, and we didn't expect to see 01 last.
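
This matters because, presumably, read-only connections are handed out by walking that host list; if every process sees the same fixed ordering, the hosts at the head of the list absorb a disproportionate share of new connections. Below is a minimal sketch of the kind of per-process shuffling one would expect, using hypothetical names and a naive round-robin chooser (this is not the actual load-balancer implementation):

```python
# Minimal sketch: why the ordering of the replica host list matters.
# Hypothetical names and a naive round-robin chooser -- not the actual
# load-balancer implementation.
import random
from itertools import cycle

# Ordering as observed from memory: 05 first, 01 last.
HOSTS = ["patroni05", "patroni02", "patroni03", "patroni06", "patroni01"]

class ReplicaChooser:
    def __init__(self, hosts, shuffle=True):
        hosts = list(hosts)
        if shuffle:
            # A per-process shuffle means the head of the configured list is
            # not the first host every process connects to.
            random.shuffle(hosts)
        self._ring = cycle(hosts)

    def next_host(self):
        return next(self._ring)

# Simulate 1000 processes each opening their first read-only connection.
for shuffle in (False, True):
    counts = {h: 0 for h in HOSTS}
    for _ in range(1000):
        counts[ReplicaChooser(HOSTS, shuffle=shuffle).next_host()] += 1
    print("shuffled" if shuffle else "fixed order", counts)
# With a fixed ordering, all 1000 connections land on patroni05; shuffled,
# they spread roughly evenly across the five replicas.
```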

It is worth noting that once we were back to 5 replicas, we started seeing a slow recovery on 06:

Screen_Shot_2019-08-15_at_12.23.27_AM

Thus, there is clearly a capacity component to this riddle. As a precaution, we added another database replica, 07, in the hope that it buys us some runway should another replica fail.
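
The capacity arithmetic behind that decision can be made concrete with rough numbers. Each replica's PgBouncer caps out at 100 server connections, so losing one of N replicas raises each survivor's share of the read load by a factor of N/(N-1), assuming the load is spread evenly (which the host-list observation above suggests it was not). The total demand figure below is an assumption, not a measured value:

```python
# Rough capacity arithmetic with illustrative numbers (TOTAL_DEMAND is an
# assumption, not a measured value), assuming evenly distributed read load.
POOL_CAP = 100       # per-replica PgBouncer server-connection limit, as configured
TOTAL_DEMAND = 450   # assumed steady-state read demand, in server connections

for replicas in (6, 5, 4):
    per_replica = TOTAL_DEMAND / replicas
    status = "OK" if per_replica <= POOL_CAP else "saturated: clients start waiting"
    print(f"{replicas} replicas -> ~{per_replica:.0f} server connections each ({status})")

# 5 replicas sit uncomfortably close to the cap; dropping to 4 pushes each
# survivor past 100 server connections and clients pile up in cl_waiting,
# which is the cratering seen on 03 and 06. Adding patroni07 buys headroom.
```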
