If one Postgres node is having performance issues, the others should not notice it. Step 1: develop a test to reproduce the problem
In production#8156 (closed), we saw that if one of the standby nodes is having issues, all standby nodes in the Postgres cluster see a dip in traffic (the particular thread where this was discussed: production#8156 (comment 1208214511)).
This makes each node a SPOF. Normally, if one node has performance issues, the others shouldn't have any issues at all.
The final goal is to discuss possible solutions and achieve better resilience of Postgres clusters for GitLab.com. Most likely, changes will be needed in application-side code, but I think we should start by reproducing the problem in lower environments, so anyone can see it in action. That's the goal of this particular issue: develop a simple test that reproduces the problem in lower environments, including gstg.
With @fomin.list, we discussed a simple way to slow down one Postgres node: in a loop, lock pg_class for X ms, so that latencies are artificially increased for all queries coming to this node. Once one node is slowed down, we observe the others and see whether the metrics on them are affected in any way. There are also alternative ideas, but we expect this test to be very simple and reliable.
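
To make the idea concrete, here is a minimal sketch of such a loop in Python. It is not an agreed implementation. The function names, the `hold_ms` / `pause_ms` / `cycles` parameters, the default lock mode, and the use of `pg_sleep` to hold the lock are all assumptions made for illustration. `execute` stands for any callable that sends one SQL statement over a session in autocommit mode to the node under test. The lock mode is deliberately a parameter: this issue does not specify one, and which modes the target node accepts (a hot standby restricts what a read-only session may lock) needs to be checked before running this anywhere.

```python
import time
from typing import Callable, List

# Table-level lock modes accepted by PostgreSQL's LOCK TABLE.
LOCK_MODES = {
    "ACCESS SHARE", "ROW SHARE", "ROW EXCLUSIVE", "SHARE UPDATE EXCLUSIVE",
    "SHARE", "SHARE ROW EXCLUSIVE", "EXCLUSIVE", "ACCESS EXCLUSIVE",
}


def lock_cycle_sql(hold_ms: int, lock_mode: str = "ACCESS EXCLUSIVE") -> List[str]:
    """Return the statements for one cycle: lock pg_class, hold it, release it.

    The default lock mode is an assumption; the issue does not name one.
    """
    if hold_ms <= 0:
        raise ValueError("hold_ms must be positive")
    if lock_mode not in LOCK_MODES:
        raise ValueError(f"unknown lock mode: {lock_mode!r}")
    return [
        "BEGIN",
        f"LOCK TABLE pg_catalog.pg_class IN {lock_mode} MODE",
        # Hold the lock server-side for hold_ms milliseconds.
        f"SELECT pg_sleep({hold_ms / 1000:.3f})",
        # ROLLBACK ends the transaction and thereby releases the lock.
        "ROLLBACK",
    ]


def slow_down_node(
    execute: Callable[[str], object],
    hold_ms: int,
    pause_ms: int,
    cycles: int,
    lock_mode: str = "ACCESS EXCLUSIVE",
    sleep: Callable[[float], object] = time.sleep,
) -> None:
    """Repeat the lock cycle `cycles` times, pausing pause_ms between cycles."""
    for _ in range(cycles):
        for statement in lock_cycle_sql(hold_ms, lock_mode):
            execute(statement)
        # Let queued queries drain before the next lock is taken.
        sleep(pause_ms / 1000)
```

Passing `execute` in as a callable keeps the sketch independent of any particular driver and makes a dry run easy: `slow_down_node(print, hold_ms=50, pause_ms=10, cycles=1)` prints the statements without touching a database. While the loop runs against one node, the metrics of the other nodes are what we watch.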