Registry service cannot reach all replica pgbouncers in GPRD and GSTG
Problem Description
While investigating some confusion in the Teleport documentation, I noticed other problems with the registry database that seem concerning. At first these appeared to affect only GSTG, but on closer inspection I am also concerned that GPRD is currently operating in a degraded mode. We can see this in the logs for GPRD:
The registry service in GSTG frequently logs errors saying that it cannot connect to the replicas. Source
* failed to open replica "10.224.117.103:6436" database connection: verification failed: failed to connect to `user=gitlab-registry database=gitlabhq_registry`: 10.224.117.103:6436 (10.224.117.103): dial error: timeout: context deadline exceeded
* failed to open replica "10.224.117.104:6434" database connection: verification failed: context deadline exceeded
Those are the correct IP addresses and ports for the replica VMs. The replica pgbouncer metrics for GSTG and GPRD also show an unusual pattern.
GSTG
GSTG seems to show virtually no traffic.
GPRD
GPRD, on the other hand, seems to show traffic on only a single port on a single replica.
Terraform differences
I also found that GPRD has a pgbouncer load balancer defined with port 6432 open, the same port as the pgbouncer that is working. GSTG has no such load balancer.
Ideas
- Is this just a networking rule problem? Consider creating a temporary firewall rule in GSTG that allows the GKE services for registry and registry-cny to talk to the replica pool on all of the defined pgbouncer ports, and see whether the errors go away. If they do, we may need to add a defined rule in Terraform.
- Is the load balancer (which may not be used) incidentally providing this port allow rule in GPRD? Can that be verified? Maybe add a similar load balancer in GSTG and see whether it then matches prod. Then actually fix the firewall problems.
- Maybe spin up a toolbox image inside the registry namespace (or install networking tools on a running pod) and verify whether the replica pgbouncer ports can be reached.
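
The reachability check in the last idea could be sketched roughly as below, assuming Python is available in the toolbox pod. The two targets are taken from the error messages above; the full list of pgbouncer ports per replica is not confirmed here and would need to be filled in. The function names `check_port` and `check_targets` are hypothetical helpers, not part of any existing tooling. This only tests a plain TCP connect, so it says nothing about pgbouncer authentication.

```python
import socket

# Targets copied from the logged errors. Assumption: extend this list with
# every pgbouncer port defined on each replica before running the check.
TARGETS = [
    ("10.224.117.103", 6436),
    ("10.224.117.104", 6434),
]


def check_port(host, port, timeout=3.0):
    """Try a plain TCP connect and return (reachable, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No answer at all within the timeout.
        return False, "timeout"
    except OSError as exc:
        # Covers connection refused, no route to host, and similar errors.
        return False, f"error: {exc}"


def check_targets(targets, timeout=3.0):
    """Check every (host, port) pair and return a dict of results."""
    results = {}
    for host, port in targets:
        results[(host, port)] = check_port(host, port, timeout)
    return results
```

Calling `check_targets(TARGETS)` from a pod in the registry namespace and printing the result would show which ports answer. As a rough guide, a `timeout` result is more consistent with traffic being dropped (for example by a firewall rule), while a "connection refused" error usually means the host was reached but nothing is listening on that port.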


