
Registry service cannot reach all replica pgbouncers in GPRD and GSTG

Problem Description

While looking into some confusion in the Teleport documentation, I noticed other problems with the registry database that seem concerning. At first these appeared to affect only GSTG, but on closer inspection I am also concerned that GPRD is operating in a degraded mode right now. We can see this in the logs for GPRD:

Screenshot_2025-09-30_at_1.16.34_PM Source

The registry service in GSTG is frequently logging errors saying that it cannot connect to the replicas. Source

	* failed to open replica "10.224.117.103:6436" database connection: verification failed: failed to connect to `user=gitlab-registry database=gitlabhq_registry`: 10.224.117.103:6436 (10.224.117.103): dial error: timeout: context deadline exceeded
	* failed to open replica "10.224.117.104:6434" database connection: verification failed: context deadline exceeded

Those are the correct IP addresses and ports for the replica VMs. When I examine the replica pgbouncer metrics for GSTG and GPRD, I also see an unusual pattern.

GSTG

Screenshot_2025-09-30_at_1.05.56_PM Source

GSTG seems to show virtually no traffic.

GPRD

Screenshot_2025-09-30_at_1.07.24_PM Source

GPRD, on the other hand, seems to show only a single port on a single replica receiving any traffic.

Terraform differences

I also found that GPRD has a pgbouncer load balancer defined with port 6432 open, the same port as the pgbouncer that is working. GSTG has no such load balancer.

Ideas

  1. Is this just a networking rule problem? Consider creating a temporary firewall rule in GSTG that allows the GKE services for registry and registry-cny to reach the replica pool on all of the defined pgbouncer ports, to see whether the errors go away. If they do, we may need to add a permanent rule in Terraform.
  2. Is the load balancer (which may not be in use) incidentally providing this port allow rule in GPRD? Can that be verified? Maybe add a similar load balancer in GSTG and see whether it then matches prod. Then actually fix the firewall problems.
  3. Maybe spin up a toolbox image inside the registry namespace (or install networking tools on a running pod) and verify whether the replica pgbouncer ports can be reached.
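
For the reachability check in idea 3, here is a minimal sketch of a TCP probe that could be run from a pod in the registry namespace, assuming Python 3.9+ is available there. The script name `check_ports.py` and the helper `check_tcp` are hypothetical names chosen for illustration. It only tests whether a TCP connection can be opened; it does not authenticate to pgbouncer, so a success here would not rule out problems above the network layer.

```python
import socket
import sys


def check_tcp(host: str, port: int, timeout: float = 3.0) -> tuple[bool, str]:
    """Try to open a plain TCP connection and return (reachable, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # OSError covers timeouts, connection refused, and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


def main(targets: list[str]) -> None:
    """Probe each HOST:PORT target and print one result line per target."""
    if not targets:
        print("usage: check_ports.py HOST:PORT [HOST:PORT ...]")
        return
    for target in targets:
        host, _, port = target.rpartition(":")
        ok, detail = check_tcp(host, int(port))
        status = "OK" if ok else "FAIL"
        print(f"{status:4} {target} {detail}")


if __name__ == "__main__":
    main(sys.argv[1:])
```

For example, `python check_ports.py 10.224.117.103:6436 10.224.117.104:6434` would probe the two endpoints from the GSTG errors above. A timeout, as opposed to a refused connection, usually means packets are being dropped somewhere along the path, which would be consistent with a missing firewall rule.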
Edited by Cameron McFarland