Support multiple write load balancers in the application

Today we had an incident where pgbouncer-01 was cut off from the rest of the network. You can see this as gaps in the PgBouncer Prometheus metrics (https://performance.gitlab.net/dashboard/db/pgbouncer-data?orgId=1&from=1530202758807&to=1530206678459):

image

We suspect this led to all unicorn workers in the fleet timing out after 60 s and restarting. When the network came back a few minutes later, everything started working again.

We could solve this at an infrastructure level (https://gitlab.com/gitlab-com/infrastructure/issues/4065) by putting a load balancer in front of pgbouncer. This didn't work in Azure because the Azure Load Balancer caused a significant performance hit, but it might be viable in GCP. We may want to set this up in GCP and check whether performance is better there.

It might be better to solve this at the application level. Right now the EE load balancing code supports reads from multiple secondaries, but there is no way to supply multiple primary hosts. Ideally, if pgbouncer-01 goes down, the application would fall back to another pgbouncer to talk to the primary.

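We could poll each load_balancing entry with SELECT pg_is_in_recovery() to determine whether the node should be a candidate for a write. The function returns false on a primary and true on a standby, so only entries that return false would be eligible.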
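
To make the polling idea concrete, here is a rough sketch in Python (not the EE load balancing code itself, and not how it would necessarily be implemented). It assumes a hypothetical `probe` callable that runs `SELECT pg_is_in_recovery()` through a given host and raises on connection failure. The names `write_candidates`, `pick_write_host`, `pgbouncer-02`, and `replica-01` are made up for illustration.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical probe: runs `SELECT pg_is_in_recovery()` via the given host and
# returns the boolean result. It is expected to raise OSError (which includes
# ConnectionError and TimeoutError) when the host cannot be reached.
RecoveryProbe = Callable[[str], bool]


def write_candidates(hosts: Iterable[str], probe: RecoveryProbe) -> List[str]:
    """Return the hosts that currently reach a primary (i.e. not in recovery)."""
    candidates = []
    for host in hosts:
        try:
            in_recovery = probe(host)
        except OSError:
            # Host unreachable, e.g. its network was severed: skip it.
            continue
        if not in_recovery:
            candidates.append(host)
    return candidates


def pick_write_host(hosts: Iterable[str], probe: RecoveryProbe) -> Optional[str]:
    """Pick the first reachable host that talks to the primary, if any."""
    candidates = write_candidates(hosts, probe)
    return candidates[0] if candidates else None


# Stand-in for real connections, used only to exercise the logic above.
# Values are what pg_is_in_recovery() would return; None means unreachable.
FAKE_STATE = {
    "pgbouncer-01": None,    # severed from the network
    "pgbouncer-02": False,   # hypothetical second pgbouncer in front of the primary
    "replica-01": True,      # hypothetical secondary, in recovery
}


def fake_probe(host: str) -> bool:
    state = FAKE_STATE[host]
    if state is None:
        raise ConnectionError(f"cannot reach {host}")
    return state


if __name__ == "__main__":
    print(pick_write_host(FAKE_STATE.keys(), fake_probe))
```

In this sketch an unreachable entry is simply skipped, so writes would move to the next entry that reports it is not in recovery. How often to poll, and what to do when no candidate is left, are open questions this does not address.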

/cc: @yorickpeterse, @andrewn, @dawsmith
