Support high availability for database load balancing
When a Postgres database replica degrades, we see major disruption across all GitLab services, because application-side load balancing was designed for load distribution, not high availability. See https://docs.gitlab.com/ee/administration/database_load_balancing.html
We have previously discussed moving load balancing out of the application and into a load-balancing proxy, but points were made showing that our own load-balancing code is necessary for special logic such as ensuring that reads go to the primary immediately after a write. A rough illustration of that read-after-write routing follows.
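As a concrete illustration, here is a minimal sketch of LSN-based read-after-write routing, assuming psycopg2 and hypothetical host names; this is not GitLab's actual implementation, which lives in the Rails application:

```python
# Hedged sketch: route a read to a replica only if it has replayed past the
# WAL position of our last write; otherwise fall back to the primary.
# Host names below are hypothetical.
import psycopg2


def current_primary_lsn(primary):
    """Record the primary's WAL insert position right after a write."""
    with primary.cursor() as cur:
        cur.execute("SELECT pg_current_wal_insert_lsn()")
        return cur.fetchone()[0]


def replica_caught_up(replica, write_lsn):
    """True once the replica has replayed up to (or past) write_lsn."""
    with replica.cursor() as cur:
        cur.execute(
            "SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), %s::pg_lsn) >= 0",
            (write_lsn,),
        )
        return cur.fetchone()[0]


def connection_for_read(write_lsn, primary, replicas):
    """Prefer a caught-up replica; use the primary if none has caught up."""
    for conn in replicas:
        if replica_caught_up(conn, write_lsn):
            return conn
    return primary


# Usage (against a live cluster):
# primary = psycopg2.connect("host=primary dbname=gitlabhq_production")
# replicas = [psycopg2.connect("host=replica1 dbname=gitlabhq_production")]
# lsn = current_primary_lsn(primary)
# conn = connection_for_read(lsn, primary, replicas)
```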
We have seen in several high-severity incidents, such as production#4874 (closed) and production#4820 (closed), that losing a replica causes a large spike of errors, typically lasting less than 5 minutes. This spike, while short, can be quite disruptive.
- For the case where a replica goes offline, should we decrease the service discovery interval from the default of 60 seconds so we can detect the loss sooner? (See the first sketch after this list.)
- Should we push more health logic into Consul? For example, the application checks for replication delay, but this could be done by the service health check instead. We could also incorporate GCP maintenance events (https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents), which come with a 60-second lead time, to mark the replica as unhealthy and remove it from the pool before maintenance begins (see the second sketch below).
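On the first question, a hedged sketch of what a shorter discovery loop looks like, assuming dnspython and a hypothetical Consul DNS record; the real implementation is the application's service discovery described in the docs linked above:

```python
# Hedged sketch: DNS-based replica discovery on a configurable interval.
# DISCOVERY_RECORD and update_pool() are hypothetical.
import time

import dns.resolver  # dnspython

DISCOVERY_RECORD = "db-replica.service.consul"  # hypothetical record
DISCOVERY_INTERVAL = 15  # seconds, down from the default of 60


def discover_replicas():
    """Resolve the record to the current set of replica addresses."""
    answer = dns.resolver.resolve(DISCOVERY_RECORD, "A")
    return {rdata.address for rdata in answer}


def discovery_loop(update_pool):
    """Re-resolve on a fixed interval; a shorter interval means a replica
    pulled from DNS is dropped from the pool sooner."""
    known = set()
    while True:
        current = discover_replicas()
        if current != known:
            update_pool(current)  # caller swaps in the new connection pool
            known = current
        time.sleep(DISCOVERY_INTERVAL)
```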
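On the second question, a hedged sketch of a watcher for GCE maintenance events, assuming the `requests` library and hypothetical `mark_unhealthy()`/`mark_healthy()` hooks into whatever health-check mechanism owns the pool:

```python
# Hedged sketch: long-poll the GCE metadata server for maintenance events
# and drain the replica ahead of time.
import requests

METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1"
                "/instance/maintenance-event")
HEADERS = {"Metadata-Flavor": "Google"}


def watch_maintenance_events(mark_unhealthy, mark_healthy):
    """Block until the value changes (or timeout_sec elapses, in which
    case the current value is returned)."""
    last_etag = "0"
    while True:
        resp = requests.get(
            METADATA_URL,
            params={
                "wait_for_change": "true",
                "last_etag": last_etag,
                "timeout_sec": "60",
            },
            headers=HEADERS,
            timeout=90,  # client timeout must exceed timeout_sec
        )
        resp.raise_for_status()
        last_etag = resp.headers.get("etag", last_etag)
        if resp.text.strip() == "NONE":
            mark_healthy()
        else:
            # e.g. MIGRATE_ON_HOST_MAINTENANCE: the ~60s lead time lets us
            # remove the replica from the pool before migration starts.
            mark_unhealthy()
```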