Improve application database load balancing logic to handle unresponsive DBs
Summary
In this incident, GitLab.com became unavailable because the underlying physical disk of a newly-added Patroni CI replica became unhealthy and stalled on I/O. Many queries (including ones that are normally extremely cheap) started failing with a statement timeout. The Puma and Sidekiq workers use a round-robin load-balancing scheme for delegating most read-only queries to replicas. Because that scheme selects hosts without regard to their responsiveness, it keeps routing a share of all read traffic to any slow replica, stalling those requests.
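The round-robin behaviour described above can be illustrated with a minimal Ruby sketch (the class and host names are hypothetical, not GitLab's actual load-balancer code): selection simply cycles through the host list without consulting health, so a degraded replica keeps receiving its full share of queries.

```ruby
# Hypothetical sketch: plain round-robin selection ignores replica health,
# so a slow host keeps receiving its share of read-only queries.
class RoundRobinBalancer
  def initialize(hosts)
    @hosts = hosts
    @index = -1
  end

  # Cycle through hosts in order, with no health awareness.
  def next_host
    @index = (@index + 1) % @hosts.size
    @hosts[@index]
  end
end

balancer = RoundRobinBalancer.new(%w[replica-1 replica-2 slow-replica])
picks = Array.new(9) { balancer.next_host }
# A third of all reads still land on the slow replica.
```

This is why degradation on a single host (rather than a clean failure, which the balancer can detect) affects a large fraction of requests.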
Related Incident(s)
Originating issue(s): gitlab-com/gl-infra/production#18269 (closed)
Desired Outcome/Acceptance Criteria
The application database load balancer behaved badly in response to degradation (but not complete failure) on patroni-ci-05. We should improve the load balancing logic to be smarter in this slow-but-not-dead case.
Specifically:
- Run health check queries with a very low statement timeout
- Treat statement timeouts in health checks as errors that should lead to marking the host unhealthy
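A minimal Ruby sketch of the two changes above. The class name, the 100 ms value, and the connection interface are illustrative assumptions, not GitLab's actual implementation: the probe runs under a very low statement timeout, and a timeout is treated as a hard failure that removes the host from rotation rather than leaving it to stall real queries.

```ruby
# Hypothetical sketch of the proposed health-check behaviour; the class,
# the 100 ms timeout, and the connection interface are assumptions.
class ReplicaHealthCheck
  HEALTH_CHECK_TIMEOUT_MS = 100 # very low, so a stalled disk fails fast

  StatementTimeout = Class.new(StandardError) # stands in for PG::QueryCanceled

  def initialize(connection)
    @connection = connection
  end

  # True only if the replica answers the cheap probe within the low
  # timeout; a statement timeout marks the host unhealthy instead of
  # leaving it in the round-robin rotation.
  def healthy?
    @connection.execute("SET statement_timeout = #{HEALTH_CHECK_TIMEOUT_MS}")
    @connection.execute("SELECT 1")
    true
  rescue StatementTimeout
    false
  end
end

# Stub connections to illustrate both outcomes without a real database.
healthy_conn = Object.new
def healthy_conn.execute(_sql); nil; end

stalled_conn = Object.new
def stalled_conn.execute(sql)
  raise ReplicaHealthCheck::StatementTimeout if sql == "SELECT 1"
end

ReplicaHealthCheck.new(healthy_conn).healthy? # => true
ReplicaHealthCheck.new(stalled_conn).healthy? # => false
```

The key design point is the rescue clause: today a timed-out health check can look like a transient error, whereas here it is an explicit signal to mark the host unhealthy.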
Associated Services
Service: Postgres in GitLab.com / GitLab Infrastructure Team / Production Engineering
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
- Assign a service label
- Assign a team label