Improve application database load balancing logic to handle unresponsive DBs
Summary
In this incident, GitLab.com became unavailable because the underlying physical disk of a newly-added Patroni CI replica became unhealthy and stalled on I/O. Many queries (including ones that are normally extremely cheap) started failing with a statement timeout. The Puma and Sidekiq workers use a round-robin load-balancing scheme for delegating most read-only queries to replicas. Because that scheme selects hosts without regard to their responsiveness, it keeps routing a share of all read traffic to any slow replica, stalling those requests.
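The round-robin behaviour described above can be illustrated with a minimal Ruby sketch (the class and host names are hypothetical, not GitLab's actual load-balancer code): selection simply cycles through the host list without consulting health, so a degraded replica keeps receiving its full share of queries.

```ruby
# Hypothetical sketch: plain round-robin selection ignores replica health,
# so a slow host keeps receiving its share of read-only queries.
class RoundRobinBalancer
  def initialize(hosts)
    @hosts = hosts
    @index = -1
  end

  # Cycle through hosts in order, with no health awareness.
  def next_host
    @index = (@index + 1) % @hosts.size
    @hosts[@index]
  end
end

balancer = RoundRobinBalancer.new(%w[replica-1 replica-2 slow-replica])
picks = Array.new(9) { balancer.next_host }
# A third of all reads still land on the slow replica.
```

This is why degradation on a single host (rather than a clean failure, which the balancer can detect) affects a large fraction of requests.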
Related Incident(s)
Originating issue(s): gitlab-com/gl-infra/production#18269 (closed)
Desired Outcome/Acceptance Criteria
The application database load balancer behaved badly in response to degradation (but not complete failure) on patroni-ci-05. We should improve the load balancing logic to be smarter in this slow-but-not-dead case.
Specifically:
- Run health check queries with a very low statement timeout
- Treat statement timeouts in health checks as errors that should lead to marking the host unhealthy
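A minimal Ruby sketch of the two changes above. The class name, the 100 ms value, and the connection interface are illustrative assumptions, not GitLab's actual implementation: the probe runs under a very low statement timeout, and a timeout is treated as a hard failure that removes the host from rotation rather than leaving it to stall real queries.

```ruby
# Hypothetical sketch of the proposed health-check behaviour; the class,
# the 100 ms timeout, and the connection interface are assumptions.
class ReplicaHealthCheck
  HEALTH_CHECK_TIMEOUT_MS = 100 # very low, so a stalled disk fails fast

  StatementTimeout = Class.new(StandardError) # stands in for PG::QueryCanceled

  def initialize(connection)
    @connection = connection
  end

  # True only if the replica answers the cheap probe within the low
  # timeout; a statement timeout marks the host unhealthy instead of
  # leaving it in the round-robin rotation.
  def healthy?
    @connection.execute("SET statement_timeout = #{HEALTH_CHECK_TIMEOUT_MS}")
    @connection.execute("SELECT 1")
    true
  rescue StatementTimeout
    false
  end
end

# Stub connections to illustrate both outcomes without a real database.
healthy_conn = Object.new
def healthy_conn.execute(_sql); nil; end

stalled_conn = Object.new
def stalled_conn.execute(sql)
  raise ReplicaHealthCheck::StatementTimeout if sql == "SELECT 1"
end

ReplicaHealthCheck.new(healthy_conn).healthy? # => true
ReplicaHealthCheck.new(stalled_conn).healthy? # => false
```

The key design point is the rescue clause: today a timed-out health check can look like a transient error, whereas here it is an explicit signal to mark the host unhealthy.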
Associated Services
Service: Postgres in GitLab.com / GitLab Infrastructure Team / Production Engineering
Corrective Action Issue Checklist
- Link the incident(s) this corrective action arose from
- Give context for what problem this corrective action is trying to prevent re-occurring
- Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
- Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
- Assign a service label
- Assign a team label