
Improve application database load balancing logic to handle unresponsive DBs

Summary

In this incident, GitLab.com became unavailable after the underlying physical disk of a newly added Patroni CI replica became unhealthy and stalled on I/O. Many queries (including ones that are normally extremely cheap) began failing with a statement timeout. The Puma and Sidekiq workers use a round-robin load-balancing scheme to delegate most read-only queries to replicas, and that scheme tends to stall on any slow replica.
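The failure mode above can be sketched as follows. This is a minimal, hypothetical round-robin selector (the class and method names are illustrative, not GitLab's actual implementation): the scheduler only skips hosts it has marked offline, so a replica that is stalling on I/O but still passing health checks keeps receiving its share of queries.

```ruby
# Hypothetical sketch of a round-robin read-replica selector. A replica that
# is slow but not marked offline continues to be picked, which is why every
# worker eventually stalls behind it.
class RoundRobinBalancer
  def initialize(hosts)
    @hosts = hosts
    @index = 0
    @online = hosts.to_h { |h| [h, true] }
  end

  def mark_offline(host)
    @online[host] = false
  end

  # Returns the next host considered online. "Online" here only means the
  # host has not failed a health check; it says nothing about latency.
  def next_host
    @hosts.length.times do
      host = @hosts[@index % @hosts.length]
      @index += 1
      return host if @online[host]
    end
    nil # all replicas offline; the caller would fall back to the primary
  end
end
```

A stalled replica is only removed from rotation once something calls `mark_offline` on it, which is exactly the gap the acceptance criteria below address.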

Related Incident(s)

Originating issue(s): gitlab-com/gl-infra/production#18269 (closed)

Desired Outcome/Acceptance Criteria

The application database load balancer responded poorly to degradation (but not complete failure) on patroni-ci-05. We should improve the load-balancing logic to be smarter in this slow-but-not-dead case.

Specifically:

  • Run health check queries with a very low statement timeout
  • Treat statement timeouts in health checks as errors that should lead to marking the host unhealthy
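The two criteria above can be sketched together. This is an illustrative Ruby outline, not the actual implementation: in a real check the low timeout would come from Postgres itself (e.g. `SET LOCAL statement_timeout` around a trivial `SELECT 1`), but here Ruby's `Timeout` and an injectable callable stand in for the query so the sketch stays self-contained. The timeout value is an assumption, not a number from the incident.

```ruby
require "timeout"

# Assumed value for illustration; the real check would tune this carefully.
HEALTH_CHECK_TIMEOUT = 0.5 # seconds

# Run a health check query under a very low timeout. A timeout is treated the
# same as any other error: the host is reported unhealthy so the balancer
# takes it out of rotation instead of continuing to route reads to it.
def host_healthy?(run_query)
  Timeout.timeout(HEALTH_CHECK_TIMEOUT) { run_query.call }
  true
rescue Timeout::Error, StandardError
  # A statement timeout on a trivial query is evidence of a stalled replica,
  # not a transient blip: mark the host unhealthy.
  false
end
```

The key design point is that a slow answer and no answer are handled identically, closing the gap where a slow-but-alive replica stayed in rotation.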

Associated Services

Service: Postgres in GitLab.com / GitLab Infrastructure Team / Production Engineering

Corrective Action Issue Checklist

  • Link the incident(s) this corrective action arose from
  • Give context on the problem this corrective action is trying to prevent from recurring
  • Assign a severity label (this is the highest sev of related incidents, defaults to 'severity::4')
  • Assign a priority (this will default to 'Reliability::P4' but should match the severity of the related incident)
  • Assign a service label
  • Assign a team label