Investigate application start failures when all known read-only replicas are not responsive.

Summary

In the Incident Review meeting earlier today (05 December), corrective action item 13 in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8528 was skipped in favor of a more in-depth discussion.

We need to investigate why losing just 2 patroni nodes affected us.

The meeting was nearly over, so I stated that a follow-up meeting should be scheduled to discuss the concern. Most of the individuals on the call reassembled shortly afterward for a separate call, and we ended up discussing the matter then. It quickly became evident that more information needs to be uncovered before a fruitful conversation can take place.

Definition of Done

  • Outline the technical conditions under which the failure occurred.
    • Provide steps to reproduce the behavior.
  • Document this scenario in the appropriate section of our runbook.
  • @ansdval will determine a priority level and bring it to the attention of the development teams in the Performance & Availability meeting, using the ~infradev and gitlab.com label combination.
Edited by Alejandro Rodríguez