Investigate application start failures when all known read-only replicas are not responsive.
Summary
In the Incident Review meeting earlier today (05 December), corrective action item 13 in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8528 was skipped in favor of a more in-depth discussion.
We need to investigate why losing just 2 Patroni nodes affected us.
The meeting was nearly over, so I stated that a follow-up meeting should be scheduled to discuss the concern. Most of the individuals in the call assembled shortly afterward for a separate call, and we ended up discussing the matter then. It quickly became evident that more information needed to be uncovered before a fruitful conversation could take place.
Definition of Done
- Outline the technical conditions under which the failure occurred.
- Provide steps to reproduce the behavior.
- Document this scenario in the appropriate section of our runbook.
- @ansdval will determine a priority level and bring it to the attention of the development teams in the Performance & Availability meeting using the ~infradev and gitlab.com label combination.
Edited by Alejandro Rodríguez