[PagerDuty] Registry - Increased Connection Error
Summary
Issue
PagerDuty alerts for:
- https://gitlab.pagerduty.com/incidents/P70HNTT - IncreasedBackendConnectionErrors
- https://gitlab.pagerduty.com/incidents/P3RW9E3 - IncreasedServerConnectionErrors
Investigation
Investigation revealed that HAProxy was not receiving responses from registry-01 and registry-04, which resulted in 503s being returned to users. The other nodes, registry-02 and registry-03, were fine. A deeper look showed that memory usage on registry-01 and registry-04 had been creeping up over time until it came close to the maximum available. This coincides with the time these nodes started showing increased page faults, IO wait, and IO ops.
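The memory check described above can be sketched in a few lines of Python. This is only an illustration and not the tooling used during the incident. It assumes a Linux node whose `/proc/meminfo` exposes `MemTotal` and `MemAvailable`; the function names and the 90% threshold are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for sized fields)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_used_percent(meminfo):
    """Share of memory in use, derived from MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the node's meminfo file (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


if __name__ == "__main__":
    THRESHOLD = 90.0  # hypothetical threshold, not taken from the incident
    used = memory_used_percent(read_meminfo())
    state = "near the ceiling" if used >= THRESHOLD else "ok"
    print(f"memory used: {used:.1f}% ({state})")
```

Running this on each registry node would give a quick comparison between the affected nodes and the healthy ones. Page faults and IO wait would still need to be read from the existing monitoring graphs.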
Proximate Root Cause
The increase in connection errors was caused by memory usage hitting the ceiling on registry-01 and registry-04, which led to page faults and IO wait. Why memory consumption kept climbing is yet to be determined.
Action Items
- Reboot registry-01
- Reboot registry-04
- Verify the alerts are cleared
- Root-cause the memory consumption climb (depending on the outcome, consider upgrading or investigating a possible memory leak)
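For the last action item, one way to tell a steady climb from normal fluctuation is to fit a straight line to periodic memory readings. The sketch below is a hedged illustration: the function names are hypothetical, and the input is assumed to be a list of `(hours_since_start, used_percent)` samples collected by whatever monitoring is already in place.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, used_percent) samples, in percent per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    if variance == 0:
        raise ValueError("samples must span more than one point in time")
    covariance = sum((t - mean_t) * (u - mean_u) for t, u in samples)
    return covariance / variance


def hours_until_limit(samples, limit=100.0):
    """Estimate hours until usage reaches `limit`, or None if usage is not climbing."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (limit - last_used) / slope)
```

A persistently positive slope after the reboots would support the memory-leak hypothesis, while a flat trend would point toward the load or sizing explanation.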
Edited by Amarbayar Amarsanaa