[PagerDuty] Registry - Increased Connection Error

Summary

Issue

PagerDuty alerts for:

  • https://gitlab.pagerduty.com/incidents/P70HNTT - IncreasedBackendConnectionErrors
  • https://gitlab.pagerduty.com/incidents/P3RW9E3 - IncreasedServerConnectionErrors

Investigation

Investigation revealed an issue with registry-01 and registry-04: HAProxy was not getting responses back from these two nodes, so 503s were returned to users. The other nodes, registry-02 and registry-03, were fine. A deeper investigation showed that memory usage on registry-01 and registry-04 had been creeping up over time until it came close to the maximum available. This coincides with the time those nodes started showing increased page faults, IO wait, and IO operations.
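As an illustration of the kind of check described above, here is a minimal Python sketch that computes memory usage and the major page fault counter from the contents of `/proc/meminfo` and `/proc/vmstat` on a Linux node. It is not the tooling used during this incident. The function names are hypothetical, and it assumes a kernel that exposes `MemAvailable` (Linux 3.14 or later).

```python
def parse_kv(text):
    """Parse 'Key: 123 kB' (meminfo) or 'key 123' (vmstat) lines into a dict of ints."""
    values = {}
    for line in text.splitlines():
        parts = line.replace(":", " ").split()
        if len(parts) >= 2 and parts[1].isdigit():
            values[parts[0]] = int(parts[1])
    return values


def memory_used_percent(meminfo_text):
    """Percentage of memory in use, based on MemTotal and MemAvailable.

    Assumption: MemAvailable is present (Linux 3.14+).
    """
    info = parse_kv(meminfo_text)
    total = info["MemTotal"]
    available = info["MemAvailable"]
    return 100.0 * (total - available) / total


def major_faults(vmstat_text):
    """Cumulative count of major page faults (faults that required disk IO)."""
    return parse_kv(vmstat_text)["pgmajfault"]
```

On a node, the two functions would be fed the text of `/proc/meminfo` and `/proc/vmstat` respectively. Because `pgmajfault` is a cumulative counter, it is the difference between two readings taken some time apart that shows whether the node is actively faulting pages in from disk.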

Proximal Root Cause

The increase in connection errors was caused by memory usage hitting the ceiling on registry-01 and registry-04, which led to page faults and IO wait. Why memory consumption kept climbing is yet to be determined.

Action Items

  • Reboot registry-01
  • Reboot registry-04
  • Verify the alerts are cleared
  • Determine the root cause of the memory consumption climb (depending on the outcome, consider upgrading or investigating a possible memory leak)
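
One way to put numbers on the memory climb while investigating it is to fit a straight line through periodic memory readings and estimate when the node would reach its limit. The Python sketch below is only an illustration: the function names, the sample format (seconds, megabytes), and the assumption that growth is roughly linear are all hypothetical and not taken from this incident's data.

```python
def growth_rate(samples):
    """Least-squares slope of (seconds, megabytes) samples, in MB per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    num = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("timestamps must not all be equal")
    return num / den


def hours_until_limit(samples, limit_mb):
    """Hours until the latest reading reaches limit_mb at the fitted rate.

    Returns None when memory is flat or shrinking (no projected exhaustion).
    """
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    latest_mb = samples[-1][1]
    return max(0.0, (limit_mb - latest_mb) / rate / 3600.0)
```

A steadily positive slope that returns after a reboot would be consistent with a leak, while a flat slope would point the investigation elsewhere.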
Edited Nov 05, 2018 by Amarbayar Amarsanaa