2020-07-07 & 2020-07-08: Sporadic HAProxy Backend Connection Errors to registry service

Summary

Sporadic HAProxy backend connection errors to the registry service on 2020-07-07 and 2020-07-08.

Affecting fe-registry-01-lb-gprd and fe-registry-02-lb-gprd.

Timeline

(Screenshot: Screen_Shot_2020-07-09_at_2.53.00_PM)

All times UTC.

2020-07-07

  • 17:13 - Sporadic error rates observed from fe-registry-01 and fe-registry-02
  • 17:30 - EOC is paged
  • 17:58 - cindy declares an incident in Slack using the /incident declare command
  • Error rates gradually return to pre-incident levels at 12:15

2020-07-08

  • 14:18 - RPS to the registry service started dropping
  • 14:24 - Error rates from the registry service started rising (screenshot: Screen_Shot_2020-07-09_at_12.00.01_PM) (source)
  • 14:29 - HPA begins to scale down the ReplicaSet (screenshot: Screen_Shot_2020-07-09_at_3.01.17_PM) (source)
  • 14:43 - Alert fires for the error burn rate SLO
  • 14:54 - Alert fires for 5xx error rate on the Docker Registry load balancers
  • 14:56 - Alert comes in that the registry is down
  • 15:30 - The issue is upgraded to an ~S1 incident
  • 15:44 - The ReplicaSet is manually scaled up while investigation continues
  • 16:39 - HAProxy nodes are rebooted and connectivity between the registry service and the HAProxy nodes is restored; service briefly recovers before gradually degrading again (screenshot: Screen_Shot_2020-07-09_at_3.11.34_PM) (source)
  • 16:45 - New temporary HAProxy nodes are created and added to the load balancer
  • Old HAProxy nodes are removed from the load balancer

Incident Review

Summary

  1. Service(s) affected: Registry
  2. Team attribution: Reliability
  3. Minutes of downtime or degradation: 180 minutes (3 hours)

Metrics

  • https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A06&g0.max_source_resolution=0s&g0.expr=sum(rate(haproxy_backend_http_responses_total%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20code%2C%20instance)&g0.tab=0
  • https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A22&g0.max_source_resolution=0s&g0.expr=sum(rate(node_netstat_Tcp_RetransSegs%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20instance)&g0.tab=0
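For reference, the PromQL expressions encoded in the Thanos links above are (decoded from the URLs; label matchers and rate windows are verbatim):

    # HAProxy backend HTTP responses toward the registry, broken out by status code
    sum(rate(haproxy_backend_http_responses_total{instance=~".registry."}[1m])) by (env, code, instance)

    # TCP retransmitted segments on the registry nodes, often a sign of network-level trouble
    sum(rate(node_netstat_Tcp_RetransSegs{instance=~".registry."}[1m])) by (env, instance)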

Customer Impact

  1. Who was impacted by this incident? External customers
  2. What was the customer experience during the incident? Could not access registry images
  3. How many customers were affected? All users trying to use the registry
  4. If a precise customer impact number is unknown, what is the estimated potential impact?

Incident Response Analysis

  1. How was the event detected? Alerting
  2. How could detection time be improved? Unsure; it is not clear whether the first alert that appeared was related to the outage.
  3. How did we reach the point where we knew how to mitigate the impact? By narrowing down that the connection between HAProxy and the registry service improved with a restart of the node.
  4. How could time to mitigation be improved? I'm not sure.

Post Incident Analysis

  1. How was the root cause diagnosed? The root cause is still unknown.
  2. How could time to diagnosis be improved? Unknown
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
  4. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue that represents the change? No

5 Whys

Lessons Learned

Corrective Actions

  • Add temporary fe-lb-registry nodes
  • Add the temporary registry LBs to the Chef config
  • Make the temporary HAProxy load balancers for the registry more robust
  • Runbook for gathering network data between HAProxy and services running in our k8s cluster (see the sketch after this list)
  • Improve provisioning time for HAProxy nodes
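As a possible starting point for that runbook, the queries below sketch how backend connection errors could be correlated with TCP retransmissions on the registry HAProxy nodes. The haproxy_backend_connection_errors_total metric name assumes the standard Prometheus haproxy_exporter is in use on these nodes; that is an assumption, not something confirmed in this incident.

    # Connection errors per backend (assumes haproxy_exporter; verify the exporter actually in use)
    sum(rate(haproxy_backend_connection_errors_total{instance=~".registry."}[1m])) by (env, backend, instance)

    # TCP retransmissions on the same nodes, the same query as in the Metrics section above
    sum(rate(node_netstat_Tcp_RetransSegs{instance=~".registry."}[1m])) by (env, instance)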

Guidelines

  • Blameless RCA Guideline