2020-07-07 & 2020-07-08: Sporadic HAProxy Backend Connection Errors to registry service
Summary
Sporadic HAProxy backend connection errors to the registry service on 2020-07-07 and 2020-07-08, affecting fe-registry-01-lb-gprd and fe-registry-02-lb-gprd.
Timeline
All times UTC.
2020-07-07
- 17:13 - Sporadic error rates coming from fe-registry-01 and fe-registry-02
- 17:30 - EOC is paged
- 17:58 - cindy declares an incident in Slack using the `/incident declare` command
- Gradual return to pre-incident levels at 12:15
2020-07-08
- 14:18 - RPS to the registry service starts dropping
- 14:24 - Error rates from the registry service start rising (source)
- 14:29 - HPA begins to scale down the ReplicaSet (source)
- 14:43 - Alert for error burn rate SLO
- 14:54 - Alert for 5xx error rate on Docker Registry load balancers
- 14:56 - Alert comes in that the registry is down
- 15:30 - The incident is upgraded to ~S1
- 15:44 - The ReplicaSet is manually scaled up while investigation continues
- 16:39 - HAProxy nodes are rebooted, restoring connectivity between the registry service and the HAProxy nodes; service briefly recovers, then gradually degrades again (source)
- 16:45 - New temporary HAProxy nodes are created and added to the load balancer
- Old HAProxy nodes are removed from the load balancer
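The manual ReplicaSet increase at 15:44 counteracted the HPA, which had scaled the registry down as RPS dropped. A minimal sketch of that kind of intervention (the `registry` namespace and HPA names here are placeholders, not the actual production values):

```shell
# Inspect the HPA's current state; during the incident, dropping RPS
# led it to scale the registry ReplicaSet down.
kubectl -n registry get hpa registry

# Temporarily pin a higher floor while investigating. Raising the HPA's
# minReplicas is safer than a bare `kubectl scale`, which the HPA would
# otherwise undo on its next reconcile loop.
kubectl -n registry patch hpa registry \
  --patch '{"spec": {"minReplicas": 10}}'
```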
Incident Review
Summary
- Service(s) affected: Registry
- Team attribution: Reliability
- Minutes downtime or degradation: ~180 minutes (3 hours)
Metrics
- HAProxy backend HTTP response rates by status code: https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A06&g0.max_source_resolution=0s&g0.expr=sum(rate(haproxy_backend_http_responses_total%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20code%2C%20instance)&g0.tab=0
- TCP segment retransmission rates on registry nodes: https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A22&g0.max_source_resolution=0s&g0.expr=sum(rate(node_netstat_Tcp_RetransSegs%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20instance)&g0.tab=0
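The dashboard links above URL-encode their PromQL. For reference, the first expression can also be run directly against the Thanos query endpoint, which implements the standard Prometheus `/api/v1/query` HTTP API (authentication details omitted):

```shell
# Rate of HAProxy backend HTTP responses on registry nodes, broken out
# by environment, status code, and instance -- the same expression the
# first Thanos link above encodes.
curl -G 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=sum(rate(haproxy_backend_http_responses_total{instance=~".registry."}[1m])) by (env, code, instance)'
```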
Customer Impact
- Who was impacted by this incident? External customers
- What was the customer experience during the incident? Could not access registry images
- How many customers were affected? All users trying to use the registry
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Alerting
- How could detection time be improved? Unclear; it is not certain that the first alert that fired was related to the outage.
- How did we reach the point where we knew how to mitigate the impact? After narrowing down that connectivity between HAProxy and the registry service improved when a node was restarted.
- How could time to mitigation be improved? Unclear.
Post Incident Analysis
- How was the root cause diagnosed? The root cause is still unknown.
- How could time to diagnosis be improved? Unknown
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (a code deployment or an infrastructure change)? No. (If yes, link the issue that represents the change.)
5 Whys
Lessons Learned
Corrective Actions
- Add temporary fe-lb-registry nodes
- Add temp registry LBs to Chef config
- Make temp haproxy load balancers for registry more robust
- Write a runbook for gathering network data between HAProxy and services running in our k8s cluster
- Improve provisioning time for HAProxy nodes
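A starting point for the network-data runbook above might look like the following (interface names, ports, and the `registry` namespace/service names are placeholders, not the production values):

```shell
# On an HAProxy node: capture traffic to and from the registry backends
# for later analysis of resets, retransmits, and handshake failures.
sudo tcpdump -i eth0 -w /tmp/haproxy-registry.pcap 'tcp port 443' &

# Count sockets currently showing retransmissions; TCP retransmits on
# the registry nodes spiked during this incident (see Metrics above).
ss -ti | grep -c retrans

# From a host with cluster access: confirm which registry endpoints
# the load balancer should be reaching.
kubectl -n registry get endpoints registry -o wide
```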
Edited by John Jarvis