2020-07-07 & 2020-07-08: Sporadic HAProxy Backend Connection Errors to registry service

Summary

Sporadic HAProxy backend connection errors to the registry service on 2020-07-07 and 2020-07-08.

Affecting fe-registry-01-lb-gprd and fe-registry-02-lb-gprd.

Timeline

(Screenshot: Screen_Shot_2020-07-09_at_2.53.00_PM)

All times UTC.

2020-07-07

  • 17:13 - Sporadic error rates observed from fe-registry-01 and fe-registry-02
  • 17:30 - EOC is paged
  • 17:58 - cindy declares an incident in Slack using the /incident declare command
  • Error rates gradually return to pre-incident levels at 12:15

2020-07-08

  • 14:18 - RPS to the registry service started dropping
  • 14:24 - Error rates from the registry service started rising (screenshot: Screen_Shot_2020-07-09_at_12.00.01_PM) (source)
  • 14:29 - HPA begins to scale down the ReplicaSet (screenshot: Screen_Shot_2020-07-09_at_3.01.17_PM) (source)
  • 14:43 - Alert fires for the error burn rate SLO
  • 14:54 - Alert fires for 5xx error rate on the Docker Registry load balancers
  • 14:56 - Alert comes in that the registry is down
  • 15:30 - The issue is upgraded to an ~S1 incident
  • 15:44 - The ReplicaSet is manually scaled up while investigation continues
  • 16:39 - HAProxy nodes are rebooted and connectivity between the registry service and the HAProxy nodes is restored; service briefly recovers before gradually degrading again (screenshot: Screen_Shot_2020-07-09_at_3.11.34_PM) (source)
  • 16:45 - New temporary HAProxy nodes are created and added to the load balancer
  • Old HAProxy nodes are removed from the load balancer

Incident Review

Summary

  1. Service(s) affected: Registry
  2. Team attribution: Reliability
  3. Minutes of downtime or degradation: 180 minutes (3 hours)

Metrics

  • https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A06&g0.max_source_resolution=0s&g0.expr=sum(rate(haproxy_backend_http_responses_total%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20code%2C%20instance)&g0.tab=0
  • https://thanos-query.ops.gitlab.net/graph?g0.range_input=6h&g0.end_input=2020-07-08%2019%3A22&g0.max_source_resolution=0s&g0.expr=sum(rate(node_netstat_Tcp_RetransSegs%7Binstance%3D~%22.registry.%22%7D%5B1m%5D))%20by%20(env%2C%20instance)&g0.tab=0
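For reference, the PromQL expressions encoded in the Thanos links above are (decoded from the URLs; label matchers and rate windows are verbatim):

    # HAProxy backend HTTP responses toward the registry, broken out by status code
    sum(rate(haproxy_backend_http_responses_total{instance=~".registry."}[1m])) by (env, code, instance)

    # TCP retransmitted segments on the registry nodes, often a sign of network-level trouble
    sum(rate(node_netstat_Tcp_RetransSegs{instance=~".registry."}[1m])) by (env, instance)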

Customer Impact

  1. Who was impacted by this incident? External customers
  2. What was the customer experience during the incident? Could not access registry images
  3. How many customers were affected? All users trying to use the registry
  4. If a precise customer impact number is unknown, what is the estimated potential impact?

Incident Response Analysis

  1. How was the event detected? Alerting
  2. How could detection time be improved? Unsure; it is not clear whether the first alert that appeared was related to the outage.
  3. How did we reach the point where we knew how to mitigate the impact? By narrowing down that the connection between HAProxy and the registry service improved with a restart of the node.
  4. How could time to mitigation be improved? I'm not sure.

Post Incident Analysis

  1. How was the root cause diagnosed? The root cause is still unknown.
  2. How could time to diagnosis be improved? Unknown
  3. Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
  4. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue that represents the change? No

5 Whys

Lessons Learned

Corrective Actions

  • Add temporary fe-lb-registry nodes
  • Add the temporary registry LBs to the Chef config
  • Make the temporary HAProxy load balancers for the registry more robust
  • Runbook for gathering network data between HAProxy and services running in our k8s cluster (see the sketch after this list)
  • Improve provisioning time for HAProxy nodes
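As a possible starting point for that runbook, the queries below sketch how backend connection errors could be correlated with TCP retransmissions on the registry HAProxy nodes. The haproxy_backend_connection_errors_total metric name assumes the standard Prometheus haproxy_exporter is in use on these nodes; that is an assumption, not something confirmed in this incident.

    # Connection errors per backend (assumes haproxy_exporter; verify the exporter actually in use)
    sum(rate(haproxy_backend_connection_errors_total{instance=~".registry."}[1m])) by (env, backend, instance)

    # TCP retransmissions on the same nodes, the same query as in the Metrics section above
    sum(rate(node_netstat_Tcp_RetransSegs{instance=~".registry."}[1m])) by (env, instance)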

Guidelines

  • Blameless RCA Guideline