2025-10-20: Loadbalancer 5xx error rate for web service in cny stage exceeding SLO

Severity: 3 (Medium)

Problem: A spike in 5xx errors affected both the web and web-pages services in the cny stage. The errors were caused by 'connection refused' and 'connection reset by peer' failures during canary deployments.

Impact: Error rates exceeded SLO thresholds for both the web and web-pages services in the cny stage: the 5xx error rate reached 0.248% for web and up to 1.071% for web-pages in us-east1. Foreground requests such as '/git-upload-pack' returned 5xx errors, affecting users between 23:31 and 23:42 UTC. A subsequent deployment completed with no increase in 5xx errors, indicating that the issue is resolved.
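For readers unfamiliar with how such percentages relate to an SLO, the arithmetic can be sketched as below. This is a minimal illustration only: the request counts and the 0.1% threshold are hypothetical values chosen to reproduce the reported rates, not figures from the incident, and the function names are not taken from any monitoring tool.

```python
def error_rate_percent(errors_5xx: int, total_requests: int) -> float:
    """Return the 5xx error rate as a percentage of all requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * errors_5xx / total_requests


def exceeds_slo(errors_5xx: int, total_requests: int, max_error_percent: float) -> bool:
    """Return True when the observed 5xx rate is above the allowed maximum."""
    return error_rate_percent(errors_5xx, total_requests) > max_error_percent


# Hypothetical request counts, picked only so the rates match the reported ones.
web_rate = error_rate_percent(248, 100_000)          # web service
web_pages_rate = error_rate_percent(1_071, 100_000)  # web-pages in us-east1

# Hypothetical SLO threshold; the real threshold is not stated in this ticket.
ASSUMED_MAX_ERROR_PERCENT = 0.1

print(f"web: {web_rate:.3f}%  web-pages: {web_pages_rate:.3f}%")
print("web exceeds assumed SLO:", exceeds_slo(248, 100_000, ASSUMED_MAX_ERROR_PERCENT))
```

Real SLO monitoring usually evaluates this ratio over a sliding time window rather than over a single fixed count, so a short spike such as the one between 23:31 and 23:42 UTC can breach the threshold even when the daily average stays low.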

Causes: An edge case in recent readiness check changes for webservice pods caused readiness probe failures during canary deployments. This led to pod terminations and increased 5xx errors. The problematic change was only rolled out to cny and gprd-us-east1-b. Reverts have now been merged to address the issue.

Response strategy: We merged reverts for the recent readiness check changes. After a new deployment to gprd-cny, no further spikes in 5xx errors were observed.


This ticket was created to track INC-5089, by incident.io 🔥