503 errors on about.gitlab.com
/label incident
Summary
Receiving this error when accessing any page under about.gitlab.com:

```
Error 503 Service Unavailable
Service Unavailable
Guru Mediation:
Details: cache-syd10133-SYD 1590040342 2061637431
Varnish cache server
```
Timeline
All times UTC.
2020-05-22
- 05:53 - tfigueiro declares incident in Slack using the `/incident declare` command.
- ..... - Troubleshooting; errors from Brisbane and Sydney. No sign of errors anywhere else.
- 06:10 - No sign of any further errors. Closing incident.
- 06:30 - Follow-up report from Sydney that everything is working fine. No errors.
Incident Review
Summary
Two users in Sydney and Brisbane were seeing intermittent 503 errors referencing `cache-syd10133-SYD`. The errors appeared for about 20% of requests to the handbook pages. They didn't cause an increased error rate in our dashboards, and the errors weren't visible from anywhere else in the world. The problem appeared to be isolated to Fastly, although there was no reference to any problem on their status page. It did show `Re-routing` for their Vietnam location, but it's unclear whether Sydney traffic usually routes through there. No action was taken, and the users stopped noticing errors after about 15 minutes.
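For context, Fastly reports the serving cache node in the `X-Served-By` response header, which is how an error like this can be tied to a specific POP. A minimal sketch of such a check, assuming Python and the handbook URL as an illustrative target (this is not part of our tooling):

```python
# Sketch (not part of our tooling): print the status and serving Fastly cache
# node for one request. Fastly reports the node in the X-Served-By header.
import urllib.error
import urllib.request

URL = "https://about.gitlab.com/handbook/"  # illustrative target

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        status, headers = resp.status, resp.headers
except urllib.error.HTTPError as err:
    # urllib raises on 4xx/5xx; the error object still carries the headers.
    status, headers = err.code, err.headers

print(status, headers.get("X-Served-By", "unknown"))
# During the incident, a Sydney user would have seen something like:
# 503 cache-syd10133-SYD
```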
- Service(s) affected: About.GitLab.com
- Team attribution: core-infra (Fastly)
- Minutes downtime or degradation: 15
Metrics
Errors did not show up on our dashboards. See screenshots in comments.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers) 2 internal customers
- What was the customer experience during the incident? Handbook pages were not loading
- How many customers were affected? None known
- If a precise customer impact number is unknown, what is the estimated potential impact? Some Australian customers may have seen 503 errors while viewing handbook pages
Incident Response Analysis
- How was the event detected? Reported in the production channel
- How could detection time be improved? We could have automated error checking from multiple external locations (see the sketch after this list)
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
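A minimal sketch of the kind of check each external probe could run; the URL, sample count, and alert threshold here are illustrative assumptions, not an existing GitLab tool:

```python
# Sketch of an external availability probe (illustrative; not an existing
# GitLab tool). Deployed to several regions, each instance samples the site
# and alerts when the 5xx rate is elevated from its vantage point.
import urllib.error
import urllib.request

URL = "https://about.gitlab.com/handbook/"  # illustrative target
SAMPLES = 20           # requests per check interval
ALERT_THRESHOLD = 0.1  # alert if more than 10% of samples fail

def sample_status(url: str) -> int:
    """Return the HTTP status of one request; 0 for network-level failures."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; keep the status code
    except urllib.error.URLError:
        return 0

def error_rate(url: str, samples: int) -> float:
    statuses = [sample_status(url) for _ in range(samples)]
    return sum(1 for s in statuses if s == 0 or s >= 500) / samples

if __name__ == "__main__":
    rate = error_rate(URL, SAMPLES)
    if rate > ALERT_THRESHOLD:
        # In a real setup this would page or post to the incident channel.
        print(f"ALERT: {rate:.0%} of {SAMPLES} requests failed from this probe")
    else:
        print(f"OK: error rate {rate:.0%}")
```

Run from a handful of regions (e.g. Sydney, US, EU), an elevated rate from a single probe would have surfaced this kind of regional CDN failure even when global dashboards showed nothing.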
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?