503 errors on about.gitlab.com
/label incident
Summary
Receiving this error when accessing any page under about.gitlab.com:

```
Error 503 Service Unavailable
Service Unavailable
Guru Mediation:
Details: cache-syd10133-SYD 1590040342 2061637431
Varnish cache server
```
Timeline
All times UTC.
2020-05-22
- 05:53 - tfigueiro declares incident in Slack using the `/incident declare` command.
- ..... - Troubleshooting; errors from Brisbane and Sydney. No sign of errors anywhere else.
- 06:10 - No sign of any further errors. Closing incident.
- 06:30 - Follow-up report from Sydney that everything is working fine. No errors.
Incident Review
Summary
Two users in Sydney and Brisbane were seeing intermittent 503 errors referencing `cache-syd10133-SYD`. The errors appeared for about 20% of requests to the handbook pages. They didn't cause an increased error rate in our dashboards, and the errors weren't visible from anywhere else in the world. The problem appeared to be isolated to Fastly, although there was no reference to any problem on their status page. It did show `Re-routing` for their Vietnam location, but it's unclear whether Sydney traffic usually routes through there. No action was taken, and the users stopped noticing errors after about 15 minutes.
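For context, Fastly reports the serving cache node in the `X-Served-By` response header, which is how an error like this can be tied to a specific POP. A minimal sketch of such a check, assuming Python and the handbook URL as an illustrative target (this is not part of our tooling):

```python
# Sketch (not part of our tooling): print the status and serving Fastly cache
# node for one request. Fastly reports the node in the X-Served-By header.
import urllib.error
import urllib.request

URL = "https://about.gitlab.com/handbook/"  # illustrative target

try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        status, headers = resp.status, resp.headers
except urllib.error.HTTPError as err:
    # urllib raises on 4xx/5xx; the error object still carries the headers.
    status, headers = err.code, err.headers

print(status, headers.get("X-Served-By", "unknown"))
# During the incident, a Sydney user would have seen something like:
# 503 cache-syd10133-SYD
```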
- Service(s) affected: About.GitLab.com
- Team attribution: core-infra (Fastly)
- Minutes downtime or degradation: 15
Metrics
Errors did not show up on our dashboards. See screenshots in comments.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers) 2 internal customers
- What was the customer experience during the incident? Handbook pages were not loading
- How many customers were affected? None known
- If a precise customer impact number is unknown, what is the estimated potential impact? Some Australian customers may have seen 503 errors while viewing handbook pages
Incident Response Analysis
- How was the event detected? Reported in the production channel
- How could detection time be improved? We could have automated error checking from multiple external locations (see the sketch after this list)
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
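A minimal sketch of the kind of check each external probe could run; the URL, sample count, and alert threshold here are illustrative assumptions, not an existing GitLab tool:

```python
# Sketch of an external availability probe (illustrative; not an existing
# GitLab tool). Deployed to several regions, each instance samples the site
# and alerts when the 5xx rate is elevated from its vantage point.
import urllib.error
import urllib.request

URL = "https://about.gitlab.com/handbook/"  # illustrative target
SAMPLES = 20           # requests per check interval
ALERT_THRESHOLD = 0.1  # alert if more than 10% of samples fail

def sample_status(url: str) -> int:
    """Return the HTTP status of one request; 0 for network-level failures."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; keep the status code
    except urllib.error.URLError:
        return 0

def error_rate(url: str, samples: int) -> float:
    statuses = [sample_status(url) for _ in range(samples)]
    return sum(1 for s in statuses if s == 0 or s >= 500) / samples

if __name__ == "__main__":
    rate = error_rate(URL, SAMPLES)
    if rate > ALERT_THRESHOLD:
        # In a real setup this would page or post to the incident channel.
        print(f"ALERT: {rate:.0%} of {SAMPLES} requests failed from this probe")
    else:
        print(f"OK: error rate {rate:.0%}")
```

Run from a handful of regions (e.g. Sydney, US, EU), an elevated rate from a single probe would have surfaced this kind of regional CDN failure even when global dashboards showed nothing.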
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?