RCA for High error rate on GitLab.com
Summary
On 2019-03-14 we began seeing errors on the GitLab.com frontend fleet indicating that a large number of users were receiving 500s for web requests. The root cause was a header introduced to handle cross-origin requests via the GitLab.com CDN. The change caused an error on production that had not been seen on staging and had to be reverted.
Example front-end endpoints that were affected:
- https://gitlab.com/gitlab-org/gitlab-ce/issues
- https://gitlab.com/gitlab-org/gitlab-ce/
- https://gitlab.com/gitlab-org/gitlab-ce/tree/master
These services were not impacted, as the issue was specific to the web frontend:
- git ssh
- git https
- registry
- pages
- Service(s) affected: Web
- Team attribution:
- Minutes downtime or degradation: 30 minutes
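For illustration, the kind of change involved was a response header added at the load balancer to allow cross-origin asset requests via the CDN. The fragment below is a hypothetical haproxy sketch; the actual header value and match conditions used in the incident are not recorded in this document.

```
# Hypothetical sketch only: a CORS response header set at the haproxy
# frontend. The real origin value and any conditions are assumptions.
frontend web
    http-response set-header Access-Control-Allow-Origin "https://cdn.example.com"
```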
Impact & Metrics
- What was the impact of the incident? Outage for some web endpoints
- Who was impacted by this incident? Web users of GitLab.com
- How did the incident impact customers? 500 errors
Timeline
2019-03-13
- 2019-03-13 RC5 is abandoned because of uncertainty around the CDN loading of emojis. In https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26102 we added an additional header for CORS, but it caused problems first on staging and then on production. production#724 (closed)
- 2019-03-13 The header change that caused the 500s is backed out, but not fully, due to the version not being reverted on production. It appears the cookbook publisher is broken, although more could have been done to ensure the change was backed out completely and committed to the chef-repo.
2019-03-14
- 2019-03-14 RC6 is abandoned because the workaround to avoid using the CDN for emojis caused an issue with canary: requests were sent from the frontend to the main fleet, where the asset was not available. For this reason the entire change will be backed out for the next RC.
- 2019-03-14 11:28:00 UTC - Chef runs start that re-add the header to production, because the reverted change from the 13th was not fully backed out
- 2019-03-14 11:28:41 UTC - Error rate increases
- 2019-03-14 11:58:41 UTC - Issue is mitigated
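The roughly 30-minute window above was bounded by an error-rate increase and its mitigation. As a minimal sketch of the kind of threshold check such an alert could use (the function names, window handling, and threshold here are illustrative assumptions, not the actual GitLab.com alerting rules):

```python
def error_rate(status_codes):
    """Fraction of responses in the window that are 5xx errors."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code < 600)
    return errors / len(status_codes)


def should_alert(status_codes, threshold=0.05):
    """Fire when the 5xx rate over the window exceeds the threshold."""
    return error_rate(status_codes) > threshold


# A window dominated by 500s, as during the incident, trips the alert;
# a healthy window does not.
incident_window = [200] * 80 + [500] * 20
healthy_window = [200] * 100
```

With a 5% threshold, `should_alert(incident_window)` is true while `should_alert(healthy_window)` is false.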
What can be improved
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6385 - Return corresponding error codes from haproxy
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6388 - cookbook publishing for haproxy is broken
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5491 - run the publishing step on the same instance we push to
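The first corrective action asks haproxy to return corresponding error codes. A hypothetical sketch of that kind of change, using haproxy's `errorfile` directive to serve a status-specific page per code rather than a generic one (the file paths are assumptions):

```
# Hypothetical sketch: map each 5xx status to its own error page.
defaults
    errorfile 500 /etc/haproxy/errors/500.http
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
```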
Edited by John Jarvis