RCA for High error rate on GitLab.com

Summary

On 2019-03-14 we started to see errors on the GitLab.com frontend fleet indicating that a large number of users were receiving 500s for web requests. The root cause was a header introduced to handle cross-origin requests (CORS) via the GitLab.com CDN. This change caused an error on production that was not seen on staging and had to be reverted on production.
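The RCA does not show the exact directive that was added; as a rough illustration only, a CORS header change of this kind in haproxy (the component named in the corrective actions below) might look like the following. The frontend name, certificate path, backend name, and allowed origin are all hypothetical:

```haproxy
# Hypothetical sketch -- the actual change from the MR is not reproduced in this RCA.
frontend web
    bind :443 ssl crt /etc/haproxy/ssl/gitlab.pem
    # Allow the CDN origin to fetch assets cross-origin (illustrative value)
    http-response set-header Access-Control-Allow-Origin "https://assets.example-cdn.net"
    default_backend web_fleet
```

A misapplied or malformed directive at this layer affects every web request passing through the frontend, which is consistent with the broad 500s described above.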

Example endpoints affected on the frontend:

  • https://gitlab.com/gitlab-org/gitlab-ce/issues
  • https://gitlab.com/gitlab-org/gitlab-ce/
  • https://gitlab.com/gitlab-org/gitlab-ce/tree/master

These services were not impacted, as the issue was specific to the web frontend:

  • git ssh
  • git https
  • registry
  • pages

Service(s) affected: Web
Team attribution:
Minutes downtime or degradation: 30 minutes

Impact & Metrics

Start with the following:

  • What was the impact of the incident? Outage for some web endpoints
  • Who was impacted by this incident? Web users of GitLab.com
  • How did the incident impact customers? 500 errors

Timeline

2019-03-13

  • 2019-03-13 - RC5 is abandoned because of uncertainty around the CDN loading of emojis (https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/26102). We added an additional header for CORS, but it caused problems first on staging, then on production. production#724 (closed)
  • 2019-03-13 - The header change that caused the 500s is backed out, but not completely, due to the version not being reverted on production. It appears the cookbook publisher is broken, although more could have been done to ensure the change was fully backed out and committed to the chef-repo.

2019-03-14

  • 2019-03-14 - RC6 is abandoned because the workaround to not use the CDN for emojis caused an issue with canary: requests from the frontend were being sent to the main fleet, where the asset wasn't available. For this reason the entire change will be backed out for the next RC.
  • 2019-03-14 11:28:00 UTC - Chef runs begin that re-add the header to production, because the reverted change from the 13th was not fully backed out
  • 2019-03-14 11:28:41 UTC - Error rate increases
  • 2019-03-14 11:58:41 UTC - Issue mitigated

What can be improved

Start with the following:

  • Using the root cause analysis, explain what can be improved to prevent this from happening again.
  • Is there anything that could have been done to improve the detection or time to detection?
  • Is there anything that could have been done to improve the response or time to response?
  • Is there an existing issue that would have either prevented this incident or reduced the impact?
  • Did we have any indication or beforehand knowledge that this incident might take place?

Corrective Actions

  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6385 - Return corresponding error codes from haproxy
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6388 - cookbook publishing for haproxy is broken
  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5491 - run the publishing step on the same instance we push to
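For the first corrective action (returning corresponding error codes from haproxy), a minimal sketch of the mechanism haproxy provides for this: custom error files mapped to specific status codes, so users see an accurate 502/503 rather than a generic failure. The file paths are hypothetical:

```haproxy
# Hypothetical sketch -- serve status-appropriate error pages from haproxy
# itself when a backend is down, instead of a generic 500.
defaults
    errorfile 502 /etc/haproxy/errors/502.http
    errorfile 503 /etc/haproxy/errors/503.http
```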

Guidelines

  • Blameless RCA Guideline
  • 5 whys
Edited Mar 14, 2019 by John Jarvis