2020-05-29: HTTP 401s on various components of the GitLab UI
Reports from @fcatteau, @ahmadsherif, and @andrewn of 401 errors on MR pages.
Summary
401s on user actions from the MR page
Actions such as MR approval or manual pipeline runs are resulting in 401s.
Timeline
All times UTC.
2020-05-29
- 08:19 - Canary deploy begins
- 09:07 - @andrewn declares an incident in Slack using the /incident declare command
- 09:15 - Determined this was affecting Canary only
- 09:19 - Canary is drained (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590743945437300)
- 09:19 - Dev escalation engaged (internal Slack: https://gitlab.slack.com/archives/CLKLMSUR4/p1590743978156800)
- 09:24 - Canary deploy completes, though it is not serving any traffic at this point
- 14:04 - No one is able to reproduce the issue locally or in our staging environment
- 16:32 - Release Manager initiates action to re-enable canary with limited forced access
- Chef roles are modified to prevent some URLs from being forced to Canary
- Communication is broadcast indicating we are troubleshooting Canary, with a link instructing users how to prevent the use of Canary
- 17:03 - Canary is re-enabled
- 18:21 - Suspicion is that routing between the Canary and main stages is not properly handled (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590776476474400)
- 18:37 - We discover that we send CSRF tokens with GraphQL queries (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590784652494100?thread_ts=1590771808.454900&cid=C101F3796)
- 18:56 - We discover that the Rails upgrade contained changes to the CSRF token format
- 19:03 - Proposal to revert just the Rails upgrade and test on the current auto-deploy branch
- 21:25 - New package completes its deploy to Canary
- 21:30 - With the weekend approaching, instead of trying to coax the Rails upgrade into place, it is decided to revert it on master, allowing us to re-evaluate the upgrade at a later time
2020-05-30
- 00:44 - Revert gets merged: gitlab-org/gitlab!33446 (merged)
- 00:12 - Production deploy begins
- 02:00 - Production deploy completes
Incident Review
Recording of the RCA: https://www.youtube.com/watch?v=MnjR6nr8Uwk
Summary
We attempted to upgrade Rails, and the upgrade contained a change to the CSRF token format. This change is not backward compatible: when we deployed it to Canary, some users were unable to complete requests, depending on where each request was routed. If a user received a token in the new format, requests that landed on non-Canary servers failed because the token could not be authenticated. Requests that relied on the routing cookie went to the correct production stage and were not impacted; however, not all frontend components respect this cookie. GraphQL is one example, and any request that uses it, if routed to the wrong stage, would fail.
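The incompatibility can be illustrated in isolation. The sketch below assumes the format change was the switch from standard to URL-safe Base64 encoding of the token (as introduced in Rails 6.0.3) and uses only the Ruby standard library, not the actual Rails verification code:

```ruby
require 'base64'

# A byte pattern chosen so the encodings exercise the characters that
# differ between the two Base64 alphabets ("+"/"/" vs "-"/"_").
raw = "\xfb\xef\xff" * 8

# Pre-upgrade Rails encodes CSRF tokens with standard Base64.
legacy_token = Base64.strict_encode64(raw)    # contains "+" and "/"

# The upgraded Rails on Canary emits URL-safe Base64 tokens instead.
urlsafe_token = Base64.urlsafe_encode64(raw)  # contains "-" and "_"

# A non-Canary server still running the old Rails strict-decodes incoming
# tokens, so a Canary-issued token fails verification and the request is
# rejected, surfacing as a 401 in the UI.
begin
  Base64.strict_decode64(urlsafe_token)
rescue ArgumentError => e
  puts "token rejected: #{e.message}"  # => token rejected: invalid base64
end
```

The upgraded Rails decodes both formats (it falls back to strict decoding), which would explain why only tokens crossing from Canary to the main stage failed, not the reverse.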
- Service(s) affected: GitLab.com
- Team attribution: Configure
- Minutes downtime or degradation: Approximately 10 minutes
Metrics
Customer Impact
- Who was impacted by this incident? Any users who might have first attempted to use a Canary instance, knowingly or not, and then attempted to access non-Canary endpoints requiring authentication.
- What was the customer experience during the incident? Various portions of the interface would fail to load.
- How many customers were affected? Unknown
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Internal users discovered repeated HTTP 401s
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact? The deploy was in the process of going out, and disabling Canary is standard procedure
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
Through changes to our infrastructure configuration. We first disabled all URL paths that force users onto our Canary instances, then published communications instructing users how to disable Canary if they ran into issues. We then re-enabled Canary, which allowed our engineers to attempt to reproduce the issue.
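For illustration, a minimal sketch of the stage selection described above; the helper, the forced-path list, and the example paths are hypothetical (the real behaviour lives in the HAProxy/Chef configuration), while the gitlab_canary cookie is the documented opt-out mechanism users were pointed to:

```ruby
# Hypothetical sketch of request routing between stages; names and paths
# are illustrative, not the production configuration.
Request = Struct.new(:path, :cookies)

# Paths forced onto Canary. The Chef role change described above amounted
# to emptying this list, so only explicit opt-ins reached Canary.
FORCED_CANARY_PATHS = ['/help'].freeze

def stage_for(request)
  return :canary if FORCED_CANARY_PATHS.any? { |p| request.path.start_with?(p) }

  # Users could opt out of Canary by setting the cookie to "false".
  request.cookies['gitlab_canary'] == 'true' ? :canary : :main
end

stage_for(Request.new('/help', {}))                        # => :canary (forced)
stage_for(Request.new('/', { 'gitlab_canary' => 'true' })) # => :canary (opt-in)
stage_for(Request.new('/', {}))                            # => :main
```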
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Yes. gitlab-org/gitlab!32486 (merged)
5 Whys
Lessons Learned
Corrective Actions
- gitlab-com/www-gitlab-com#7945 (closed)
- gitlab-com/www-gitlab-com#7946 (closed)
- gitlab-org/gitlab#219731
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10501
- gitlab-org/gitlab!34116 (merged)
- &261