2020-05-29: HTTP 401s on various components of the GitLab UI
Reports from @fcatteau, @ahmadsherif, and @andrewn of 401 errors on MR pages.
Summary
401s on user actions from the MR page
Actions such as MR approval or manual pipeline runs are resulting in 401s.
Timeline
All times UTC.
2020-05-29
- 08:19 - Canary deploy begins
- 09:07 - @andrewn declares an incident in Slack using the /incident declare command
- 09:15 - Determined this was affecting Canary only
- 09:19 - Canary is drained (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590743945437300)
- 09:19 - Dev escalation engaged (internal Slack: https://gitlab.slack.com/archives/CLKLMSUR4/p1590743978156800)
- 09:24 - Canary deploy completes, though it is not serving any traffic at this point
- 14:04 - No one is able to reproduce the issue locally or in our staging environment
- 16:32 - Release Manager initiates action to re-enable canary with limited forced access
- Chef roles are modified to prevent some URLs from being forced to Canary
- Communication is broadcast indicating we are troubleshooting Canary, with a link instructing users how to prevent the use of Canary
- 17:03 - Canary is re-enabled
- 18:21 - Suspicion is that routing between the Canary and main stages is not properly handled (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590776476474400)
- 18:37 - We discover that we send CSRF tokens with GraphQL queries (internal Slack: https://gitlab.slack.com/archives/C101F3796/p1590784652494100?thread_ts=1590771808.454900&cid=C101F3796)
- 18:56 - We discover that the Rails upgrade contained changes to the CSRF token format
- 19:03 - Proposal to revert just the Rails upgrade and test on the current auto-deploy branch
- 21:25 - New package completes its deploy to Canary
- 21:30 - With the weekend approaching, instead of trying to coax the Rails upgrade into place, it is decided to revert it on master, allowing us to re-evaluate the upgrade at a later time
2020-05-30
- 00:44 - Revert gets merged: gitlab-org/gitlab!33446 (merged)
- 00:12 - Production deploy begins
- 02:00 - Production deploy completes
Incident Review
Recording of the RCA: https://www.youtube.com/watch?v=MnjR6nr8Uwk
Summary
We attempted to upgrade Rails, and the upgrade contained a change to the CSRF token format. This change is not backward compatible: when we deployed it to Canary, some users were unable to complete requests, depending on where each request was routed. If a user received a token in the new format, requests that landed on non-Canary servers failed because the token could not be authenticated. Requests that relied on the routing cookie went to the correct production stage and were not impacted; however, not all frontend components respect this cookie. GraphQL is one example, and any request that uses it, if routed to the wrong stage, would fail.
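The incompatibility can be illustrated in isolation. The sketch below assumes the format change was the switch from standard to URL-safe Base64 encoding of the token (as introduced in Rails 6.0.3) and uses only the Ruby standard library, not the actual Rails verification code:

```ruby
require 'base64'

# A byte pattern chosen so the encodings exercise the characters that
# differ between the two Base64 alphabets ("+"/"/" vs "-"/"_").
raw = "\xfb\xef\xff" * 8

# Pre-upgrade Rails encodes CSRF tokens with standard Base64.
legacy_token = Base64.strict_encode64(raw)    # contains "+" and "/"

# The upgraded Rails on Canary emits URL-safe Base64 tokens instead.
urlsafe_token = Base64.urlsafe_encode64(raw)  # contains "-" and "_"

# A non-Canary server still running the old Rails strict-decodes incoming
# tokens, so a Canary-issued token fails verification and the request is
# rejected, surfacing as a 401 in the UI.
begin
  Base64.strict_decode64(urlsafe_token)
rescue ArgumentError => e
  puts "token rejected: #{e.message}"  # => token rejected: invalid base64
end
```

The upgraded Rails decodes both formats (it falls back to strict decoding), which would explain why only tokens crossing from Canary to the main stage failed, not the reverse.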
- Service(s) affected: GitLab.com
- Team attribution: Configure
- Minutes downtime or degradation: Approximately 10 minutes
Metrics
Customer Impact
- Who was impacted by this incident? Any users who might have first attempted to use a Canary instance, knowingly or not, and then attempted to access non-Canary endpoints requiring authentication.
- What was the customer experience during the incident? Various portions of the interface would fail to load.
- How many customers were affected? Unknown
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected? Internal users discovered repeated HTTP 401s
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact? The deploy was in the process of going out, and disabling Canary is standard procedure
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
Through changes to our infrastructure configuration. We first disabled all URL paths that force users onto our Canary instances, then published communications instructing users how to disable Canary if they ran into issues. We then re-enabled Canary, which allowed our engineers to attempt to reproduce the issue.
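For illustration, a minimal sketch of the stage selection described above; the helper, the forced-path list, and the example paths are hypothetical (the real behaviour lives in the HAProxy/Chef configuration), while the gitlab_canary cookie is the documented opt-out mechanism users were pointed to:

```ruby
# Hypothetical sketch of request routing between stages; names and paths
# are illustrative, not the production configuration.
Request = Struct.new(:path, :cookies)

# Paths forced onto Canary. The Chef role change described above amounted
# to emptying this list, so only explicit opt-ins reached Canary.
FORCED_CANARY_PATHS = ['/help'].freeze

def stage_for(request)
  return :canary if FORCED_CANARY_PATHS.any? { |p| request.path.start_with?(p) }

  # Users could opt out of Canary by setting the cookie to "false".
  request.cookies['gitlab_canary'] == 'true' ? :canary : :main
end

stage_for(Request.new('/help', {}))                        # => :canary (forced)
stage_for(Request.new('/', { 'gitlab_canary' => 'true' })) # => :canary (opt-in)
stage_for(Request.new('/', {}))                            # => :main
```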
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Yes. gitlab-org/gitlab!32486 (merged)
5 Whys
Lessons Learned
Corrective Actions
- gitlab-com/www-gitlab-com#7945 (closed)
- gitlab-com/www-gitlab-com#7946 (closed)
- gitlab-org/gitlab#219731
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10501
- gitlab-org/gitlab!34116 (merged)
- &261