2021-11-10 - SAML SSO Redirects are failing with ERR_TOO_MANY_REDIRECTS
Workaround
If you see a "too many redirects" error in the browser when navigating to your gitlab.com namespace (e.g. https://gitlab.com/customer-name), you should be able to authenticate directly via your IdP.
For example, if you open your Okta app / browser plugin and click on the configured GitLab application, you'll be able to sign in successfully.
Current Status
The feature causing the issue was turned off via a Feature Flag and the issue is resolved. A Root Cause Analysis with Corrective Actions has been conducted at: gitlab-org/gitlab#345417 (closed).
Summary for CMOC notice / Exec summary:
- Customer Impact: Group SAML SSO
- Customer Impact Duration:
  - Gradual rollout (10%-50%): 2021-11-09 11:49 UTC - 2021-11-10 16:51 UTC (1742 minutes)
  - 100% rollout: 2021-11-10 16:51 UTC - 2021-11-10 18:35 UTC (104 minutes)
- Current state: See IncidentResolved
- Root cause: gitlab-org/gitlab!69448 (merged)
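The impact durations above follow directly from the rollout timestamps. A minimal sketch of that arithmetic, assuming all timestamps are UTC as stated in the timeline (the helper name `minutes_between` is illustrative, not part of any GitLab tooling):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"  # timestamp layout used in this report (UTC)


def minutes_between(start: str, end: str) -> int:
    """Return the whole number of minutes between two timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Gradual rollout (10%-50%) window
gradual = minutes_between("2021-11-09 11:49", "2021-11-10 16:51")
# 100% rollout window, ending when the feature flag was rolled back
full = minutes_between("2021-11-10 16:51", "2021-11-10 18:35")

print(gradual, full)  # 1742 104
```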
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-11-04
- 09:17 - Rolled out masked_url FF to staging for testing gitlab-org/gitlab#340181 (comment 724404626)
2021-11-09
- 11:49 - Rolling out for 10% on production
- 12:00 - 302 messages begin to appear in the logs for a specific customer.
- 17:12 - Rolling out for 25% on production
2021-11-10
- 15:33 - Rolling out for 50% on production
- 16:51 - Rolled out to 100% on production
- 17:00 - 302 redirect messages really begin to spike.
- 17:34 - @jrreid declares incident in Slack.
- 17:34 - Incident issue got created: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904
- 18:35 - Feature Flag got rolled back to rule it out as the issue: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729974330
- 18:49 - Confirmation that the Feature Flag caused the incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729986830
Corrective Actions
- Add automated QA tests for group SAML and Snowplow integration enabled/disabled
- Improve rollout issue with explicit language for enabling a feature flag globally
- Improve rollout issue language to specify when a feature is applicable
- Improve auth to fail fast with a descriptive error
- Enable certain features by default for dev and testing
- Fix SAML SSO redirects for pseudonymized URLs for Snowplow tracking (will be taken over by ~"group::product intelligence", who will fix the issue so that the pseudonymization of URLs can be rolled out)
- Refactor auth of project and group
- Monitor 302 redirects
- Proposal to use groups SAML for gitlab-com group
- Scan for feature flag changes in more paths
- Feature flag visibility
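As an illustration of the "Monitor 302 redirects" action item, here is a minimal sketch of a spike detector over per-minute counts of 302 responses. Everything in it is an assumption for illustration: the function name, the trailing-window baseline, the threshold, and the sample counts are hypothetical and do not reflect GitLab's actual monitoring or the real traffic during this incident.

```python
from statistics import mean, pstdev


def find_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds the trailing baseline.

    The baseline is the mean of the previous `window` samples plus
    `threshold` standard deviations (floored at 1.0 so that a perfectly
    flat baseline does not flag tiny fluctuations).
    """
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        limit = mean(baseline) + threshold * max(pstdev(baseline), 1.0)
        if counts[i] > limit:
            spikes.append(i)
    return spikes


# Hypothetical per-minute 302 counts: steady traffic, then a sudden jump.
sample_counts = [100, 102, 98, 101, 99, 100, 400, 420]
spike_indices = find_spikes(sample_counts)
print(spike_indices)
```

Note that only the first minute of the jump is flagged here: once the spike enters the trailing window it inflates the baseline, so a production alert would more likely compare against a longer or fixed reference period.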
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Not being able to sign in via Group SAML SSO login
- How many customers were affected?
  - 489 groups estimated
  - 21,824 unique client IDs estimated
- What were the root causes?
  - Side effect of calling group that got introduced via gitlab-org/gitlab!69448 (merged), which is skipped in the SamlController
Incident Response Analysis
- How was the incident detected?
  - Customer request and an increase of 302 HTTP responses
- How could detection time be improved?
  - Detect anomalies like the 302 redirects
- How was the root cause diagnosed?
  - Checking recently changed feature flags
- How could time to diagnosis be improved?
  - By spreading information about feature flag changes and rolling them back even if we don't know whether they are responsible for the issue.
- How did we reach the point where we knew how to mitigate the impact?
  - After turning off the feature flag and investigating the code path causing the issue
- How could time to mitigation be improved?
  - Dogfood SAML SSO
- What went well?
  - Using a feature flag gave us the chance to immediately resolve the production issue
  - The on-call team worked well together, identifying the cause within 1h. The correlation between the feature flag and the 302 redirects is hard to detect, so this was a great call by @jrreid
  - Shortly after the incident, engineers made great suggestions on how we could improve and how this could have been prevented. Special call out to @stanhu and @smcgivern.
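The diagnosis step described in this section (checking recently changed feature flags) amounts to filtering flag changes by a time window before the incident was declared. A minimal sketch, using the rollout timestamps from the timeline; the tuple layout, the function name `recent_flag_changes`, and the 24-hour lookback are assumptions for illustration, not an existing GitLab tool or API:

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # all times UTC

# (timestamp, flag name, rollout percentage) taken from the timeline above.
FLAG_CHANGES = [
    ("2021-11-09 11:49", "masked_url", 10),
    ("2021-11-09 17:12", "masked_url", 25),
    ("2021-11-10 15:33", "masked_url", 50),
    ("2021-11-10 16:51", "masked_url", 100),
]


def recent_flag_changes(changes, incident_at, lookback_hours=24):
    """Return changes made within the lookback window, newest first."""
    incident = datetime.strptime(incident_at, FMT)
    cutoff = incident - timedelta(hours=lookback_hours)
    recent = [
        c for c in changes
        if cutoff <= datetime.strptime(c[0], FMT) <= incident
    ]
    # Timestamps are zero-padded, so string order matches time order.
    return sorted(recent, key=lambda c: c[0], reverse=True)


# Incident was declared at 17:34 UTC on 2021-11-10.
candidates = recent_flag_changes(FLAG_CHANGES, "2021-11-10 17:34")
for timestamp, flag, percent in candidates:
    print(timestamp, flag, f"{percent}%")
```

Listing the newest change first puts the 100% rollout at the top, which is the change that immediately preceded the spike in 302 responses.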
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Feature Flag change gitlab-org/gitlab#340181 (closed)
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)