RCA - SAML SSO redirects failing with ERR_TOO_MANY_REDIRECTS related to pseudonymized page URL and referer URL
Summary
Incident raised: gitlab-com/gl-infra/production#5904 (closed)
- SAML SSO login attempts were stuck in a redirect loop.
- A workaround is documented in gitlab-com/gl-infra/production#5904 (closed)
- Cause identified as !69448 (merged)
- Incident resolved once the feature flag was identified and turned off #340181 (closed)
- Service(s) affected : Group SAML SSO
- Team attribution: ~"group::product intelligence"
- Minutes downtime or degradation: SAML SSO was down for approximately 2 hours, from the 100% enablement of the feature flag to resolution
The Group SSO controller skips the `:group` before-action, so it does not run before the controller's actions execute. The URL-pseudonymization code included on every page checks whether `group` is defined and then calls `group` to build the pseudonymized URL for tracking; that call caused an endless redirect.
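As a rough illustration (the names and control flow below are hypothetical stand-ins, not the actual GitLab code), the shape of the failure is two pieces of code invoking each other: rendering the SSO page runs the tracking snippet, and the tracking snippet's call to `group` redirects back to the SSO page:

```ruby
# Simplified simulation of the redirect loop. `group` stands in for the
# controller method whose call triggers the 302 back to the SSO page, and
# `masked_page_url` stands in for the tracking snippet included on every page.
def group(state)
  state[:redirects] += 1 # each call represents another 302 to the SSO page
  raise 'ERR_TOO_MANY_REDIRECTS' if state[:redirects] > 20

  masked_page_url(state) # rendering the SSO page re-runs the tracking snippet
end

def masked_page_url(state)
  group(state) # the snippet calls `group` to pseudonymize the page URL
end

begin
  masked_page_url({ redirects: 0 })
rescue RuntimeError => e
  puts e.message # => ERR_TOO_MANY_REDIRECTS, as the browser eventually reports
end
```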
Impact & Metrics
Question | Answer |
---|---|
What was the impact? | SAML SSO login attempts were stuck in a redirect loop |
Who was impacted? | Customers using SAML SSO login |
How did this impact customers? | It prevented them from logging in to GitLab.com |
How many login attempts were made? | ~1.8M |
How many customers were affected? | An estimated 489 groups |
How many customers tried to access? | An estimated 21,824 unique clients |
Detection & Response
Question | Answer |
---|---|
When was the incident detected? | 2021-11-10 – 17:34 UTC |
How was the incident detected? | Customer Report |
Did alarming work as expected? | No |
How long did it take from the start of the incident to its detection? | 30h from 10% rollout, 40min from 100% rollout |
How long did it take from detection to remediation? | 1 hour |
What steps were taken to remediate? | Rolled back the Feature Flag |
Were there any issues with the response? | No |
MR Checklist
Question | Answer |
---|---|
Was the MR acceptance checklist marked as reviewed in the MR? | Yes. The checklist was marked as reviewed and followed, including the points about review and about updating the relevant teams where needed. Regarding the tests checklist, the author and reviewer had no reason to believe that this change would impact the SAML SSO integration. |
Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | This was a one-off case that could only have been caught by enabling Snowplow in the e2e tests. We do not think adding pointers to the MR review checklist would be beneficial. |
Timeline (UTC)
2021-11-04
- 09:17 – Rolled out the `masked_url` FF to staging for testing #340181 (comment 724404626)

2021-11-09
- 11:49 – Rolling out to 10% on production
- 12:00 – 302 messages begin to appear in the logs for a specific customer
- 17:12 – Rolling out to 25% on production

2021-11-10
- 15:00 – 302 redirect messages begin to spike
- 15:33 – Rolling out to 50% on production
- 16:51 – Rolled out to 100% on production
- 17:34 – @jrreid declares incident in Slack
- 17:34 – Incident issue created: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904
- 18:35 – Feature flag rolled back to rule it out as the cause: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729974330
- 18:49 – Confirmation that the feature flag caused the incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729986830
Root Cause Analysis
Possible reasons why this incident occurred
- This bug was difficult to reproduce locally during development because it occurred only with a SAML SSO integration.
- The MR reviews could not catch this particular case because local development environments are unlikely to have test groups with a SAML SSO integration.
- Even with SAML SSO enabled locally, a careful look at the MR does not suggest the need to check the SAML SSO integration case.
- The integration tests toggled the feature flag on and off with SAML SSO, but did not enable the Snowplow integration in the e2e tests. This particular code block executed only when the feature flag and the Snowplow integration were both enabled, so we would have caught this bug if the tests had run with Snowplow enabled (see the spec sketch after this list).
- We did not have a plan to monitor for an increase in 302 errors after the feature flag was turned on.
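A rough sketch of that missing coverage follows. `stub_feature_flags` and `stub_application_setting` are existing GitLab spec helpers, but the flag name, factory, and path helpers below are illustrative assumptions rather than the exact e2e setup:

```ruby
# Hypothetical spec exercising SAML SSO sign-in with both the feature flag
# and Snowplow enabled, the combination that triggered the incident.
RSpec.describe 'Group SAML SSO sign-in with URL masking', :js do
  let(:group) { create(:group, :private) } # a group with SAML SSO configured

  before do
    stub_feature_flags(mask_page_urls: true)         # the flag behind the incident
    stub_application_setting(snowplow_enabled: true) # masking only runs with Snowplow on
  end

  it 'signs the user in without entering a redirect loop' do
    visit sso_group_saml_providers_path(group) # illustrative path helper

    expect(page).to have_current_path(group_path(group))
  end
end
```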
What went well
- Using a feature flag gave us the ability to immediately resolve the production issue.
- The on-call team worked well together and identified the cause within 1h. The correlation between the feature flag and the 302 redirects was hard to detect, which made it a great call by @jrreid.
- Engineers made great suggestions on how we could have improved and prevented this shortly after the incident. Special call-out to @stanhu and @smcgivern.
What can be improved
- We should be using group SAML ourselves, not only because it is better for managing team members but also because enterprise customers rely on it. Maybe we can identify certain groups within the company to have SSO enforcement before we enforce this on the gitlab-com group.
- We should be testing group SAML in automated QA tests.
- We should be testing group SAML in combination with Snowplow enabled/disabled in automated QA tests.
- We should figure out how to improve the auth flow so that we fail fast with a more obvious error. As discussed in gitlab-com/gl-infra/production#5904 (closed) (comment 730015291), we could consider a hard failure by undefining the group, or setting a flag that detects that we are already attempting auth and raising an exception for any attempt to re-attempt auth (see the sketch after this list).
- We should consider how we might refactor the authentication of project and group to prevent triggering these unexpected side effects in the first place.
- Per gitlab-com/gl-infra/delivery#2112 (closed), we should improve the language of the feature flag rollout issue to be more explicit about when to mention #support_gitlab-com, and automate the posting when a feature flag has been enabled globally. I suggest we strengthen the language to mention the feature flag no matter what, since it's hard to know whether something is user-visible.
- Improve the rollout issue to specify the cases in which a feature is applicable; in this case it was not obvious that Snowplow needed to be enabled.
- For development or test environments, can we find ways to enable certain features by default? It was not obvious to me that both Snowplow and this mask_page_urls flag were needed to see this problem. Plus, it took us a few minutes to get group SAML working, but I doubt most people even have a test group with SAML set up. Perhaps we need to formalize the issue-reproduce group so that any feature developed can be validated quickly on GitLab.com.
- Is there anything that could have been done to improve the detection or time to detection?
- We could detect anomalies like the 302 redirects; we already monitor 5xx errors, but not 302s.
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- No
- Did we have any indication or beforehand knowledge that this incident might take place?
- No
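A minimal sketch of the fail-fast idea from the list above (illustrative names only, not GitLab's actual auth code): remember that an SSO redirect is already in flight and raise an explicit error instead of redirecting again.

```ruby
# Raised when auth is re-attempted while a redirect is already pending,
# turning a silent redirect loop into a fast, obvious failure.
class AuthLoopError < StandardError; end

class SsoRedirector
  def initialize
    @in_progress = false
  end

  # Issue at most one SSO redirect per request.
  def redirect_to_sso!
    raise AuthLoopError, 'SSO redirect re-attempted within the same request' if @in_progress

    @in_progress = true
    :redirected # stand-in for the real 302 response
  end
end

redirector = SsoRedirector.new
redirector.redirect_to_sso! # => :redirected
redirector.redirect_to_sso! # raises AuthLoopError instead of looping
```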
Corrective actions
Corrective actions issues list:
Development – Product Intelligence
- Add automated QA tests for group SAML and Snowplow integration enabled/disabled
- Improve rollout issue with explicit language for enabling a feature flag globally
- Improve rollout issue language to specify when a feature is applicable
- Improve auth to fail fast with a descriptive error
- Enable certain features by default for dev and testing
Development – Manage::Access
- Fix SAML SSO redirects for pseudonymized URLs for Snowplow tracking (will be taken over by ~"group::product intelligence" to fix the issue and be able to roll out the pseudonymization of URLs)
- Refactor auth of project and group