RCA - SAML SSO redirects failing with ERR_TOO_MANY_REDIRECTS related to pseudonymized page URL and referer URL
Summary
Incident raised: gitlab-com/gl-infra/production#5904 (closed)
- SAML SSO login attempts were stuck in a redirect loop.
- A workaround is documented in gitlab-com/gl-infra/production#5904 (closed)
- Cause identified as !69448 (merged)
- Incident resolved once the feature flag was identified and turned off #340181 (closed)
- Service(s) affected : Group SAML SSO
- Team attribution: ~"group::product intelligence"
- Minutes downtime or degradation: SAML SSO was down for approximately 2 hours, from the 100% enablement of the feature flag to resolution
The Group SSO controller skips the `:group` before-action, so it does not run before the controller's actions execute. The URL-pseudonymization code included on every page checks whether `group` is defined and then calls `group` to build the pseudonymized URL for tracking; that call caused an endless redirect.
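As a rough illustration (the names and control flow below are hypothetical stand-ins, not the actual GitLab code), the shape of the failure is two pieces of code invoking each other: rendering the SSO page runs the tracking snippet, and the tracking snippet's call to `group` redirects back to the SSO page:

```ruby
# Simplified simulation of the redirect loop. `group` stands in for the
# controller method whose call triggers the 302 back to the SSO page, and
# `masked_page_url` stands in for the tracking snippet included on every page.
def group(state)
  state[:redirects] += 1 # each call represents another 302 to the SSO page
  raise 'ERR_TOO_MANY_REDIRECTS' if state[:redirects] > 20

  masked_page_url(state) # rendering the SSO page re-runs the tracking snippet
end

def masked_page_url(state)
  group(state) # the snippet calls `group` to pseudonymize the page URL
end

begin
  masked_page_url({ redirects: 0 })
rescue RuntimeError => e
  puts e.message # => ERR_TOO_MANY_REDIRECTS, as the browser eventually reports
end
```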
Impact & Metrics
Question | Answer |
---|---|
What was the impact? | SAML SSO login attempts were stuck in a redirect loop |
Who was impacted? | Customers using SAML SSO login |
How did this impact customers? | It prevented them from logging in to GitLab.com |
How many login attempts were made? | ~1.8M |
How many customers were affected? | An estimated 489 groups |
How many customers tried to access? | An estimated 21,824 unique clients |
Detection & Response
Question | Answer |
---|---|
When was the incident detected? | 2021-11-10 – 17:34 UTC |
How was the incident detected? | Customer Report |
Did alarming work as expected? | No |
How long did it take from the start of the incident to its detection? | 30h from 10% rollout, 40min from 100% rollout |
How long did it take from detection to remediation? | 1 hour |
What steps were taken to remediate? | Rolled back the Feature Flag |
Were there any issues with the response? | No |
MR Checklist
Question | Answer |
---|---|
Was the MR acceptance checklist marked as reviewed in the MR? | Yes. The checklist was marked as reviewed and followed, including the points about review and about updating the relevant teams where needed. Regarding the tests checklist, the author and reviewer had no reason to believe that this change would impact the SAML SSO integration. |
Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | This was a one-off case that could only have been caught by enabling Snowplow in the e2e tests. We do not think adding pointers to the MR review checklist would be beneficial. |
Timeline (UTC)
2021-11-04
- 09:17 – Rolled out the `masked_url` FF to staging for testing #340181 (comment 724404626)

2021-11-09
- 11:49 – Rolling out to 10% on production
- 12:00 – 302 messages begin to appear in the logs for a specific customer
- 17:12 – Rolling out to 25% on production

2021-11-10
- 15:00 – 302 redirect messages begin to spike
- 15:33 – Rolling out to 50% on production
- 16:51 – Rolled out to 100% on production
- 17:34 – @jrreid declares incident in Slack
- 17:34 – Incident issue created: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904
- 18:35 – Feature flag rolled back to rule it out as the cause: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729974330
- 18:49 – Confirmation that the feature flag caused the incident: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5904#note_729986830
Root Cause Analysis
Possible reasons why this incident occurred
- This bug was difficult to reproduce locally during development because it occurred only with a SAML SSO integration.
- The MR reviews could not catch this particular case because local development environments are unlikely to have test groups with a SAML SSO integration.
- Even with SAML SSO enabled locally, a careful look at the MR does not suggest the need to check the SAML SSO integration case.
- The integration tests toggled the feature flag on and off with SAML SSO, but did not enable the Snowplow integration in the e2e tests. This particular code block executed only when the feature flag and the Snowplow integration were both enabled, so we would have caught this bug if the tests had run with Snowplow enabled (see the spec sketch after this list).
- We did not have a plan to monitor for an increase in 302 errors after the feature flag was turned on.
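A rough sketch of that missing coverage follows. `stub_feature_flags` and `stub_application_setting` are existing GitLab spec helpers, but the flag name, factory, and path helpers below are illustrative assumptions rather than the exact e2e setup:

```ruby
# Hypothetical spec exercising SAML SSO sign-in with both the feature flag
# and Snowplow enabled, the combination that triggered the incident.
RSpec.describe 'Group SAML SSO sign-in with URL masking', :js do
  let(:group) { create(:group, :private) } # a group with SAML SSO configured

  before do
    stub_feature_flags(mask_page_urls: true)         # the flag behind the incident
    stub_application_setting(snowplow_enabled: true) # masking only runs with Snowplow on
  end

  it 'signs the user in without entering a redirect loop' do
    visit sso_group_saml_providers_path(group) # illustrative path helper

    expect(page).to have_current_path(group_path(group))
  end
end
```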
What went well
- Using a feature flag gave us the ability to immediately resolve the production issue.
- The on-call team worked well together and identified the cause within 1h. The correlation between the feature flag and the 302 redirects was hard to detect, which made it a great call by @jrreid.
- Engineers made great suggestions on how we could have improved and prevented this shortly after the incident. Special call-out to @stanhu and @smcgivern.
What can be improved
- We should be using group SAML ourselves, not only because it is better for managing team members but also because enterprise customers rely on it. Maybe we can identify certain groups within the company to have SSO enforcement before we enforce this on the gitlab-com group.
- We should be testing group SAML in automated QA tests.
- We should be testing group SAML in combination with Snowplow enabled/disabled in automated QA tests.
- We should figure out how to improve the auth flow so that we fail fast with a more obvious error. As discussed in gitlab-com/gl-infra/production#5904 (closed) (comment 730015291), we could consider a hard failure by undefining the group, or setting a flag that detects that we are already attempting auth and raising an exception for any attempt to re-attempt auth (see the sketch after this list).
- We should consider how we might refactor the authentication of project and group to prevent triggering these unexpected side effects in the first place.
- Per gitlab-com/gl-infra/delivery#2112 (closed), we should improve the language of the feature flag rollout issue to be more explicit about when to mention #support_gitlab-com, and automate the posting when a feature flag has been enabled globally. I suggest we strengthen the language to mention the feature flag no matter what, since it's hard to know whether something is user-visible.
- Improve the rollout issue to specify the cases in which a feature is applicable; in this case it was not obvious that Snowplow needed to be enabled.
- For development or test environments, can we find ways to enable certain features by default? It was not obvious to me that both Snowplow and this mask_page_urls flag were needed to see this problem. Plus, it took us a few minutes to get group SAML working, but I doubt most people even have a test group with SAML set up. Perhaps we need to formalize the issue-reproduce group so that any feature developed can be validated quickly on GitLab.com.
- Is there anything that could have been done to improve the detection or time to detection?
- We could detect anomalies like the 302 redirects; we already monitor 5xx errors, but not 302s.
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- No
- Did we have any indication or beforehand knowledge that this incident might take place?
- No
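A minimal sketch of the fail-fast idea from the list above (illustrative names only, not GitLab's actual auth code): remember that an SSO redirect is already in flight and raise an explicit error instead of redirecting again.

```ruby
# Raised when auth is re-attempted while a redirect is already pending,
# turning a silent redirect loop into a fast, obvious failure.
class AuthLoopError < StandardError; end

class SsoRedirector
  def initialize
    @in_progress = false
  end

  # Issue at most one SSO redirect per request.
  def redirect_to_sso!
    raise AuthLoopError, 'SSO redirect re-attempted within the same request' if @in_progress

    @in_progress = true
    :redirected # stand-in for the real 302 response
  end
end

redirector = SsoRedirector.new
redirector.redirect_to_sso! # => :redirected
redirector.redirect_to_sso! # raises AuthLoopError instead of looping
```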
Corrective actions
Corrective actions issues list:
Development – Product Intelligence
- Add automated QA tests for group SAML and Snowplow integration enabled/disabled
- Improve rollout issue with explicit language for enabling a feature flag globally
- Improve rollout issue language to specify when a feature is applicable
- Improve auth to fail fast with a descriptive error
- Enable certain features by default for dev and testing
Development – Manage::Access
- Fix SAML SSO redirects for pseudonymized URLs for Snowplow tracking (will be taken over by ~"group::product intelligence" to fix the issue and be able to roll out the pseudonymization of URLs)
- Refactor auth of project and group