SSO redirect issues for private groups
Incident Review
The issue began at 5:40pm EST (2023-04-03) with the deployment of a bug fix, gitlab-org/gitlab!114111 (merged), that inadvertently introduced a bug in how users of SSO-configured groups are redirected to their SSO login pages. They were instead directed to the general GitLab.com login page. As a workaround, admins could share the login URL for their groups with their users directly, or users could try to log in with email/password and get redirected to their SSO page.
The initial support tickets were raised around 11:00pm EST, indicating an issue. An MR with the fix was created, but it was determined to be safer to revert the original change, and the revert was merged at 11:30am EST. The issue stemmed from the documented behaviour of SSO enforcement for various configurations (public, private, members with SAML/without SAML) being difficult to parse for team members not fully familiar with the area.
Timeline:
- 5:40pm EST Original bug fix MR deployed
- 9:47pm EST First support ticket reported by one of the customers
- 11:04pm EST Support outreach to auth engineering to confirm SSO behaviour
- 11:58pm - 1:35am EST Auth engineering validated the unexpected behaviour
- 4:17am EST EMEA auth engineer indicated they have a fix MR getting finalized
- 10:14am EST Quick discussion on whether it's faster to merge the fix or revert the original change
- 11:15am EST Support indicated an influx of issues
- 11:21am EST Incident declared and revert MR merged to expedite
- 11:55am EST Users informed of a workaround after initial status update
- 3:19pm EST Revert MR deployed to prod and incident is resolved
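The elapsed time between key timeline entries can be worked out with a short sketch. It assumes that entries after midnight fall on 2023-04-04, the day after the deployment; the event names are labels chosen here for illustration and do not come from any GitLab tooling.

```python
from datetime import datetime

# Timestamps copied from the timeline above (EST).
# Assumption: entries after midnight are on 2023-04-04.
events = {
    "bug_fix_deployed": datetime(2023, 4, 3, 17, 40),
    "first_support_ticket": datetime(2023, 4, 3, 21, 47),
    "incident_declared": datetime(2023, 4, 4, 11, 21),
    "revert_deployed": datetime(2023, 4, 4, 15, 19),
}


def elapsed_minutes(start: str, end: str) -> int:
    """Return whole minutes between two named timeline events."""
    delta = events[end] - events[start]
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}h {rest:02d}m"


pairs = [
    ("bug_fix_deployed", "first_support_ticket"),
    ("first_support_ticket", "incident_declared"),
    ("incident_declared", "revert_deployed"),
    ("bug_fix_deployed", "revert_deployed"),
]
for start, end in pairs:
    print(f"{start} -> {end}: {as_hours_minutes(elapsed_minutes(start, end))}")
```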
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers).
  Users with SSO group SAML integrations who were attempting to log in to groups and projects, plus some internal GitLab employees.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...).
  Users without a session who tried to access a group with SSO enforcement enabled were redirected to the gitlab.com login page instead of the SSO one. As support tickets were raised, a workaround was provided: admins could directly share the fully qualified URL for their SAML login, https://gitlab.com/groups/<groupname>/-/saml/sso?token=<token>, which would allow the users to log in with SSO.
- How many customers were affected?
  We don't have exact numbers, but these figures help us better understand the impact:
  - According to the database, around 2000 groups have SSO enforced.
  - 12 customers raised the issue with the support team.
  - More than 300 000 requests to GroupsController were redirected to https://gitlab.com/users/sign_in.
    - A significant decrease can be seen once the fix is in production.
    - When we extend the interval on the logs, we can see that the count of redirects was almost 3 times higher than normal.
    - Based on these numbers, we can estimate that around 200 000 of the requests redirected to the GitLab sign-in page were caused by the incident.
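The estimate of roughly 200 000 incident-related redirects follows from the two figures above. A minimal sketch of that arithmetic, assuming the observed count was exactly 3 times the normal level (the review says "almost 3 times") and using a hypothetical helper name:

```python
def estimate_excess_redirects(observed: int, ratio_to_normal: float) -> int:
    """Estimate how many redirects exceed the normal baseline.

    observed: redirects counted during the incident window.
    ratio_to_normal: how many times higher than normal the count was.
    """
    baseline = observed / ratio_to_normal  # what a normal window would show
    return round(observed - baseline)


# Figures from the review: more than 300 000 redirects, about 3x normal.
print(estimate_excess_redirects(300_000, 3.0))  # 200000
```

Because both inputs are rounded lower or upper bounds, the result is an order-of-magnitude estimate and not a count of affected users.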
What were the root causes?
- The issue began at 5:40pm EST (2023-04-03) with the deployment of a bug fix, gitlab-org/gitlab!114111 (merged), that inadvertently introduced another bug in how users of SSO-configured groups are redirected to their SSO login pages when accessing private groups. They were instead directed to the general GitLab.com login page.
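The difference between the expected and the observed redirect can be illustrated with a small sketch. This is not GitLab's implementation: the function names and parameters are hypothetical, only the two URL shapes are taken from this review, and the sketch covers just the one case the review describes (a user without a session accessing a group with SSO enforcement enabled).

```python
from typing import Optional
from urllib.parse import quote

# General sign-in page that affected users landed on during the incident.
GENERAL_SIGN_IN_URL = "https://gitlab.com/users/sign_in"


def group_sso_url(group_path: str, token: str) -> str:
    """Build a group SSO login URL in the shape quoted in this review."""
    return (
        f"https://gitlab.com/groups/{quote(group_path)}"
        f"/-/saml/sso?token={quote(token, safe='')}"
    )


def expected_redirect(has_session: bool, sso_enforced: bool,
                      group_path: str, token: str) -> Optional[str]:
    """Expected target for the case described in the review.

    Returns None for cases this review does not describe.
    """
    if not has_session and sso_enforced:
        return group_sso_url(group_path, token)
    return None


def observed_redirect_during_incident(has_session: bool,
                                      sso_enforced: bool) -> Optional[str]:
    """Observed target during the incident for the same case."""
    if not has_session and sso_enforced:
        return GENERAL_SIGN_IN_URL
    return None


# Placeholder group name and token, for illustration only.
print(expected_redirect(False, True, "my-group", "example-token"))
print(observed_redirect_during_incident(False, True))
```

The `group_sso_url` helper also mirrors the workaround: an admin sharing that URL directly lets users reach the SSO login without relying on the redirect.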
Incident Response Analysis
- How was the incident detected?
  The initial support tickets were raised around 11:00pm EST, indicating an issue. This was initially investigated as a standalone problem until an incident was raised due to an influx of support tickets. An MR with the fix was created, but it was determined to be safer to revert the original change, and the revert was merged at 11:30am EST the following morning.
- How could detection time be improved?
  The problem surfaced through the increased number of support tickets, and an incident was raised to bring attention to the problem and to communicate the status to other users. This was done about 6 hours later than the first support outreach to dev; raising the incident sooner may have allowed us to revert the problematic MR sooner. Better documentation and understanding of the desired SSO behaviour would also lead to detecting deviations from the correct SSO flow sooner.
- How was the root cause diagnosed?
  The root cause was diagnosed by looking at recent SSO bug fixes/MRs and identifying a change set that modified whether logged-out users under SSO enforcement were prompted for the redirect in the same way as logged-in users.
- How could time to diagnosis be improved?
  After the first report of a customer issue, the problem was diagnosed within the following hour and a fix was started. The delay was primarily due to understanding the number of impacted users as opposed to time spent on diagnosis.
- How did we reach the point where we knew how to mitigate the impact?
  Once sufficient user impact was reported, the auth engineering team opted to revert the bug fix MR that introduced the issue instead of making additional code changes to mitigate the problem. This was done in active collaboration with support on call.
- How could time to mitigation be improved?
  The time to deployment for the application code was approximately 6 hours, despite being expedited by release managers once they were informed of the incident and its impact.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  We have had an SSO incident in the past, but it was due to a different root cause.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  There aren't items in the backlog that would have prevented or greatly reduced the impact of this incident.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  Yes, it was triggered by a code change: Don't enforce SSO for public groups (gitlab-org/gitlab!114111 - merged).
What went well?
The support team collaborated closely with auth engineering on the initial reporting of the issue and on identifying the cause. Similarly, the issue was correctly escalated to an incident when there was a larger influx of support tickets. We were also able to provide a workaround for users to reduce the impact.