Automatic SSO enforcement for SAML ID users left public project/group resource access in an inconsistent state
Incident Review
On 17th October 2022 at 11:05am EST, a feature was rolled out via feature toggle to enforce automatic SSO for users that have a SAML ID available. This allowed organizations to improve their security posture while still letting users not on SSO access public groups/projects. At 11:07am EST, GitLab internal users reported issues with the issue dashboard and an inability to comment on threads. On-call staff raised this as an S2 production incident (later revised to S3 based on impact), with the investigation in #incident-7890. The feature enablement had been announced ahead of time in the support Slack channel, and the toggle was subsequently disabled at 11:38am EST, mitigating the issue.
Timeline
- 11:05am EST: Feature toggle enabled
- 11:07am EST: Issue manually detected and an incident manually triggered
- 11:25am EST: Users notified on status pages
- 11:38am EST: Feature toggled off and incident mitigated
- 12:38pm EST: Users notified of complete resolution on status pages
- 3:30pm EST: Root cause identified in application code
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - The incident impacted users of public groups or projects that had SSO enabled. This included both internal users (GitLab employees in this case) and external users.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Affected users saw errors on the issue dashboard and were unable to comment on threads.
- How many customers were affected?
  - The total would be all GitLab employees (~1600) and ~136 external user groups. The external groups affected include test groups, so the actual number is lower.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - N/A
What were the root causes?
- There was a gap in the SSO enforcement permission policy: it did not restrict access to public groups/projects for SAML ID/SSO-enforced users. Those users could therefore access public resources without an active SSO session and experienced a mismatch between the actions available on public resources and those on restricted resources shown on the same screen (see the sketch below).
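A minimal sketch of how a gap like this can arise (hypothetical names, not GitLab's actual policy code): the public-access path returns early and skips the SSO-session check that the restricted path applies.

```python
from dataclasses import dataclass

@dataclass
class User:
    saml_id: str | None = None        # user has a SAML identity linked
    active_sso_session: bool = False  # user recently signed in via SSO

@dataclass
class Resource:
    public: bool              # public project/group
    group_enforces_sso: bool  # owning group has SSO enforcement enabled

def can_access(user: User, resource: Resource) -> bool:
    """Hypothetical permission check illustrating the gap."""
    if resource.public:
        # BUG: the public path returns early, so an SSO-enforced user
        # without an active SSO session can still view public resources...
        return True
    sso_enforced = resource.group_enforces_sso and user.saml_id is not None
    if sso_enforced and not user.active_sso_session:
        # ...while any action touching restricted resources on the same
        # screen is denied here, producing the observed mismatch.
        return False
    return True  # membership checks elided for brevity
```

The fix direction is to apply the SSO-session requirement uniformly, so that the public path no longer bypasses it.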
Incident Response Analysis
- How was the incident detected?
  - The incident was manually reported at 11:07am EST on 17th October 2022.
- How could detection time be improved?
  - The feature toggle `transparent_sso_enforcement` for the causing implementation was enabled at 11:05am EST and the problem was detected at 11:07am EST. The key factor in improving detection would be automated/synthetic checks that exercise issue dashboard interactivity and commenting, instead of manual reporting (which could take much longer in the future). A sketch of such a check follows below.
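A hedged sketch of such a synthetic check against GitLab's REST API. The instance URL, token, and canary project/issue IDs are placeholders; the endpoints used are the standard authenticated issues-list and issue-notes endpoints.

```python
import sys
import requests

GITLAB = "https://gitlab.example.com/api/v4"          # placeholder instance URL
HEADERS = {"PRIVATE-TOKEN": "<synthetic-check-token>"}  # placeholder credential
CANARY_PROJECT = 123  # hypothetical canary project reserved for checks
CANARY_ISSUE = 1      # hypothetical issue the check comments on

def check() -> bool:
    # 1. Issue dashboard interactivity: the authenticated issues list should load.
    r = requests.get(f"{GITLAB}/issues", headers=HEADERS, timeout=10)
    if r.status_code != 200:
        return False
    # 2. Commenting: posting a note to the canary issue should succeed.
    r = requests.post(
        f"{GITLAB}/projects/{CANARY_PROJECT}/issues/{CANARY_ISSUE}/notes",
        headers=HEADERS,
        data={"body": "synthetic check: comment write path"},
        timeout=10,
    )
    return r.status_code == 201

if __name__ == "__main__":
    # Non-zero exit lets a scheduler/alerting system page on failure,
    # e.g. when run every minute right after a toggle is enabled.
    sys.exit(0 if check() else 1)
```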
- How was the root cause diagnosed?
  - The DRI for the feature identified that the impacted users were GitLab users with SSO accessing public projects, and that there was a gap in the policies enforcing SSO on public projects. The gap was surfaced once SSO enforcement was made automatic for all users with a SAML ID, but it had always existed.
- How could time to diagnosis be improved?
  - Outside of automation, having the DRI for the feature flag actively watch #production for any reported issues would help reduce the time to diagnosis. Additionally, manually observing metrics/dashboards around browser state errors would have reduced the diagnosis time.
- How did we reach the point where we knew how to mitigate the impact?
  - The on-call engineers created an incident based on manual user reports. Some users reported that they were able to resolve the problem by logging in again through SSO. This pointed at the recently introduced feature toggle, which had been announced in the support Slack channel ahead of time in case of issues, and prompted its reversal.
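For illustration, a toggle like this can be reverted near-instantly. A sketch using GitLab's admin feature-flag API (`POST /features/:name`); the instance URL and token are placeholders, and on GitLab.com this is normally done via ChatOps, e.g. `/chatops run feature set transparent_sso_enforcement false`.

```python
import requests

GITLAB = "https://gitlab.example.com/api/v4"  # placeholder instance URL
HEADERS = {"PRIVATE-TOKEN": "<admin-token>"}  # placeholder admin credential

# POST /features/:name sets a feature flag's value instance-wide.
resp = requests.post(
    f"{GITLAB}/features/transparent_sso_enforcement",
    headers=HEADERS,
    data={"value": "false"},  # disable the toggle
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # the updated feature state
```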
- How could time to mitigation be improved?
  - The time to mitigate was < 30 minutes, with most of it spent correlating the issue dashboard and commenting problems with SSO. This could be improved by monitoring the #production Slack channel for issues after enabling the toggle, or by running synthetic checks against the functionality to identify issues right after a toggle is enabled.
- What went well?
  - The DRI notified the GitLab.com support team prior to and immediately after enabling the feature flag. The flag change was also posted in the #production channel so the action was visible, and the feature flag rollout issue noted that the flag could be disabled without concern.
  - The issue was reported right away after being seen by GitLab users.
  - On-call staff were able to reproduce the issue and created a product issue within 5 minutes of the first report.
  - Based on the reports, time to mitigate was < 30 minutes and the DRI was able to identify the root cause within 2 hours.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, it was triggered (more precisely, a pre-existing gap was surfaced) by a deployment of code: enabling the `transparent_sso_enforcement` feature toggle.
What went well?
GitLab internal users who reported the issue provided details and screen recordings of what they were seeing, including the resolution when they logged back in through SSO. That, combined with the on-call staff's quick response in reproducing the issue and coordinating with the CMOC to update status communications to customers, and the availability of the feature toggle, resolved the issue quickly.