RCA: SSO enforcement feature breaking pipelines
Summary
Enabling the feature flag enforced_sso_requires_session
made groups inaccessible to members and caused CI pipelines to fail for customers using SAML: https://gitlab.com/gitlab-org/gitlab-ee/issues/11704
- Service(s) affected: ~"Service:CI Runners"
- Team attribution: Backend Manage
- Minutes downtime or degradation: 266
Impact & Metrics
What was the impact of the incident?
Customers using SAML/SSO for authentication could not access groups, even though they had the proper permissions. Both users attempting to view groups in the UI and the CI runners owned by those groups encountered failures.
Who was impacted by this incident?
Any customer using SAML/SSO authentication.
How did the incident impact customers?
Customers browsing the UI received 404 responses (returned instead of 403 to obfuscate the existence of group paths), and their CI runners failed with 403 errors. While the feature flag was enabled, projects became visible again only after customers re-authenticated with their SSO provider.
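To make that behavior concrete, below is a minimal sketch of the 404-over-403 pattern described above. All names in it (Group, Session, group_http_status) are illustrative assumptions, not GitLab's actual code.

```python
# Illustrative sketch only: these names are assumptions, not GitLab's code.
from dataclasses import dataclass


@dataclass
class Group:
    path: str
    enforces_sso: bool


@dataclass
class Session:
    user: str
    sso_session_active: bool


def group_http_status(session: Session, group: Group, flag_enabled: bool) -> int:
    """HTTP status a group member sees for a group page under the flag."""
    if flag_enabled and group.enforces_sso and not session.sso_session_active:
        # 404 rather than 403, so the existence of the group path is not
        # leaked to viewers without an active SSO session.
        return 404
    return 200


# A member whose SSO session has lapsed sees a 404 until re-authenticating:
stale = Session(user="alice", sso_session_active=False)
group = Group(path="acme", enforces_sso=True)
assert group_http_status(stale, group, flag_enabled=True) == 404
assert group_http_status(stale, group, flag_enabled=False) == 200
```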
How many attempts were made to access the impacted service/feature?
There was only a moderate rise in 404 errors on the GroupsController, between 9 and 15. Compared with the total volume of 404s, this indicates that not many users were affected.
How many customers were affected?
Four customers reported the issue; one of them had 90 users affected. At the peak of the incident we encountered nearly 400 errors during a 30-minute period. Given our natural error rate, we estimate that fewer than 10% of these (under 40 errors) were legitimate errors.
Detection & Response
How was the incident detected?
Customers began reporting issues to support via Zendesk.
- https://gitlab.zendesk.com/agent/tickets/122122
- https://gitlab.zendesk.com/agent/tickets/122134
- https://gitlab.zendesk.com/agent/tickets/122145
Did alerting work as expected?
We received no alerts for this issue, and the error rate (404s, 403s) was too low to be visible on any dashboards or in Sentry.
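One way to catch this class of failure, sketched below with assumed names and thresholds (not GitLab's actual alerting), is to compare an endpoint's error count against its own recent baseline rather than against site-wide 40x volume, where a small spike disappears.

```python
# Illustrative sketch, not GitLab's alerting: flag a low-volume but
# anomalous error count via a per-endpoint baseline.
def is_anomalous(current: int, baseline_mean: float, baseline_std: float,
                 min_sigma: float = 3.0) -> bool:
    """True when the current count sits well outside the endpoint's norm."""
    if baseline_std == 0:
        return current > baseline_mean
    return (current - baseline_mean) / baseline_std >= min_sigma


# GroupsController 404s rose from roughly 9 to 15 during this incident.
# Against a tight per-endpoint baseline, that jump can trip an alert even
# though it is invisible next to thousands of unrelated 40x errors.
print(is_anomalous(current=15, baseline_mean=9.0, baseline_std=1.5))  # True
```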
How long did it take from the start of the incident to its detection?
The first customer report arrived via Zendesk 23 minutes after the incident started.
How long did it take from detection to remediation?
243 minutes.
Were there any issues with the response to the incident?
Yes. It took 2h33m from the first customer report to a response, which resulted in gitlab-org/gitlab-ee/issues/11704. Dialogue in that issue included little feedback from customers, because the initial conversations took place in Zendesk.
Additionally, it proved difficult for @markpundsack to open an incident in the Infrastructure department's production tracker, which unnecessarily delayed the page to the Reliability Engineer on call. Mark cited the handbook's language for Incident Management as the source of his confusion. Finally, once he did find instructions to use the /start-incident
command in the #incident-management
Slack channel, the imoc-bot
received a 404 error and failed to create an incident issue in the production tracker for the SRE on call (https://gitlab.slack.com/archives/CB7P5CJS1/p1558534349190400).
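As a sketch of the lesson here (assumed names, header, and flow; not how imoc-bot is actually implemented), a chatops bot should surface tracker API failures back to the responder rather than failing silently:

```python
# Hypothetical sketch: report issue-creation failures to the responder so a
# 404 from the tracker does not silently stall the page to the on-call SRE.
import urllib.error
import urllib.request


def create_incident_issue(api_url: str, token: str) -> str:
    req = urllib.request.Request(
        api_url,
        data=b"{}",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return f"Incident issue created (HTTP {resp.status})."
    except urllib.error.HTTPError as exc:
        # Post the failure back to the channel instead of swallowing it.
        return (f"Could not create incident issue (HTTP {exc.code}); "
                "please page the on-call SRE manually.")
```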
Timeline
See incident ticket: production#840 (closed).
Root Cause Analysis
- The chatbot permitted a feature flag to be flipped during an ongoing production deployment (see the sketch after this list).
- The line of communication for escalation from support did not include Reliability Engineering.
- Documentation for engaging Reliability Engineering was difficult to interpret and follow.
- Metrics and monitoring did not detect the issue; the error count, though anomalous, was incredibly low relative to all 40x errors.
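A guard along the following lines (all names assumed; not the actual chatops implementation) could have blocked the first root cause by refusing flag changes while a deployment is running:

```python
# Hypothetical guard: refuse chatops feature-flag changes while a
# production deployment is in progress.
def deployment_in_progress() -> bool:
    # Placeholder: in practice this would query the deploy pipeline state.
    return True


def set_feature_flag(flag: str, enabled: bool, override: bool = False) -> str:
    if deployment_in_progress() and not override:
        return (f"Refusing to set '{flag}': a production deployment is in "
                "progress. Pass override=True to force the change.")
    # ... toggle the flag here ...
    return f"Feature flag '{flag}' set to {enabled}."


print(set_feature_flag("enforced_sso_requires_session", True))
```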
What went well
- Support did a great job at creating the gitlab-ee issue and pointing all reporting customers to it so information could be centralized.
- After the issue was escalated to an incident, the cause was found within 1 minute by @stanhu and mitigated within 5 minutes by @ahanselka.
What can be improved
- We will improve the feature flag change process for better observability and awareness.
- We will be more responsive in the gitlab-ee issue so customers know we are working on it, even before an incident issue has been created.
- Creating and escalating incidents should be as easy as possible for everybody and well documented.
- We should broaden incident management training beyond Reliability Engineering to give other organizations at GitLab exposure to the process.
Corrective actions
- https://gitlab.com/gitlab-org/release/framework/issues/335
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6765
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6766
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6770
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6773
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6772
- https://gitlab.com/gitlab-org/gitlab-ee/issues/11757