RCA: SSO enforcement feature breaking pipelines
Summary
Enabling the feature flag enforced_sso_requires_session
made groups inaccessible to members and caused CI pipelines to fail for customers using SAML: https://gitlab.com/gitlab-org/gitlab-ee/issues/11704
- Service(s) affected: ~"Service:CI Runners"
- Team attribution: Backend Manage
- Minutes downtime or degradation: 266
Impact & Metrics
What was the impact of the incident?
Customers using SAML/SSO for authentication could not access groups, even though they had the proper permissions. Both users attempting to view groups in the UI and the CI runners owned by those groups encountered failures.
Who was impacted by this incident?
Any customer using SAML/SSO authentication.
How did the incident impact customers?
Customers browsing the UI received 404 responses (returned instead of 403 to obfuscate the existence of group paths), and their CI runners failed with 403 errors. While the feature flag was enabled, projects became visible again only after customers re-authenticated with their SSO provider.
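To make that behavior concrete, below is a minimal sketch of the 404-over-403 pattern described above. All names in it (Group, Session, group_http_status) are illustrative assumptions, not GitLab's actual code.

```python
# Illustrative sketch only: these names are assumptions, not GitLab's code.
from dataclasses import dataclass


@dataclass
class Group:
    path: str
    enforces_sso: bool


@dataclass
class Session:
    user: str
    sso_session_active: bool


def group_http_status(session: Session, group: Group, flag_enabled: bool) -> int:
    """HTTP status a group member sees for a group page under the flag."""
    if flag_enabled and group.enforces_sso and not session.sso_session_active:
        # 404 rather than 403, so the existence of the group path is not
        # leaked to viewers without an active SSO session.
        return 404
    return 200


# A member whose SSO session has lapsed sees a 404 until re-authenticating:
stale = Session(user="alice", sso_session_active=False)
group = Group(path="acme", enforces_sso=True)
assert group_http_status(stale, group, flag_enabled=True) == 404
assert group_http_status(stale, group, flag_enabled=False) == 200
```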
How many attempts were made to access the impacted service/feature?
There was only a moderate rise in 404 errors on the GroupsController, between 9 and 15. Compared with the total volume of 404s, this indicates that not many users were affected.
How many customers were affected?
Four customers reported the issue; one of them had 90 users affected. At the peak of the incident we encountered nearly 400 errors during a 30-minute period. Given our natural error rate, we estimate that fewer than 10% of these (under 40 errors) were legitimate errors.
Detection & Response
How was the incident detected?
Customers began reporting issues to support via Zendesk.
- https://gitlab.zendesk.com/agent/tickets/122122
- https://gitlab.zendesk.com/agent/tickets/122134
- https://gitlab.zendesk.com/agent/tickets/122145
Did alerting work as expected?
We received no alerts for this issue, and the error rate (404s, 403s) was too low to be visible on any dashboards or in Sentry.
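One way to catch this class of failure, sketched below with assumed names and thresholds (not GitLab's actual alerting), is to compare an endpoint's error count against its own recent baseline rather than against site-wide 40x volume, where a small spike disappears.

```python
# Illustrative sketch, not GitLab's alerting: flag a low-volume but
# anomalous error count via a per-endpoint baseline.
def is_anomalous(current: int, baseline_mean: float, baseline_std: float,
                 min_sigma: float = 3.0) -> bool:
    """True when the current count sits well outside the endpoint's norm."""
    if baseline_std == 0:
        return current > baseline_mean
    return (current - baseline_mean) / baseline_std >= min_sigma


# GroupsController 404s rose from roughly 9 to 15 during this incident.
# Against a tight per-endpoint baseline, that jump can trip an alert even
# though it is invisible next to thousands of unrelated 40x errors.
print(is_anomalous(current=15, baseline_mean=9.0, baseline_std=1.5))  # True
```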
How long did it take from the start of the incident to its detection?
The first customer report arrived via Zendesk 23 minutes after the incident started.
How long did it take from detection to remediation?
243 minutes.
Were there any issues with the response to the incident?
Yes. It took 2h33m from the first customer report to a response, which resulted in gitlab-org/gitlab-ee/issues/11704. Dialogue in that issue included little feedback from customers, because the initial conversations took place in Zendesk.
Additionally, it proved difficult for @markpundsack to open an incident in the Infrastructure department's production tracker, which unnecessarily delayed the page to the Reliability Engineer on call. Mark cited the handbook's language for Incident Management as the source of his confusion. Finally, once he did find instructions to use the /start-incident
command in the #incident-management
Slack channel, the imoc-bot
received a 404 error and failed to create an incident issue in the production tracker for the SRE on call (https://gitlab.slack.com/archives/CB7P5CJS1/p1558534349190400).
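As a sketch of the lesson here (assumed names, header, and flow; not how imoc-bot is actually implemented), a chatops bot should surface tracker API failures back to the responder rather than failing silently:

```python
# Hypothetical sketch: report issue-creation failures to the responder so a
# 404 from the tracker does not silently stall the page to the on-call SRE.
import urllib.error
import urllib.request


def create_incident_issue(api_url: str, token: str) -> str:
    req = urllib.request.Request(
        api_url,
        data=b"{}",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return f"Incident issue created (HTTP {resp.status})."
    except urllib.error.HTTPError as exc:
        # Post the failure back to the channel instead of swallowing it.
        return (f"Could not create incident issue (HTTP {exc.code}); "
                "please page the on-call SRE manually.")
```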
Timeline
See incident ticket: production#840 (closed).
Root Cause Analysis
- The chatbot permitted a feature flag to be flipped during an ongoing production deployment (see the sketch after this list).
- The line of communication for escalation from support did not include Reliability Engineering.
- Documentation for engaging Reliability Engineering was difficult to interpret and follow.
- Metrics and monitoring did not detect the issue; the error count, though anomalous, was incredibly low relative to all 40x errors.
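A guard along the following lines (all names assumed; not the actual chatops implementation) could have blocked the first root cause by refusing flag changes while a deployment is running:

```python
# Hypothetical guard: refuse chatops feature-flag changes while a
# production deployment is in progress.
def deployment_in_progress() -> bool:
    # Placeholder: in practice this would query the deploy pipeline state.
    return True


def set_feature_flag(flag: str, enabled: bool, override: bool = False) -> str:
    if deployment_in_progress() and not override:
        return (f"Refusing to set '{flag}': a production deployment is in "
                "progress. Pass override=True to force the change.")
    # ... toggle the flag here ...
    return f"Feature flag '{flag}' set to {enabled}."


print(set_feature_flag("enforced_sso_requires_session", True))
```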
What went well
- Support did a great job at creating the gitlab-ee issue and pointing all reporting customers to it so information could be centralized.
- After the issue was escalated to an incident, the cause was found within 1 minute by @stanhu and mitigated within 5 minutes by @ahanselka.
What can be improved
- We will improve the feature flag change process for better observability and awareness.
- We will be more responsive in the gitlab-ee issue so customers know we are working on it, even before an incident issue has been created.
- Creating and escalating incidents should be as easy as possible for everybody and well documented.
- We should broaden incident management training beyond Reliability Engineering to give other organizations at GitLab exposure to the process.
Corrective actions
- https://gitlab.com/gitlab-org/release/framework/issues/335
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6765
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6766
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6770
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6773
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6772
- https://gitlab.com/gitlab-org/gitlab-ee/issues/11757