Incident review: 2024-04-17: 500 errors on SAML login (17856)
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @mention the EOC, IMOC and other parties who were involved that we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works the best for those involved, in the agenda put a link to this review issue. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it.
  - Ensure that the meeting is recorded, when complete upload the recording to GitLab unfiltered.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers using SAML login with large groups (many subgroups).
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers with SAML login enabled could see 500 errors when users attempted to log in.
- How many customers were affected?
  - This cannot be determined exactly. We had 3000 failed requests, which we estimate affected around 400-3000 users.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - 3000 failed requests; about 10-13% of login requests failed.
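As a rough cross-check of the figures above, the failed-request count and the failure ratio together imply an approximate total number of login requests during the incident window. The sketch below only rearranges the numbers already stated (3000 failures, 10-13% failure rate). The function name is hypothetical and the result is a derived estimate, not a measured value.

```python
def estimate_total_logins(failed_requests, pct_low, pct_high):
    """Return (min_total, max_total) login requests implied by a failure-rate range.

    total = failed / rate, so the higher failure rate gives the lower total.
    Percentages are passed as whole numbers (10 means 10%).
    """
    min_total = round(failed_requests * 100 / pct_high)
    max_total = round(failed_requests * 100 / pct_low)
    return min_total, max_total


# Figures taken from this review: 3000 failed requests, 10-13% failure rate.
low, high = estimate_total_logins(3000, 10, 13)
print(f"Implied login requests during the incident: {low} to {high}")
```

Under these figures, the implied total is somewhere between roughly 23,000 and 30,000 login requests.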
What were the root causes?
A merge request (2024-04-17: 500 errors on SAML login (#17856 - closed)) introduced a seemingly trivial query change that caused a database query plan flip. Depending on the estimated group size, the database picked a different execution plan that was not efficient enough, which led to database statement timeouts.
More context and analysis can be found here: #17856 (comment 1868634963)
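To illustrate what a query plan flip means in general, here is a deliberately simplified toy cost model. It is not PostgreSQL's planner and it is not the query from this incident: the plan names, table size and per-row costs are all hypothetical. It only shows how a planner that picks the cheapest estimated plan can switch plans once the row estimate crosses a threshold.

```python
# Toy cost model for illustration only. This is NOT how PostgreSQL computes
# costs, and every constant below is a hypothetical value.
TABLE_ROWS = 1_000_000       # assumed total rows in the table
SEQ_COST_PER_ROW = 1.0       # assumed cost of reading one row sequentially
INDEX_COST_PER_ROW = 4.0     # assumed cost of fetching one row via an index


def plan_costs(estimated_rows):
    """Estimated cost of each candidate plan for a given row estimate."""
    return {
        "index_scan": estimated_rows * INDEX_COST_PER_ROW,
        "seq_scan": TABLE_ROWS * SEQ_COST_PER_ROW,
    }


def choose_plan(estimated_rows):
    """Pick the plan with the lowest estimated cost, as a cost-based planner would."""
    costs = plan_costs(estimated_rows)
    return min(costs, key=costs.get)


for estimate in (100, 10_000, 500_000):
    print(estimate, choose_plan(estimate))
```

In this toy model the choice flips at an estimate of 250,000 rows. The point is that the same query text can get a different plan purely because the estimate changed, and if the estimate is inaccurate, the chosen plan may be much slower than expected.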
Incident Response Analysis
- How was the incident detected?
  - Customer reports.
  - Elevated error rate on the SAML login endpoint.
- How could detection time be improved?
  - Lower the alerting threshold for `Groups::OmniauthCallbacksController` (500 errors).
- How was the root cause diagnosed?
  - Based on the Slack logs, the timing-out query appears to have been found in Kibana.
  - After that, the MR that likely introduced the query change was identified.
- How could time to diagnosis be improved?
  - Add an extra alert for this particular endpoint (SAML login).
- How did we reach the point where we knew how to mitigate the impact?
  - Once the newly deployed MR was found, undoing that change seemed like the most sensible thing to do.
- How could time to mitigation be improved?
  - ...
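The suggestion to lower the alerting threshold can be pictured with a minimal error-ratio check. This is a sketch only: the function name and both threshold values are hypothetical, since the actual alert configuration is not described in this review. It shows why a failure rate of around 10% can stay silent under a looser threshold but fire under a tighter one.

```python
def should_alert(error_count, request_count, threshold_ratio):
    """Return True when the error ratio meets or exceeds the threshold.

    With no traffic at all there is no ratio to evaluate, so no alert fires.
    """
    if request_count == 0:
        return False
    return error_count / request_count >= threshold_ratio


# Roughly 10% of login requests failed during the incident (figure from this review).
errors, requests = 10, 100

# Both thresholds below are hypothetical examples, not the real alert settings.
print(should_alert(errors, requests, threshold_ratio=0.15))  # looser threshold
print(should_alert(errors, requests, threshold_ratio=0.05))  # tighter threshold
```

A lower threshold catches smaller regressions sooner, at the cost of more alerts, which is why the corrective action pairs it with an alert volume review.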
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Query plan flips do happen from time to time in various places in the application.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Not that I know of. Detecting such problems is quite difficult: a small subquery change might affect hundreds of queries, and finding, testing and detecting these queries is not straightforward.
  - The closest item I found is an architecture blueprint: https://docs.gitlab.com/ee/architecture/blueprints/database/automated_query_analysis/
Potential corrective actions
- Check what kind of monitoring and alerting we have for SAML-related errors (500). If we have something in place, lower the detection threshold. (alert volume review)
- Document the behavior in our dev docs: Document `SELECT 1 + LIMIT 1` query issues with... (gitlab-org/gitlab#457010 - closed)
- Automated query analysis might be able to help here: https://docs.gitlab.com/ee/architecture/blueprints/database/automated_query_analysis/
What went well?
- ...
Edited by Adam Hegyi