Incident Review: 500 error when signing in with SAML on GitLab.com
Incident Review
Starting at 5:30 UTC on 2024-02-28, users of certain identity providers that integrate with GitLab.com using SAML began receiving 500 errors. The cause was identified as a security improvement that filters SAML response content out of logs. The parser used to redact unnecessary attributes from the SAML response raised an error, preventing these users from signing in. The offending code change was reverted and the revert deployed by 19:00 UTC on 2024-02-28. The impact covered about 16 groups on GitLab.com and roughly 8,200 sign-in attempts (including reattempts by users). A temporary workaround was to disable SSO.
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other parties who were involved with whom we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved and put a link to this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  Customers of GitLab.com that use SAML SSO with a subset of providers such as Cisco Duo, Akamai MFA, Keycloak, NetIQ, and PingOne MFA. Okta and Azure were not affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  Customers would have seen a 500 error when trying to sign in, and would not have had access to their SSO-enabled GitLab groups or their contents.
- How many customers were affected?
  1,052 users from 16 groups on GitLab.com were affected, with about 8,278 sign-in calls receiving the 500 response, based on https://log.gprd.gitlab.net/app/r/s/RYZkQ
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  A fairly accurate customer impact is known; see the answer to the previous question.
What were the root causes?
- A code change intended as a security improvement filters portions of the SAML response out of GitLab.com logs. This requires parsing the response so the sensitive portions can be redacted. The parser did not expect the payload format sent by certain providers, which raised an error and produced a 500 during sign-in completion. https://log.gprd.gitlab.net/app/r/s/VLhZ7
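The parsing code itself is not included in this review. As a hedged illustration only (the function names, XML handling, and fallback behaviour below are assumptions, not GitLab's implementation), the sketch shows the failure mode: a log redaction step that parses the SAML response and assumes a particular document shape, wrapped so a parsing failure degrades logging instead of surfacing as a 500 during sign-in.

```python
import base64
import xml.etree.ElementTree as ET

ASSERTION_TAG = "{urn:oasis:names:tc:SAML:2.0:assertion}Assertion"

def redact_saml_response(encoded_response: str) -> str:
    """Return a log-safe copy of the SAML response with assertions removed."""
    xml_bytes = base64.b64decode(encoded_response)
    root = ET.fromstring(xml_bytes)
    # Assumes assertions are direct children of the Response element; a
    # provider whose (still valid) payload does not match this assumption is
    # exactly the kind of input that can trip up a redaction parser.
    for assertion in root.findall(ASSERTION_TAG):
        root.remove(assertion)
    return ET.tostring(root, encoding="unicode")

def log_safe_saml_response(encoded_response: str) -> str:
    """Never let a redaction failure propagate into the sign-in path."""
    try:
        return redact_saml_response(encoded_response)
    except Exception:
        # Degrade logging, not sign-in: drop the payload instead of raising.
        return "[REDACTED: unparseable SAML response]"
```

In the incident, the equivalent of the redaction step raised on provider-specific payloads and the error propagated into the sign-in request; the try/except fallback above is one way to keep a logging-only concern from failing the request.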
Incident Response Analysis
- How was the incident detected?
  The incident was originally reported as a support ticket/emergency at 5:30 UTC, which the support team investigated together with the Authentication group. Due to the impact, it was raised as an incident; however, the revert rollout was impacted by another incident in flight.
- How could detection time be improved?
  The problem was detected via customer reports; synthetic checks for common providers would help detection time.
- How was the root cause diagnosed?
  The root cause was diagnosed by support on a sync call, by identifying the error the customer was experiencing and correlating it with recent changes in the area.
- How could time to diagnosis be improved?
  The time to diagnosis was 11 minutes between when the customer emergency was first triggered and when the support engineer identified the cause, which is a fairly short time. Advance notice of such changes to support may have helped reduce this further, but may generate too much noise.
- How did we reach the point where we knew how to mitigate the impact?
  During the emergency ticket evaluation, the mitigation path was identified as reverting the code change. The revert was merged at 6:27 UTC, within an hour of the report, but was blocked from deployment due to 2024-02-28: gstg-cny cannot list resource "secr... (#17674 - closed)
- How could time to mitigation be improved?
  - A feature flag around these changes could have mitigated the incident significantly faster, since disabling the new behaviour would not have required deploying a code change (see the sketch after this list).
  - Including the RM in the incident and blocking deployments would have allowed us to roll back to the last package deployed, which would have been faster than getting the reverted changes built and deployed.
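As a minimal sketch of the feature flag point above (the flag name, flag store, and function names are hypothetical, not GitLab's feature flag API), gating the new redaction path behind a runtime flag would have made mitigation a toggle rather than a revert that has to be built and deployed:

```python
# Hypothetical flag-gated logging path; not GitLab's actual code or flag API.
FEATURE_FLAGS = {"redact_saml_response_in_logs": True}

def flag_enabled(name: str) -> bool:
    # Stand-in for a real feature flag service; flipping the value takes
    # effect at runtime, with no new package to build or deploy.
    return FEATURE_FLAGS.get(name, False)

def redact_for_logs(encoded_response: str) -> str:
    # Placeholder for the redaction logic sketched in the root cause section.
    return "[REDACTED SAML response]"

def saml_response_for_logs(encoded_response: str) -> str:
    if flag_enabled("redact_saml_response_in_logs"):
        # New, riskier code path: parse/redact the response before logging.
        return redact_for_logs(encoded_response)
    # Previous behaviour when the flag is disabled: log the payload as
    # before, trading the logging improvement for sign-in availability
    # while the parser is fixed.
    return encoded_response
```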
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  We have not had logging payload related issues in the past.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  Alerts on sign-in attempt volume may have helped in this scenario, but outside of that, we don't have current issues in the backlog that would mitigate such an issue.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  The incident was triggered by a code change: SAML response in logs allow replay attacks (gitlab-org/gitlab#435308 - closed)
What went well?
- The support team was able to identify the root cause very quickly after the customer raised the ticket.
- The author of the original change created and merged the revert MR within the hour.
- The incident management team was able to gauge the impact of the issue and provided a temporary workaround to affected users.
- The issue did not affect providers with greater usage (e.g. Okta, Azure) and was instead limited to a subset of SAML providers (though still disruptive for those users).