Incident Review: 2024-06-25: Error Code A1001 on Duo Chat When Summarizing Issues/Epics

Key Information

Metric	Value
Customers Affected	550
Requests Affected	2553 (not counting job retries)
Incident Severity	severity3
Start Time	2024/06/25 12:13 UTC
End Time	2024/06/27 17:35 UTC
Total Duration	2 days 5 hours
Link to Incident Issue	https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191

Summary

The security fix https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4104 caused Duo Chat to replace the answer with an error message shortly after the answer was shown.

Details

2024-06-25T12:13:00Z first case occurs
2024-06-26T05:09:16.878Z MR 1 https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4215 created for review
2024-06-26T12:42:22.894Z MR 2 https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4216 created for review, because we had found the root cause and could make the change simpler
2024-06-26T13:41:27.012Z MR 3 gitlab-org/gitlab!157472 (merged) created for review. This was a quick fix, since I anticipated that MR 2 might take some time to merge.
2024-06-26T17:19:45.330Z MR 4 gitlab-org/gitlab!157508 (merged) created for review (by @lesley-r, thank you!). This fixes a separate bug that was discovered during this incident and was also required to resolve it.
2024-06-26T21:09:07.032Z MR 3 merged (delayed due to flaky pipeline failures)
2024-06-26T21:38:29.112Z MR 4 merged
2024-06-27T17:03:31Z production deployment started
2024-06-27T17:35:47.477Z last error
2024-06-27T17:53:52Z production deployment completed

Cause

The security fix ensures that the group SSO check still happens when chat is processed asynchronously, which requires making the session available inside the Sidekiq environment.

We accomplished this by:

Passing session_id when scheduling jobs. This is obtained by session.id.private_id. Note that session is an ActionDispatch::Request::Session.
The job uses the session_id to query the session information from Redis, which is a Hash. We set this hash as the session itself.
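The two steps above can be sketched roughly as follows. This is a minimal illustration in plain Ruby, not the code from the security fix: `FakeSessionId`, `FakeRequestSession`, `SESSION_STORE`, `schedule_completion`, `perform_completion`, and the `"sso_state"` key are all hypothetical stand-ins, and an in-memory Hash takes the place of Redis.

```ruby
require "digest"

# Hypothetical stand-in for the object returned by session.id.
FakeSessionId = Struct.new(:public_id) do
  # Assumption: the private id is derived from the public id by hashing.
  def private_id
    "2::#{Digest::SHA256.hexdigest(public_id)}"
  end
end

# Hypothetical stand-in for ActionDispatch::Request::Session.
class FakeRequestSession
  attr_reader :id

  def initialize(public_id, data)
    @id = FakeSessionId.new(public_id)
    @data = data
  end

  def to_h
    @data.dup
  end
end

# In-memory Hash standing in for Redis, keyed by the private session id.
SESSION_STORE = {}

# Web side: only the session id travels with the job arguments.
def schedule_completion(session, prompt)
  { "prompt" => prompt, "session_id" => session.id.private_id }
end

# Sidekiq side: look up the stored Hash and use it as the session.
def perform_completion(job_args)
  session = SESSION_STORE.fetch(job_args["session_id"], {})
  { prompt: job_args["prompt"], session: session }
end

web_session = FakeRequestSession.new("abc123", { "sso_state" => "verified" })
SESSION_STORE[web_session.id.private_id] = web_session.to_h

job_args = schedule_completion(web_session, "Summarize this issue")
result = perform_completion(job_args)
puts result[:session].class # the job sees a plain Hash, not a request session
```

The point to notice is the type change: the web side holds a request session object, while the job side only ever holds a Hash.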

The hash is set as the session in Sidekiq because:

SSO checker treats session as a hash
I was not able to find a way to cast the hash back into ActionDispatch::Request::Session

When testing this locally, it works as expected.

However, in production this became an issue, because one additional job is scheduled there that is not scheduled in the development environment. Duo Chat's CompletionWorker itself schedules another CompletionWorker to analyze the chat. Since this happens inside Sidekiq, the session is a Hash, but the scheduling logic expects an ActionDispatch::Request::Session (it calls session.id.private_id), so an exception is raised.

This was not observed earlier because the nested Sidekiq job is skipped in the development environment, so the bug could not be reproduced locally.
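The failure mode can be reproduced in isolation with a small sketch. The names below (`WebSessionId`, `WebSession`, `session_id_for_job`, `safe_session_id_for_job`, the `"sso_state"` key) are hypothetical, and the defensive variant at the end is only one possible approach, not necessarily what MR 3 or MR 4 did.

```ruby
# Hypothetical stand-ins for the session id and the web request session.
WebSessionId = Struct.new(:private_id)
WebSession = Struct.new(:id)

# Scheduling logic written with only the web request in mind.
def session_id_for_job(session)
  session.id.private_id
end

web_session = WebSession.new(WebSessionId.new("2::abc"))
sidekiq_session = { "sso_state" => "verified" } # what the job uses as its session

# Works for the web request session.
from_web = session_id_for_job(web_session)

# Fails inside the job: a Hash has no #id method.
nested_error = nil
begin
  session_id_for_job(sidekiq_session)
rescue NoMethodError => e
  nested_error = e
end

# One possible defensive variant (an assumption for illustration): fall back
# to the session id the current job was itself scheduled with.
def safe_session_id_for_job(session, fallback_id: nil)
  return session.id.private_id if session.respond_to?(:id)

  fallback_id
end

from_job = safe_session_id_for_job(sidekiq_session, fallback_id: "2::abc")
puts nested_error.class
```

Because the nested scheduling path only runs inside a job, any test that stops at the first job never exercises the Hash case.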

Outcomes/Corrective Actions

Avoid gating the logic using Rails.env.development? check
Consider asking the group::authentication or group::authorization team to work or collaborate on issues when the solution mostly requires their domain knowledge.
Instead of replacing the answer with error, keep the streamed answer and display the error as a separate UI item.
Check whether this scenario is covered in QA. If it is and the test still passes, that may be due to this bug's behavior: the chat first shows the proper answer, and only a few seconds later does the error message replace it.
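To illustrate the first corrective action, here is a rough contrast between an environment gate and an explicit setting. `AnalysisScheduler` and its parameters are hypothetical and stand in for whatever decides to schedule the nested job; the environment is passed in as a string so the sketch runs without Rails.

```ruby
# Hypothetical scheduler showing two gating styles; not GitLab's actual code.
class AnalysisScheduler
  attr_reader :scheduled

  def initialize(environment:, analysis_enabled: true)
    @environment = environment
    @analysis_enabled = analysis_enabled
    @scheduled = []
  end

  # Environment gate: the nested job never runs in development,
  # so any bug on that path stays hidden until production.
  def schedule_with_env_gate(prompt)
    return if @environment == "development"

    @scheduled << prompt
  end

  # Explicit setting: every environment runs the same path unless
  # someone deliberately opts out.
  def schedule_with_setting(prompt)
    return unless @analysis_enabled

    @scheduled << prompt
  end
end

dev = AnalysisScheduler.new(environment: "development")
dev.schedule_with_env_gate("analyze chat")
skipped_in_dev = dev.scheduled.empty?

dev.schedule_with_setting("analyze chat")
puts dev.scheduled.inspect
```

With the explicit setting, local runs and specs exercise the nested scheduling path by default, which is the property the environment gate removed.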

What went well?

We were able to iterate and find a temporary solution which requires less review time

Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

The DRI for the incident review is the issue assignee.

Set the title to Incident Review: (Incident issue name)
Assign a Service::* label (most likely matching the one on the incident issue)
Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager)
Announce the incident review in the incident channel on Slack.

:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.

Edited Jul 05, 2024 by Mark Chao