Incident Review: 2024-06-25: Error Code A1001 on Duo Chat When Summarizing Issues/Epics
Key Information
| Metric | Value |
|---|---|
| Customers Affected | 550 |
| Requests Affected | 2553 (not counting job retries) |
| Incident Severity | severity::3 |
| Start Time | 2024/06/25 12:13 UTC |
| End Time | 2024/06/27 17:35 UTC |
| Total Duration | 2 days 5 hours 22 minutes |
| Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191 |
Summary
The security fix https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4104 caused Duo Chat to replace the answer with an error message a few seconds after the answer was shown.
Details
- 2024-06-25T12:13:00Z first case occurs
- 2024-06-26T05:09:16.878Z MR 1 https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4215 created for review
- 2024-06-26T12:42:22.894Z MR 2 https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4216 created for review, because we had found the root cause and could make the change simpler
- 2024-06-26T13:41:27.012Z MR 3 gitlab-org/gitlab!157472 (merged) created for review. This was a quick fix, since I anticipated that MR 2 might take some time to merge.
- 2024-06-26T17:19:45.330Z MR 4 gitlab-org/gitlab!157508 (merged) created for review (by @lesley-r, thank you!). It fixes a separate bug that was discovered during this incident and was also required to resolve it.
- 2024-06-26T21:09:07.032Z MR 3 merged (delayed due to flaky pipeline failures)
- 2024-06-26T21:38:29.112Z MR 4 merged
- 2024-06-27T17:03:31Z production deployment started
- 2024-06-27T17:35:47.477Z last error
- 2024-06-27T17:53:52Z production deployment completed
Cause
The security fix ensures that the group SSO check happens while the chat request is processed asynchronously, which requires making the session available inside the Sidekiq environment.
We accomplished this by:
- Passing `session_id` when scheduling jobs. This is obtained by `session.id.private_id`. Note that `session` is an `ActionDispatch::Request::Session`.
- The job uses the `session_id` to query the session information from Redis, which is a Hash. We set this hash as the session itself.
The hash is set as the session in Sidekiq because:
- SSO checker treats session as a hash
- I was not able to find a way to cast the hash back into ActionDispatch::Request::Session
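As a rough illustration of the approach described above, the following plain-Ruby sketch shows why the worker ends up holding a Hash rather than a session object. `FakeRequestSession`, `FakeSessionId`, `FAKE_SESSION_STORE`, and the `sso_verified` key are hypothetical stand-ins invented for this example; they are not GitLab or Rails code, and the in-memory hash only plays the role of Redis. Only the `session.id.private_id` call shape comes from the description above.

```ruby
require "digest"

# Hypothetical stand-in for the Redis-backed session store:
# maps a private session id (String) to the session data (Hash).
FAKE_SESSION_STORE = {}

# Stand-in for a session id object that exposes #private_id.
FakeSessionId = Struct.new(:public_id) do
  def private_id
    # Illustrative derivation only; the real format is an implementation detail.
    "2::#{Digest::SHA256.hexdigest(public_id)}"
  end
end

# Stand-in for the web-request session: responds to #id and to Hash-style reads.
class FakeRequestSession
  attr_reader :id

  def initialize(public_id, data)
    @id = FakeSessionId.new(public_id)
    @data = data
    FAKE_SESSION_STORE[@id.private_id] = data
  end

  def [](key)
    @data[key]
  end
end

# Web side: only a String id is passed along as a job argument.
def job_args_for(session)
  { "session_id" => session.id.private_id }
end

# Worker side: the lookup returns the stored data, which is a plain Hash.
def restore_session(args)
  FAKE_SESSION_STORE.fetch(args["session_id"])
end

web_session = FakeRequestSession.new("abc123", { "sso_verified" => true })
restored = restore_session(job_args_for(web_session))

puts restored.class             # Hash
puts restored["sso_verified"]   # true
puts restored.respond_to?(:id)  # false
```

The restored Hash is enough for code that only reads keys, such as the SSO checker, but it no longer responds to `id`.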
When tested locally, this worked as expected.
In production, however, it became a problem because one additional job is scheduled outside the development environment: Duo Chat's `CompletionWorker` schedules another `CompletionWorker` to analyze the chat.
Because this happens inside Sidekiq, the session is a Hash, but the scheduling logic expects it to be an `ActionDispatch::Request::Session`, so an exception is raised.
This was not observed earlier because the nested Sidekiq job is skipped in the development environment, so the bug could not be reproduced locally.
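The failure mode can be reproduced in isolation with a small sketch. Everything here is hypothetical: `StubRequestSession`, `StubSessionId`, both helper methods, and the `"session_id"` key are made up for illustration, and the tolerant variant is only one possible guard, not necessarily what the MRs in the timeline implement.

```ruby
# Hypothetical stand-ins for a request-style session and its id object.
StubSessionId = Struct.new(:private_id)
StubRequestSession = Struct.new(:id)

# Scheduling logic that assumes a request-style session (responds to #id).
def naive_session_id(session)
  session.id.private_id
end

# One possible guard: accept either a request-style session or a restored
# Hash that carries its own id under a hypothetical "session_id" key.
def tolerant_session_id(session)
  return session.id.private_id if session.respond_to?(:id)

  session.fetch("session_id")
end

web_session = StubRequestSession.new(StubSessionId.new("2::abc"))
sidekiq_session = { "session_id" => "2::abc", "sso_verified" => true }

# From a web request, the naive version works.
puts naive_session_id(web_session)

# From inside a worker, the session is a Hash and Hash has no #id method.
begin
  naive_session_id(sidekiq_session)
rescue NoMethodError => e
  puts "nested scheduling failed: #{e.class}"
end

# The guarded version handles both shapes.
puts tolerant_session_id(sidekiq_session)
```

Because the nested scheduling path never ran in development, the `NoMethodError` branch was only reachable in production.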
Outcomes/Corrective Actions
- Avoid gating the logic using a `Rails.env.development?` check.
- Consider asking the group::authentication or group::authorization team to work or collaborate on issues if the solution mostly requires their domain knowledge.
- Instead of replacing the answer with an error, keep the streamed answer and display the error as a separate UI item.
- Check whether this is covered by QA. If it is and the test still passes, that may be due to this bug's behavior: the chat first shows the proper answer, and a few seconds later the error message replaces it.
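For the first action item, a minimal sketch of the difference between an environment gate and an explicit setting might look like the following. The method names and the `:analyze_chat` setting are hypothetical and not taken from the GitLab codebase; the environment name is passed in as a plain string so the example runs without Rails.

```ruby
# Environment-gated: the nested job never runs in development,
# so any bug in that path stays hidden there.
def schedule_nested_job_env_gated?(env_name)
  env_name != "development"
end

# Explicit setting: the same code path runs everywhere by default,
# and skipping it is a deliberate, visible choice.
def schedule_nested_job?(settings)
  settings.fetch(:analyze_chat, true)
end

puts schedule_nested_job_env_gated?("development")  # false
puts schedule_nested_job_env_gated?("production")   # true
puts schedule_nested_job?({})                       # true
puts schedule_nested_job?({ analyze_chat: false })  # false
```

With the explicit setting, a local run exercises the nested scheduling path unless a developer opts out.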
What went well?
- We were able to iterate and find a temporary solution that required less review time.
Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
The DRI for the incident review is the issue assignee.
- Set the title to `Incident Review: (Incident issue name)`
- Assign a `Service::*` label (most likely matching the one on the incident issue)
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager)
- Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.