
Incident Review: 2024-06-25: Error Code A1001 on Duo Chat When Summarizing Issues/Epics

Key Information

Metric | Value
Customers Affected | 550
Requests Affected | 2553 (not counting job retries)
Incident Severity | severity::3
Start Time | 2024/06/25 12:13 UTC
End Time | 2024/06/27 17:35 UTC
Total Duration | 2 days 7 hours
Link to Incident Issue | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18191

Summary

The security fix https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/4104 caused Duo Chat to display an error message (code A1001) after the answer had already been shown.

Details

Cause

The security fix ensures that the group SSO check happens while the chat is processed asynchronously, which requires making the session available inside the Sidekiq environment.

We accomplished this by:

  1. Passing session_id when scheduling jobs. It is obtained from session.id.private_id. Note that session is an ActionDispatch::Request::Session.
  2. The job uses the session_id to look up the session data in Redis, which comes back as a Hash. We set this hash as the session itself.
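The two steps above can be sketched in plain Ruby. This is a minimal illustration, not the code from the fix: SketchRequestSession, SKETCH_SESSION_STORE, job_arguments, session_for_job, and the "sso_sign_ins" key are all hypothetical names, and an in-memory Hash stands in for Redis.

```ruby
# Hypothetical stand-ins for illustration; none of these names come from the actual fix.
SketchSessionId = Struct.new(:private_id)

# Mimics only the slice of ActionDispatch::Request::Session used here: `id` and key reads.
class SketchRequestSession
  attr_reader :id

  def initialize(private_id, data = {})
    @id = SketchSessionId.new(private_id)
    @data = data
  end

  def [](key)
    @data[key]
  end
end

# In-memory stand-in for the Redis session store, keyed by the private session id.
SKETCH_SESSION_STORE = {
  "private-id-abc" => { "sso_sign_ins" => { 42 => "2024-06-25T12:00:00Z" } }
}.freeze

# Step 1 (web request): pass only the session id along with the job arguments.
def job_arguments(session)
  { "session_id" => session.id.private_id }
end

# Step 2 (Sidekiq job): look up the stored data and use the resulting Hash as the session.
def session_for_job(args)
  SKETCH_SESSION_STORE.fetch(args["session_id"], {})
end

web_session = SketchRequestSession.new("private-id-abc")
job_session = session_for_job(job_arguments(web_session))
puts job_session.class # the job holds a plain Hash, not a request session object
```

The point to notice is the type change: the web request holds an object with an id, while the job ends up with a bare Hash that only supports key lookups.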

The hash is set as the session in Sidekiq for two reasons:

  1. The SSO checker treats the session as a hash.
  2. I was not able to find a way to cast the hash back into an ActionDispatch::Request::Session.

When tested locally, this worked as expected.

In production, however, this became an issue, because one additional job is scheduled in every environment except development: Duo Chat's CompletionWorker itself schedules another CompletionWorker to analyze the chat. Since this scheduling happens inside Sidekiq, the session is a Hash, but the scheduling logic expects an ActionDispatch::Request::Session, so an exception is raised.
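The mismatch can be reproduced in isolation. The sketch below assumes the scheduling logic reads session.id.private_id as described earlier; NestedRequestSession, session_id_for_scheduling, and try_schedule are hypothetical names. In plain Ruby a Hash has no id method, so the call fails with NoMethodError. The exact exception seen in production is not stated in this review.

```ruby
# Hypothetical stand-ins; the real scheduling code is not shown in this review.
NestedSessionId = Struct.new(:private_id)
NestedRequestSession = Struct.new(:id)

# Scheduling logic that assumes a request-session-like object.
def session_id_for_scheduling(session)
  session.id.private_id
end

# Returns [:ok, id] on success or [:error, exception class] on failure.
def try_schedule(session)
  [:ok, session_id_for_scheduling(session)]
rescue NoMethodError => e
  [:error, e.class]
end

web_session = NestedRequestSession.new(NestedSessionId.new("private-id-abc"))
job_session = { "sso_sign_ins" => {} } # what the job holds after the Redis lookup

p try_schedule(web_session) # scheduling from a web request succeeds
p try_schedule(job_session) # scheduling from inside the job fails
```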

This was not observed earlier because the nested Sidekiq job is skipped in the development environment, so the bug could not be reproduced locally.
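A small sketch shows how an environment gate hides a code path from local testing. It is an assumption-laden illustration: the method names, the job symbols, and the nested_enabled setting are hypothetical, and the env string stands in for a Rails.env.development? check.

```ruby
# Hypothetical sketch; method names, job symbols, and the setting are illustrative only.

# Gated on the environment: the nested job never runs in development,
# so any bug in it stays invisible on a developer machine.
def jobs_with_env_gate(env)
  jobs = [:completion]
  jobs << :nested_completion unless env == "development"
  jobs
end

# Gated on an explicit setting that defaults to on: development exercises
# the same path as production unless someone opts out deliberately.
def jobs_with_explicit_setting(nested_enabled: true)
  jobs = [:completion]
  jobs << :nested_completion if nested_enabled
  jobs
end

p jobs_with_env_gate("development")
p jobs_with_env_gate("production")
p jobs_with_explicit_setting
```

An explicit setting is only one possible alternative. It keeps the option of skipping expensive work locally while making the difference between environments visible and testable.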

Outcomes/Corrective Actions

  • Avoid gating logic behind a Rails.env.development? check.
  • Consider asking the group::authentication or group::authorization team to work on, or collaborate on, issues whose solution mostly requires their domain knowledge.
  • Instead of replacing the answer with an error, keep the streamed answer and display the error as a separate UI item.
  • Check whether this scenario is covered in QA. If it is and the test still passes, that may be due to this bug's behavior: the chat first shows the proper answer, and a few seconds later the error message replaces it.

What went well?

  1. We were able to iterate and find a temporary solution that required less review time.

Guidelines

This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.

The DRI for the incident review is the issue assignee.

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Find and Assign a DRI from the team which owns the service (check their slack channel or assign the team's manager)
  • Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.
Edited by Mark Chao