2024-05-24: Incident Review: code suggestions returning 403 in prod and staging
Incident Review
The DRI for the incident review is the issue assignee.

- Announce the incident review in the incident channel on Slack:
  - :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI. If you have any review feedback please add it to <ISSUE_LINK>.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other involved parties with whom we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved and link this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact

- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All users of Code Suggestions on gitlab.com using any editor extension (including the Web IDE).
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Attempts to use Code Suggestions would fail and the "Code Suggestions" status would show an error state.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ~80% of all Code Suggestions users; we later narrowed this down to only GitLab.com users being affected.
What were the root causes?
- After introducing "Introduce mapping for duo_pro add_on" (gitlab-org/gitlab!153711, merged in 17.1), the previous values were still being cached, causing requests to the AI gateway to be missing the required `code_suggestions`/`duo_pro` scope.
The AI gateway started to respond with 403 because token authentication failed: the token did not carry the correct claim.
We did not supply the correct claim because the mentioned MR changed how we read the Cloud Connector configuration.
We also updated the configuration (for .com) accordingly, and the new logic was correct, but we did not account for the old configuration still being cached (under `CLOUD_CONNECTOR_SERVICES_KEY`).
Manually resetting the cache saved the day.
A good analogy: this was not a "zero downtime" change. To account for the cache, we should have introduced logic that works with both the old and new configurations, waited for the cache to expire (or reset it manually via a CR if needed), and only then dropped the old logic (see the sketches below).
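As an illustration of the failure mode, here is a minimal sketch (not the actual AI gateway code; `REQUIRED_SCOPES` and `authorize` are hypothetical names) of how a token whose claims lack the `code_suggestions`/`duo_pro` scope ends up with a 403:

```python
# Hedged sketch only: hypothetical names, not the real gateway implementation.
REQUIRED_SCOPES = {"code_suggestions", "duo_pro"}  # assumption: either scope grants access


def authorize(claims: dict) -> int:
    """Return the HTTP status the gateway would answer with for /v2/code/completions."""
    token_scopes = set(claims.get("scopes", []))
    if token_scopes & REQUIRED_SCOPES:
        return 200
    # A token issued from the stale cached configuration lacks the scope entirely.
    return 403


print(authorize({"scopes": ["code_suggestions"]}))  # 200 - token built from the new config
print(authorize({"scopes": []}))                    # 403 - token built from the stale cached config
```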
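And a sketch of the "works with both configurations" approach described above, again with hypothetical names and data shapes (the real cached structure may differ):

```python
# Hedged sketch: keep read logic that understands both the old and the new cached
# shape until the cache has expired (or was reset manually), then drop the old branch.
def scopes_for(feature: str, config: dict) -> list[str]:
    if "add_ons" in config:                       # assumed new shape
        return config["add_ons"].get(feature, [])
    return config.get(feature, [])                # assumed old shape, possibly still cached


old_cached_config = {"code_suggestions": ["code_suggestions"]}
new_config = {"add_ons": {"code_suggestions": ["code_suggestions", "duo_pro"]}}

# Both shapes still resolve to usable scopes, so issued tokens keep the required
# claim regardless of whether the cached copy has expired yet.
print(scopes_for("code_suggestions", old_cached_config))
print(scopes_for("code_suggestions", new_config))
```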
Incident Response Analysis
- How was the incident detected?
  - E2E tests for Code Suggestions in the Web IDE started failing - https://gitlab.com/gitlab-org/gitlab/-/issues/463470
  - Reports from several team members who started experiencing 403 Forbidden errors when attempting to use Code Suggestions in the Web IDE, VS Code, and JetBrains IDEs.
- How could detection time be improved?
  - We could introduce a monitor on an anomalous number of non-200 responses for the /v2/code/completions endpoint in the AI gateway (see the sketch after this list).
- How was the root cause diagnosed?
  - @alipniagov identified the Duo add-on changes that were recently merged in the monolith.
- How could time to diagnosis be improved?
  - Save the Elastic filters used by @francoisrose and others (shared as internal notes in the incident) for when Code Suggestions-related issues come up.
  - Introduce a runbook for determining whether authentication/authorization errors are due to changes in our extensions, the language server, the monolith, or the AI gateway.
- How did we reach the point where we knew how to mitigate the impact?
  - @shinya.maeda agreed with @alipniagov's assessment of the issue and suggested we clear the cache, based on what we knew, to prevent a mix of old/new values in the system.
- How could time to mitigation be improved?
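Regarding the detection-time improvement above, a minimal sketch of what such a monitor could count, using `prometheus_client` (the metric name and labels are assumptions, not an existing AI gateway metric):

```python
# Hedged sketch: a per-status counter for /v2/code/completions that an alert
# could watch for an anomalous share of non-200 responses.
from prometheus_client import Counter, start_http_server

COMPLETIONS_RESPONSES = Counter(
    "code_completions_responses_total",  # hypothetical metric name
    "Responses for /v2/code/completions by HTTP status",
    ["status"],
)


def record_response(status: int) -> None:
    COMPLETIONS_RESPONSES.labels(status=str(status)).inc()


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
    record_response(200)
    record_response(403)     # a sudden spike here would have flagged this incident
```

An alert could then fire when the non-200 share of this counter exceeds a threshold over a short window.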
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - A stale cache / not accounting for a cache is a common problem. However, this is the first time caching the Cloud Connector config has resulted in a production incident, so the answer is no.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No, but we prepared a corrective action. Thanks Shinya for opening an MR: gitlab-org/gitlab!154094 (merged; it was still in review at the time of writing). Switching from Redis to memoization should eliminate this issue and also allow us to simplify the local testing/dev setup of the CC config and its CI validation (see the sketch after this list). The performance impact is not noticeable.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, by the deployment of gitlab-org/gitlab!153711 (merged).
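For the corrective action above, a minimal sketch of the direction (not the actual MR; `load_cloud_connector_config` is a hypothetical loader): replacing the cross-process Redis cache with per-process memoization means a stale copy of the configuration cannot outlive a deployment.

```python
# Hedged sketch: memoize the parsed configuration in-process instead of caching it in Redis.
from functools import lru_cache


def load_cloud_connector_config() -> dict:
    # Hypothetical placeholder for reading the configuration shipped with the release.
    return {"add_ons": {"code_suggestions": ["code_suggestions", "duo_pro"]}}


@lru_cache(maxsize=1)
def cloud_connector_config() -> dict:
    return load_cloud_connector_config()


print(cloud_connector_config())  # later calls in the same process reuse this copy
```

Because the memoized value lives and dies with the process, every deploy or restart picks up the current configuration, unlike the Redis entry that survived the code change here.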
What went well?
- ...