Corrective actions for "Some users of AI may get 401 unauthorized"
Description
This is the follow-up issue for corrective actions required for this incident.
In summary, we recently rotated the JWT signing key for CustomersDot, see https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/7112+s. While the key rotation itself was successful, we discovered that the JWKS cache in the AI Gateway needs to be invalidated immediately when such a change goes live. Because the cache was not invalidated at the time, some users of AI features received 401 Unauthorized errors.
To resolve this, we are taking the following actions:
- [-] Roll back the key rotation change (MR https://gitlab.com/gitlab-org/customers-gitlab-com/-/merge_requests/10562+s)
- Add environment variable ROTATE_PODS_INCIDENT_18348=true to each AIGW deployment to quickly rotate the pods
- Operational runbook for quickly rotating instances (gitlab-com/gl-infra/platform/runway/team#331 - closed)
- https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/10392+
- [-] Expose latest fetched cdot metadata in aigw as a metric, potentially with a numeric revision label.
  - I removed this because I don't think this is going to help. We already have logging in place now to see the exact state in the AI gateway as regards JWKS updates.
- [-] https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/584+
- https://gitlab.com/gitlab-org/gitlab/-/issues/496509+
- https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/640+
Once these actions are complete, we expect the 401 errors to cease and AI features to function normally again.
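For illustration only, here is a minimal sketch of the kind of cache behavior the summary calls for: a JWKS cache that refetches the key set when a token arrives with an unknown `kid`, instead of waiting for its TTL to expire. This is not the AI Gateway's actual implementation. The class name `JwksCache`, the injected `fetch_jwks` callable, and the TTL value are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional


class JwksCache:
    """Hypothetical in-memory JWKS cache keyed by key ID (`kid`)."""

    def __init__(
        self,
        fetch_jwks: Callable[[], Dict[str, dict]],  # assumed: returns {kid: jwk}
        ttl_seconds: float = 3600.0,                # assumed TTL, not a real setting
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_jwks = fetch_jwks
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, dict] = {}
        self._fetched_at: Optional[float] = None

    def _expired(self) -> bool:
        # True if nothing has been fetched yet or the TTL has elapsed.
        return (
            self._fetched_at is None
            or self._clock() - self._fetched_at >= self._ttl
        )

    def refresh(self) -> None:
        # Replace the whole key set so that retired keys are dropped too.
        self._keys = dict(self._fetch_jwks())
        self._fetched_at = self._clock()

    def get_key(self, kid: str) -> Optional[dict]:
        refreshed = False
        if self._expired():
            self.refresh()
            refreshed = True
        if kid not in self._keys and not refreshed:
            # Unknown kid: the issuer may have rotated its signing key,
            # so refetch once before treating the token as unauthorized.
            self.refresh()
        return self._keys.get(kid)
```

A caller would return 401 only when `get_key` returns `None` after the forced refresh. A production version would likely also rate-limit forced refreshes, since tokens with arbitrary unknown `kid` values could otherwise trigger a fetch on every request.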
Edited by Aleksei Lipniagov