2025-03-11: AI Gateway: error rate SLO violation
AI Gateway: error rate SLO violation (Severity 3 (Medium))
Problem: The AI Gateway is violating its error rate SLO after the SLO was tightened.
Impact: Customers may receive 5xx errors in response to their code completion requests. Other AI endpoints may also return spurious 5xx errors, but we have not observed this, likely because of their lower request volume.
Causes: AI Gateway uses the "pay as you go" capacity allocation of Vertex AI (https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429). Under this allocation model, Vertex AI returns spurious HTTP 429 ("too many requests") errors. AI Gateway reports each of these to the customer as an HTTP 5xx error, which counts against AI Gateway's error budget. Previously, the SLO was lax enough to absorb the spurious error rate. With the tightened SLO, the same pre-existing error rate now causes an SLO violation.
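Illustration: the mechanism described under Causes can be sketched as a small calculation. This is a minimal sketch, not the actual alerting logic. The request counts and both SLO thresholds are hypothetical values chosen for illustration; the ticket does not state the real error rate or SLO targets. The function names are likewise made up.

```python
# Sketch: an unchanged error rate passes a lax SLO but violates a tighter one.
# All numbers below are hypothetical; the ticket does not give real figures.


def error_rate(total_requests: int, failed_requests: int) -> float:
    """Fraction of requests that ended in a customer-facing 5xx."""
    if total_requests == 0:
        return 0.0
    return failed_requests / total_requests


def violates_slo(total_requests: int, failed_requests: int,
                 max_error_rate: float) -> bool:
    """True when the observed error rate exceeds the allowed rate."""
    return error_rate(total_requests, failed_requests) > max_error_rate


# Hypothetical traffic: every upstream 429 surfaces as a 5xx to the customer,
# so each one counts as a failed request against the error budget.
TOTAL_REQUESTS = 1_000_000
UPSTREAM_429S = 2_000  # assumed spurious error volume (0.2%)

OLD_MAX_ERROR_RATE = 0.005  # assumed lax SLO: up to 0.5% errors allowed
NEW_MAX_ERROR_RATE = 0.001  # assumed tightened SLO: up to 0.1% errors allowed

if __name__ == "__main__":
    print("violates old SLO:",
          violates_slo(TOTAL_REQUESTS, UPSTREAM_429S, OLD_MAX_ERROR_RATE))
    print("violates new SLO:",
          violates_slo(TOTAL_REQUESTS, UPSTREAM_429S, NEW_MAX_ERROR_RATE))
```

With these assumed numbers, the error rate is identical in both checks; only the threshold changes, which is why the violation appeared without any change in AI Gateway behavior.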
Response strategy: Revert the SLO change to silence the alerts. As a follow-up, ask product to decide whether the less strict SLO is acceptable or whether we need to pay for guaranteed Vertex AI capacity.
This ticket was created to track INC-188, by incident.io