Update AI documentation for the course of action with upstream providers

We need to clarify and update our documentation on the course of action for incidents with upstream providers.

This ticket was created from INC-704 and was automatically exported by incident.io 🔥


The problem was that code completion requests for vertex_ai/codestral-2501 from users of self-managed instances were failing with errors like the following:

litellm.RateLimitError: litellm.RateLimitError: VertexAIException - HTTPStatusError - {
  "error": {
    "code": 429,
    "message": "Resource exhausted. Please try again later. Please refer to https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429 for more details.",
    "status": "RESOURCE_EXHAUSTED"
  }
}

GCP documentation states:

With a Provisioned Throughput subscription, you can reserve an amount of throughput for specific generative AI models. If you don't have a Provisioned Throughput subscription and resources aren't available to your application, then an error code 429 is returned.
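
In other words, without Provisioned Throughput these 429s signal GCP-side capacity exhaustion, not client-side quota overuse, so the only client-side mitigation is to retry with backoff. As a minimal sketch (not the AI Gateway's actual retry logic; `TransientCapacityError` is a stand-in for `litellm.RateLimitError`, and the function names are hypothetical):

```python
import random
import time


class TransientCapacityError(Exception):
    """Stand-in for litellm.RateLimitError (HTTP 429 RESOURCE_EXHAUSTED)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.01):
    """Retry a provider call on transient 429s with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientCapacityError:
            if attempt == max_attempts - 1:
                raise  # capacity never freed up; surface the error
            # Exponential backoff (base * 2^attempt) plus random jitter
            # to avoid synchronized retries from many clients.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))


# Simulate a provider that is out of capacity for the first two calls.
attempts = {"n": 0}


def flaky_completion():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TransientCapacityError("429 RESOURCE_EXHAUSTED")
    return "completion text"


print(call_with_backoff(flaky_completion))  # succeeds on the third attempt
```

Retrying is best-effort here: since the exhaustion is on GCP's side, backoff only helps if capacity frees up within the retry window.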

The AI Gateway runbook has a section on performance and scalability. It covers internal infrastructure utilization metrics and external quotas and rate limits, but not external resource exhaustion.

We should:

  • Confirm that this is expected behavior given our arrangement with Google
    • If so, add a note to the GCP quotas section of the runbook explaining that this error appears in logs as a rate limit, but it is caused by GCP running out of capacity, not by us exceeding our quota.
Edited by Mark Lapierre