Document how to use `MODEL_ENGINE_CONCURRENCY_LIMITS`
## Issue
We need to document how to use the `MODEL_ENGINE_CONCURRENCY_LIMITS`
environment variable in AI Gateway.
## Proposal
Can we follow up to document how to use `MODEL_ENGINE_CONCURRENCY_LIMITS`? I assume that:
- We control the environment variable in Vault for the staging and production environments.
- We have to update these limits manually, by referencing the third-party API documentation or contacting the provider. The value must be re-synced manually whenever they bump the concurrency limit, e.g. https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/370+.
- This value is not used for client-side rate limiting. AI Gateway could still issue more concurrent requests than `MODEL_ENGINE_CONCURRENCY_LIMITS` allows, because an individual Cloud Run node can't see the current number of concurrent requests across the fleet. Hence, this environment variable serves metrics purposes only.
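As a minimal sketch of the metrics-only usage described above: the snippet below parses the variable and returns a per-engine limit map that could be exported as gauges. The value format (`engine=limit` pairs, comma-separated) and the function names are assumptions for illustration, not AI Gateway's actual implementation.

```python
import os


def parse_concurrency_limits(raw: str) -> dict[str, int]:
    """Parse a MODEL_ENGINE_CONCURRENCY_LIMITS-style string.

    Assumed (hypothetical) format: comma-separated "engine=limit" pairs,
    e.g. "vertex-ai=50,anthropic=100". The real format may differ.
    """
    limits: dict[str, int] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        engine, _, limit = pair.partition("=")
        limits[engine.strip()] = int(limit)
    return limits


def configured_limits() -> dict[str, int]:
    # Metrics-only: nothing here enforces a limit. A single Cloud Run node
    # cannot see in-flight requests across the fleet, so these numbers are
    # only suitable for exporting (e.g. as Prometheus gauges) and comparing
    # against observed request rates.
    return parse_concurrency_limits(
        os.environ.get("MODEL_ENGINE_CONCURRENCY_LIMITS", "")
    )
```

An actual exporter would set one gauge per engine from this map; the enforcement itself stays with the upstream provider.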
## Auto generated
The following discussion from !522 (merged) should be addressed:
- @reprazent started a discussion: (+2 comments) @shinya.maeda @tle_gitlab @achueshev Would any of you mind reviewing this?