Document how to use `MODEL_ENGINE_CONCURRENCY_LIMITS`
## Issue
We need to document how to use the `MODEL_ENGINE_CONCURRENCY_LIMITS`
environment variable in AI Gateway.
## Proposal
Can we follow up to document how to use `MODEL_ENGINE_CONCURRENCY_LIMITS`? I assume that:
- We control the environment variable in Vault for the staging and production environments.
- We have to update these limits manually, by referencing the third-party API documentation or contacting the provider. The value must be re-synced manually whenever they bump the concurrency limit, e.g. https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/issues/370+.
- This value is not used for client-side rate limiting. AI Gateway could still issue more concurrent requests than `MODEL_ENGINE_CONCURRENCY_LIMITS` allows, because an individual Cloud Run node can't see the current number of concurrent requests across the fleet. Hence, this environment variable serves metrics purposes only.
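As a minimal sketch of the metrics-only usage described above: the snippet below parses the variable and returns a per-engine limit map that could be exported as gauges. The value format (`engine=limit` pairs, comma-separated) and the function names are assumptions for illustration, not AI Gateway's actual implementation.

```python
import os


def parse_concurrency_limits(raw: str) -> dict[str, int]:
    """Parse a MODEL_ENGINE_CONCURRENCY_LIMITS-style string.

    Assumed (hypothetical) format: comma-separated "engine=limit" pairs,
    e.g. "vertex-ai=50,anthropic=100". The real format may differ.
    """
    limits: dict[str, int] = {}
    for pair in raw.split(","):
        if not pair.strip():
            continue
        engine, _, limit = pair.partition("=")
        limits[engine.strip()] = int(limit)
    return limits


def configured_limits() -> dict[str, int]:
    # Metrics-only: nothing here enforces a limit. A single Cloud Run node
    # cannot see in-flight requests across the fleet, so these numbers are
    # only suitable for exporting (e.g. as Prometheus gauges) and comparing
    # against observed request rates.
    return parse_concurrency_limits(
        os.environ.get("MODEL_ENGINE_CONCURRENCY_LIMITS", "")
    )
```

An actual exporter would set one gauge per engine from this map; the enforcement itself stays with the upstream provider.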
## Auto generated
The following discussion from !522 (merged) should be addressed:
- @reprazent started a discussion: (+2 comments) @shinya.maeda @tle_gitlab @achueshev Would any of you mind reviewing this?