Increase observability of slow RPC calls caused by contention around spawn tokens
We regularly get reports about incidents where some RPCs start to show excessive latencies. The root cause is typically not that the RPC's logic is slow, but that there is contention around the spawn token or around RPC rate limits. This is hard to observe, though, because we track neither of those via any metrics.

- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6117
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6085
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5880
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5717

We should increase visibility by introducing new metrics:

1. Track how long it took to acquire spawn tokens.
2. Track how long the RPCs have been queued in the rate limiter.
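
As a rough illustration of the first metric, the sketch below times how long a caller waits for a spawn token. All names here (`spawnTokens`, `acquire`, `observe`, `demo`) are hypothetical and not the project's actual API. The token pool is modelled as a plain buffered channel, and `observe` stands in for whatever metrics sink would be used, for example a histogram's observe method.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// spawnTokens is a hypothetical counting semaphore that limits how many
// processes may be spawned concurrently.
type spawnTokens struct {
	tokens  chan struct{}
	observe func(wait time.Duration) // assumed metrics hook, e.g. a histogram
}

func newSpawnTokens(n int, observe func(time.Duration)) *spawnTokens {
	return &spawnTokens{tokens: make(chan struct{}, n), observe: observe}
}

// acquire blocks until a token is free or ctx is done. The time spent
// waiting is reported in both cases, so timeouts show up in the metric too.
func (s *spawnTokens) acquire(ctx context.Context) (release func(), err error) {
	start := time.Now()
	defer func() { s.observe(time.Since(start)) }()

	select {
	case s.tokens <- struct{}{}:
		return func() { <-s.tokens }, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}

// demo takes the only token without contention, then requests a second one
// while the first is still held for roughly 50ms. It returns both observed
// waits in milliseconds.
func demo() (uncontendedMs, contendedMs int64) {
	var waits []time.Duration
	st := newSpawnTokens(1, func(d time.Duration) { waits = append(waits, d) })
	ctx := context.Background()

	release, _ := st.acquire(ctx) // cannot fail: ctx is never cancelled
	go func() {
		time.Sleep(50 * time.Millisecond)
		release()
	}()

	release2, _ := st.acquire(ctx) // blocks until the goroutine releases
	release2()

	return waits[0].Milliseconds(), waits[1].Milliseconds()
}

func main() {
	fast, slow := demo()
	fmt.Printf("uncontended wait: %dms, contended wait: %dms\n", fast, slow)
}
```

The same pattern, recording a start time on entry and observing the elapsed time once the caller is admitted, could plausibly cover the second metric as well, with the rate limiter's queue in place of the token channel.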