Review gRPC settings for model-gateway
Corrective action from https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15657.
In incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15657, gRPC client-side load balancing appears to have failed, creating a hot spot on two of the four running Triton servers.
During the incident, we saw log entries that correlate with it:
ipv4:10.4.1.3:8001: Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". Current keepalive time (before throttling): 30000ms
https://cloudlogging.app.goo.gl/xkFd2hJBfhsD4AS96
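For context, a gRPC server sends a GOAWAY with "too_many_pings" when a client sends keepalive pings more often than the server's ping policy allows. The sketch below shows that comparison. It assumes (this is not confirmed from the incident) that the Triton servers enforce a minimum ping interval of 300000 ms; `pings_too_frequent` and `ASSUMED_SERVER_MIN_PING_INTERVAL_MS` are hypothetical names used only for illustration.

```python
# Assumption: the server-side minimum interval between pings. The real value
# must be read from the Triton server configuration; 300000 ms is a guess.
ASSUMED_SERVER_MIN_PING_INTERVAL_MS = 300_000


def pings_too_frequent(
    client_keepalive_time_ms: int,
    server_min_ping_interval_ms: int = ASSUMED_SERVER_MIN_PING_INTERVAL_MS,
) -> bool:
    """Return True if the client pings more often than the server permits."""
    return client_keepalive_time_ms < server_min_ping_interval_ms


# The client keepalive time reported in the log line above is 30000 ms.
client_keepalive_ms = 30_000
if pings_too_frequent(client_keepalive_ms):
    print("client keepalive is shorter than the assumed server minimum")
```

If the assumption holds, either the client's keepalive time would need to be raised or the server's permitted ping interval lowered, so that the two sides agree.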
Additionally, the following message appears in the logs at high frequency. It is probably not the cause, but we should fix it at the same time:
grpc.keepalive_permit_without_calls treated as bool but set to 2 (assuming true)
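As a starting point for the review, here is a minimal sketch of how the client channel options could be built and checked. The channel argument names are standard gRPC ones, but the values, the helper names (`build_channel_options`, `find_invalid_boolean_args`), and the choice of `round_robin` are assumptions for illustration, not the model-gateway's current or recommended settings.

```python
# Channel arguments that gRPC interprets as booleans (0 or 1).
BOOLEAN_CHANNEL_ARGS = {"grpc.keepalive_permit_without_calls"}


def build_channel_options(
    keepalive_time_ms: int = 300_000,    # illustrative value, an assumption
    keepalive_timeout_ms: int = 20_000,  # illustrative value, an assumption
    permit_without_calls: bool = True,
):
    """Build a list of (name, value) gRPC channel options."""
    return [
        # Spread calls across all resolved backends instead of using only
        # the first address (gRPC's default policy is pick_first).
        ("grpc.lb_policy_name", "round_robin"),
        ("grpc.keepalive_time_ms", keepalive_time_ms),
        ("grpc.keepalive_timeout_ms", keepalive_timeout_ms),
        # Emit exactly 0 or 1 so gRPC does not warn about a non-bool value.
        ("grpc.keepalive_permit_without_calls", 1 if permit_without_calls else 0),
    ]


def find_invalid_boolean_args(options):
    """Return the boolean-typed options whose value is neither 0 nor 1."""
    return [
        (name, value)
        for name, value in options
        if name in BOOLEAN_CHANNEL_ARGS and value not in (0, 1)
    ]


options = build_channel_options()
assert find_invalid_boolean_args(options) == []
```

The resulting list can be passed to the client as `grpc.insecure_channel(target, options=options)`, where `target` is the address of the Triton servers. Running `find_invalid_boolean_args` over the existing configuration would flag the value of 2 reported in the log message above.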
Edited by Andrew Newdigate