
Add gRPC client metrics to model-gateway

Corrective action from https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15657.

In incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/15657, gRPC client-side load balancing appears to have failed, concentrating traffic on two of the four running Triton servers.

This led to GPU saturation on those pods, which in turn caused latency spikes and Apdex drops on the service.
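To illustrate the hot-spot effect, here is a toy simulation of round-robin selection over a pool of four backends compared with a pool that has shrunk to two. The server names and request count are hypothetical, and the shrinking pool is only one possible explanation; the actual cause of the incident is still unknown.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, pool):
    """Assign `requests` calls to the backends in `pool` in round-robin order."""
    picker = cycle(pool)
    return Counter(next(picker) for _ in range(requests))


# Hypothetical backend names; the real pod names are not given in this issue.
servers = ["triton-0", "triton-1", "triton-2", "triton-3"]

# All four backends in the client's pool: load is spread evenly.
healthy = distribute(1000, servers)

# Only two backends left in the pool: each takes twice the load.
degraded = distribute(1000, servers[:2])

print("healthy: ", dict(healthy))
print("degraded:", dict(degraded))
```

With the same request volume, each remaining backend in the degraded case handles twice as many calls as in the healthy case, which is consistent with the GPU saturation seen on the two busy pods.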

It is still not known why the gRPC round-robin load balancing failed, but this is being investigated.

Proposal

Add gRPC client-side metrics to the model-gateway using the py-grpc-prometheus library (https://pypi.org/project/py-grpc-prometheus/).

This will not fix the problem but may help us understand why certain nodes are dropping out of the server pool.
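A minimal sketch of how the proposal might be wired up, assuming the model-gateway talks to Triton over a synchronous gRPC channel. The target address, metrics port, and function names are hypothetical. The interceptor import and the grpc.intercept_channel call follow the py-grpc-prometheus README. If the gateway uses grpc.aio, or already exposes a Prometheus endpoint, this would need adapting.

```python
"""Sketch: Prometheus client-side metrics on a gRPC channel (hypothetical wiring)."""

# Hypothetical values; the real Triton address and metrics port are not given here.
TRITON_TARGET = "dns:///triton.example.internal:8001"
METRICS_PORT = 8082


def channel_options():
    """Channel arguments that request client-side round-robin load balancing."""
    return [("grpc.lb_policy_name", "round_robin")]


def build_instrumented_channel(target=TRITON_TARGET):
    """Return a synchronous gRPC channel wrapped with the Prometheus client interceptor."""
    # Imported here so this module still loads where the instrumentation
    # dependencies (grpcio, py-grpc-prometheus) are not installed.
    import grpc
    from py_grpc_prometheus.prometheus_client_interceptor import PromClientInterceptor

    channel = grpc.insecure_channel(target, options=channel_options())
    # Every call made through the returned channel is counted by the interceptor.
    return grpc.intercept_channel(channel, PromClientInterceptor())


def expose_metrics(port=METRICS_PORT):
    """Start an HTTP endpoint that Prometheus can scrape."""
    from prometheus_client import start_http_server

    start_http_server(port)
```

Stubs created from the returned channel are instrumented without changes at the call sites. Whether these counters are granular enough to show individual backends leaving the pool would need to be confirmed once the metrics are in place.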