
Create metrics for gRPC message sizes

Problem to solve

We currently have no visibility into when the gRPC server encounters oversized messages. We theorise that long-running sessions with large contexts may produce incoming or outgoing messages that exceed the server's current threshold of 4MiB.

With the current implementation, which relies on the default handling, the server logs either StopAsyncIteration or CancelledError with no indication of what caused the error. Both are generic errors in asynchronous code and can have many different causes.

Clients receive a more helpful RESOURCE_EXHAUSTED status code in the response, with an error message explaining that the message was oversized. However, because clients can run locally, we cannot rely on these client-side responses for observability of the gRPC server.

Current implementation

Currently, gRPC logging is implemented in monitoring_interceptor.py. Its Prometheus metrics do not include message size, which makes it hard to debug cases where a message may have exceeded the maximum message size of 4MiB.

Proposal

  • Implement message size metrics for the gRPC server (including cases where messages are oversized).
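As a rough illustration of what such a metric might record, the sketch below counts message sizes into histogram-style buckets and flags messages over the 4MiB limit. It uses only the standard library. `MessageSizeStats`, the bucket boundaries, and the `direction` label are hypothetical names chosen for this example; a real implementation would presumably use a Prometheus histogram alongside the existing metrics.

```python
from bisect import bisect_left
from collections import Counter

MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # 4MiB limit described in this issue

# Hypothetical upper bounds (in bytes) for histogram-style buckets.
BUCKETS = (1024, 64 * 1024, 1024 * 1024, MAX_MESSAGE_BYTES)


class MessageSizeStats:
    """In-memory stand-in for a metrics backend (an assumption, not the real one)."""

    def __init__(self):
        self.bucket_counts = Counter()
        self.oversized = Counter()

    def observe(self, direction, size_bytes):
        # Find the first bucket whose upper bound is >= size_bytes.
        idx = bisect_left(BUCKETS, size_bytes)
        label = str(BUCKETS[idx]) if idx < len(BUCKETS) else "+Inf"
        self.bucket_counts[(direction, label)] += 1
        # Count messages above the limit separately so they are easy to alert on.
        if size_bytes > MAX_MESSAGE_BYTES:
            self.oversized[direction] += 1


stats = MessageSizeStats()
stats.observe("incoming", 512)
stats.observe("incoming", 5 * 1024 * 1024)
stats.observe("outgoing", 1024 * 1024)
```

Keeping a separate oversized counter per direction would make it possible to tell whether incoming or outgoing messages are the ones exceeding the limit.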

Further details

Interceptor limitations

In !3321 (closed), a custom interceptor was used to log message sizes and reject oversized messages. The MR was closed because the approach seemed unreliable for streaming connections (see this comment).

It seems that interceptors are not called when an oversized message is rejected, because the rejection happens at the gRPC transport layer before the message reaches the application.
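To make the limitation concrete, here is a minimal sketch of how application-level code might measure streamed message sizes by wrapping a request iterator. `measure_stream`, `FakeMessage`, and `fake_request_stream` are hypothetical names for this example; the only real API assumed is the protobuf `ByteSize()` method. Because the wrapper runs in the application, it would only ever see messages that the transport layer has already accepted, which matches the observation above.

```python
import asyncio


async def measure_stream(messages, record_size):
    """Yield each message unchanged after reporting its serialized size."""
    async for message in messages:
        # ByteSize() is the protobuf method that returns the serialized size.
        record_size(message.ByteSize())
        yield message


class FakeMessage:
    """Hypothetical stand-in for a protobuf message, used only in this demo."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def ByteSize(self) -> int:
        return len(self.payload)


async def fake_request_stream():
    # Simulates a client stream that delivers two messages.
    for payload in (b"a" * 10, b"b" * 2048):
        yield FakeMessage(payload)


async def main():
    sizes = []
    received = [m async for m in measure_stream(fake_request_stream(), sizes.append)]
    return sizes, len(received)


sizes, count = asyncio.run(main())
```

A wrapper like this can describe the size distribution of accepted messages, but it cannot count rejections, so it would not close the observability gap on its own.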

This may be why our current grpc_code metrics don't appear to show the RESOURCE_EXHAUSTED status code: https://dashboards.gitlab.net/goto/vtXJ60jNR?orgId=1

gRPC logging

gRPC supports logging levels that can print logs from the C core. These can be set via environment variables with GRPC_VERBOSITY (deprecated) and GRPC_TRACE (documentation).

This could potentially be used to log message sizes.
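A small sketch of how these variables might be applied when launching the server as a child process. The helper name `with_grpc_debug_env` is hypothetical, and the choice of the `http` tracer is an assumption: whether any tracer actually reports message sizes would need to be checked against the GRPC_TRACE documentation.

```python
import os


def with_grpc_debug_env(env, verbosity="DEBUG", trace="http"):
    """Return a copy of env with the gRPC core logging variables set."""
    updated = dict(env)
    updated["GRPC_VERBOSITY"] = verbosity
    # Assumption: "http" is the relevant tracer; verify against the gRPC docs.
    updated["GRPC_TRACE"] = trace
    return updated


# Example: build the environment for a child server process.
child_env = with_grpc_debug_env(os.environ)
```

The resulting mapping could be passed as `env=child_env` to `subprocess.Popen`, so the variables are in place before the gRPC library is loaded in the child process.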

Network monitoring

Requests to the gRPC server pass through multiple network interfaces in flight, and one of these may be able to record metrics for message size.

Links / references

  • Issue confirming the in-built gRPC message size handling of 4MiB: #1367 (closed)
  • MR implementing an interceptor for message size (closed, not merged): !3321 (closed)

Edited by Tim Morriss