fix(praefect): error and request rate
What
- Update `errorRate` to use the `gitalyHelper` error rate definition.
- Update `requestRate` to use `gitaly_service_client_requests_total`.
Why
In gitlab-com/gl-infra/production#8537 (comment 1312672957) we see that `ResourceExhausted` is not being ignored as part of the error codes that we say are OK, which ended up paging the on-call. We have some code duplication for praefect and gitaly, where we maintain two different lists of `grpc_codes` that we ignore. Using the `gitalyHelper` gives us a single source of truth from now on.
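
To illustrate the single-source-of-truth idea, here is a minimal Python sketch that builds an error-rate query for both services from one shared list of ignored codes. It is not the jsonnet used in this change: the function name, the `type` label, the `grpc_code` label name, and every code in the list other than `ResourceExhausted` are assumptions made for illustration.

```python
# Hypothetical shared list of gRPC codes that should not count as errors.
# Only ResourceExhausted is named in this description; the other entries
# are placeholders for illustration.
IGNORED_GRPC_CODES = ["OK", "NotFound", "ResourceExhausted"]


def error_rate_query(metric: str, service_type: str, window: str = "5m") -> str:
    """Build a PromQL error-rate expression that excludes the ignored codes."""
    ignored = "|".join(IGNORED_GRPC_CODES)
    # The `type` and `grpc_code` label names are assumptions.
    selector = f'{metric}{{type="{service_type}",grpc_code!~"{ignored}"}}'
    return f"sum(rate({selector}[{window}]))"


# Both services derive their query from the same list, so a code added
# once is ignored everywhere.
praefect_query = error_rate_query("gitaly_service_client_requests_total", "praefect")
gitaly_query = error_rate_query("gitaly_service_client_requests_total", "gitaly")
print(praefect_query)
print(gitaly_query)
```

With two separate lists, a code such as `ResourceExhausted` can be added to one and forgotten in the other. With one list, that kind of drift cannot happen.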

Using `gitalyHelper.gitalyGRPCErrorRate` also changes the metric from `grpc_server_handled_total` to `gitaly_service_client_requests_total`. `grpc_server_handled_total` and `gitaly_service_client_requests_total` are captured at the two ends of a request. In most cases they are the same, but some cases make them differ:
- `gitaly_service_client_requests_total`: Client metrics include network round trips and the time before a request enters the interceptor that captures server requests. This metric can be more accurate.
- `grpc_server_handled_total`: Server metrics are captured in the server interceptor, so a request that doesn't make it to the server is not accounted for, while client metrics still capture it. Also, if another gRPC interceptor sits before the metric-capturing interceptor, it can reject a request without the metric interceptor ever seeing it.
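
The difference between the two capture points can be sketched with a small Python simulation. This is a toy model, not gRPC code: the two counters stand in for the two metrics, and the limiter interceptor, the request shape, and the `Unavailable` result for an unreachable server are assumptions made for illustration.

```python
from collections import Counter

client_requests = Counter()  # stands in for gitaly_service_client_requests_total
server_handled = Counter()   # stands in for grpc_server_handled_total


def handler(request):
    """The actual RPC handler; always succeeds in this toy model."""
    return "OK"


def metrics_interceptor(request, next_handler):
    """Server-side metrics: only sees requests that reach this interceptor."""
    code = next_handler(request)
    server_handled[code] += 1
    return code


def limiter_interceptor(request, next_handler):
    """Hypothetical interceptor that sits before the metrics interceptor."""
    if request.get("over_limit"):
        return "ResourceExhausted"  # rejected before metrics are recorded
    return next_handler(request)


def server(request):
    # Interceptor order: limiter first, then metrics, then the handler.
    return limiter_interceptor(request, lambda r: metrics_interceptor(r, handler))


def client_call(request, reachable=True):
    """Client-side metrics: records every attempt, whatever the outcome."""
    code = server(request) if reachable else "Unavailable"
    client_requests[code] += 1
    return code


client_call({})                    # normal request
client_call({"over_limit": True})  # rejected by the earlier interceptor
client_call({}, reachable=False)   # never reaches the server
print(dict(client_requests))
print(dict(server_handled))
```

In this model the client counter records all three requests, while the server counter records only the one that reached the metrics interceptor.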