fix(praefect): error and request rate

Steve Xuereb requested to merge fix/praefect-gitaly-sli into master

What

  • Update errorRate to use gitalyHelper error rate definition.
  • Update requestRate to use gitaly_service_client_requests_total.

Why

In gitlab-com/gl-infra/production#8537 (comment 1312672957) we see that ResourceExhausted is not among the error codes that we treat as OK, so it counted towards the error rate and ended up paging the on-call. We also have code duplication between praefect and gitaly: we maintain two different lists of grpc_codes to ignore. Using gitalyHelper gives us a single source of truth from now on.
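To illustrate the single-source-of-truth idea, here is a minimal sketch in Python (not the jsonnet used by gitalyHelper). The contents of `IGNORED_GRPC_CODES`, the helper name `error_rate_query`, and the `job` label are hypothetical; only ResourceExhausted, the `grpc_code` label, and the metric name come from this description.

```python
# Hypothetical shared list of codes that should not count as errors.
# The real list lives in gitalyHelper and is not reproduced here.
IGNORED_GRPC_CODES = ["OK", "NotFound", "ResourceExhausted"]


def error_rate_query(metric, job, ignored=IGNORED_GRPC_CODES, window="5m"):
    """Build a PromQL error-rate expression that excludes the ignored codes."""
    # PromQL regex matchers are fully anchored, so a plain alternation is enough.
    codes = "|".join(ignored)
    return (
        f'sum(rate({metric}{{job="{job}",grpc_code!~"{codes}"}}[{window}]))'
    )


def praefect_error_rate():
    # Both services call the same helper, so the ignore list cannot drift.
    return error_rate_query("gitaly_service_client_requests_total", "praefect")


def gitaly_error_rate():
    return error_rate_query("gitaly_service_client_requests_total", "gitaly")
```

With one shared list, adding a code such as ResourceExhausted in one place changes the praefect and gitaly expressions together.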

Using gitalyHelper.gitalyGRPCErrorRate also changes the underlying metric from grpc_server_handled_total to gitaly_service_client_requests_total. The two metrics are captured at opposite ends of a request. In most cases they agree, but they can differ for the following reasons:

  1. gitaly_service_client_requests_total: Client metrics include network round trips and the time before a request enters the interceptor that records server metrics. As a result, this metric can be more accurate.
  2. grpc_server_handled_total: Server metrics are captured in a server interceptor, so a request that doesn’t make it to the server is not counted, whereas client metrics still capture it. Likewise, if another gRPC interceptor runs before the metrics interceptor and rejects a request, the metrics interceptor never sees it.
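
The second case can be sketched with a toy interceptor chain. This is a simplified Python model, not Gitaly's or Praefect's actual interceptor code: the limiter interceptor, the `over_limit` flag, and the choice of ResourceExhausted as the rejection code are assumptions made for illustration, and the two counters only stand in for the real metrics.

```python
from collections import Counter

# Stand-ins for the two Prometheus counters, keyed by gRPC code.
client_requests = Counter()  # models gitaly_service_client_requests_total
server_handled = Counter()   # models grpc_server_handled_total


def handler(request):
    """The actual RPC handler; always succeeds in this toy model."""
    return "OK"


def metrics_interceptor(request, next_handler):
    """Server-side interceptor that records the outcome of what it sees."""
    code = next_handler(request)
    server_handled[code] += 1
    return code


def limiter_interceptor(request, next_handler):
    """Hypothetical interceptor that can reject before metrics run."""
    if request.get("over_limit"):
        return "ResourceExhausted"  # the metrics interceptor is never called
    return next_handler(request)


def server(request):
    # The limiter sits in front of the metrics interceptor in the chain.
    return limiter_interceptor(
        request, lambda r: metrics_interceptor(r, handler)
    )


def client_call(request):
    """Client-side view: every call is counted, whatever the server did."""
    code = server(request)
    client_requests[code] += 1
    return code
```

Calling `client_call({})` and then `client_call({"over_limit": True})` leaves both outcomes in `client_requests`, while `server_handled` only contains the successful call, because the rejected request never reached the metrics interceptor.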
