Skip to content

Do not track gRPC NotFound code as an error in Sentry

Mikhail Mazurskiy requested to merge ash2k/ignore-not-found into master

See gitlab-com/gl-infra/production#4847 (closed).

From the Slack conversation:

Here is why I think ignoring NotFound makes sense:

  • It’s a response code, indicating to the client that they have asked for an entity that does not exist. In 99.99% cases the fault is the client’s fault, not the server’s. It’s atypical that the entity is not there because the server lost data or something like that.

  • It does make sense to track how many not found codes are being returned, but I believe it should be a metric (and there is one already!) because Sentry is for collecting events about the application with high fidelity so that they can help troubleshoot/debug the application itself. Sentry is not a mechanism to debug customer’s problems. From their main page: Sentry's application monitoring platform helps every developer diagnose, fix, and optimize the performance of their code. Metrics are cheap (cpu/ram/network) compared to Sentry events and hence are a better fit for measuring frequently occurring events.

  • It should be the client who captures the event if and only if it considers the code to be a totally unexpected result of the request. For example, kas expects NotFound code for most/all requests because the file may not be there as a result of a user error. They just put the file somewhere else or didn’t put it anywhere at all. It’s not a bug in Gitaly or kas, no need to send it to Sentry. However, if it’s Unavailable then it’s a problem with a server/loadbalancer/client’s network configuration or connection. The client should report that so that it’s visible to the operator and they can be alerted and have extra data from this event to troubleshoot. This, actually, may also be a poor fit for Sentry. With high volume metrics+sampled logs+exponential backoff+circuit breaking may be a better fit.

  • If we capture client errors on the server and then the client captures them too that would just double the amount of events (and cost) for no benefit.

  • A good practice (I couldn’t find the article where folks used this approach) is to get number of events in Sentry to zero so that if something important has happened it stands out immediately. Clearly we (GitLab) are not following this practice. There are thousands of events and I don't know if anyone looks at them. If we add NotFound to the pile too, then it’s probably just more noise. We already ignore some of the codes and I fully agree that those are the right ones to ignore.

An alternative implementation: !3581 (merged)

Edited by Mikhail Mazurskiy

Merge request reports