Categorize Sentry events by feature category

Problem

GitLab Sentry captures a massive number of events every week, 1M for the GitLab.com project alone. That makes it really hard to use:

  • There is a lot of noise. Events are not categorized or structured clearly. An engineer may face a constantly changing dashboard and irrelevant streams of events, which undermines the ability to watch for errors.
  • Of course, searching for an individual event is still feasible, although quite painful.
  • Subscribing to Sentry gets you spammed. I believe most engineers turn off the notifications, or turn them on only for an event they are interested in.

I'm afraid that the noise in Sentry is becoming normalized. Engineers will eventually ignore Sentry.

We have other monitoring systems tackling different aspects of observability. However, they can't replace an error tracking system (Sentry). Metrics captured by Prometheus target statistical information and do not carry the details of individual errors. The tracing system powered by Jaeger targets all transactions, which are captured at random by a sampling strategy. Kibana comes closest to providing some of the functionality of an error tracking system. However,

  • Sentry provides more information in the application layer, especially error stack trace.
  • Sentry events are likely to indicate real errors, once the noise is removed.
  • Kibana's retention period is 7 days. Error tracking, diagnosis, fixing, and regression prevention require a longer retention period.

In short, Kibana serves a wider range of information and is good for generic analytics, while Sentry is specialized in error tracking. We should spend time reducing the noise and make Sentry great again™.

Feature category approach

The root cause of the noise is the volume of uncategorized data sent to Sentry. It is simply too large for any individual to keep track of. The big question is how to break this huge number of Sentry events into manageable chunks. One approach is to categorize the Sentry data by feature category. Each feature category belongs to a stage group, defined carefully in the stages data. As each stage group handles a portion of the product, a Sentry event that occurs anywhere in the stack and affects a user transaction should be on the radar of the corresponding stage group. A group can always reassign an issue to the group that originally caused the error.
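As a rough illustration of the breakdown described above, the sketch below rolls events up from feature category to stage group. The event shape (a `tags` dictionary), the `CATEGORY_TO_GROUP` table, and the category and group names in it are assumptions made for the example; the real mapping lives in the stages data.

```python
from collections import Counter

# Hypothetical excerpt of the stages data: feature category -> stage group.
# The names below are illustrative only.
CATEGORY_TO_GROUP = {
    "source_code_management": "source_code",
    "code_review": "code_review",
    "continuous_integration": "pipeline_execution",
}


def group_for(event):
    """Return the stage group that should watch this event."""
    category = event.get("tags", {}).get("feature_category")
    # Events without a known category fall into a catch-all bucket.
    return CATEGORY_TO_GROUP.get(category, "uncategorized")


def events_by_group(events):
    """Count events per stage group."""
    return Counter(group_for(event) for event in events)
```

Each group then only needs to look at its own bucket, and the catch-all bucket shows how much data still lacks a category.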

This approach yields some potential benefits:

  • Each stage group can have its own Sentry dashboard (just like Stage Group dashboards). The dashboards include the Sentry events and issues across all projects, not tied to a particular stack. It means that in the future, when the feature category context is propagated to all upstream and downstream dependencies (for example), all the errors can be watched in one unified dashboard.

Screenshot_from_2021-01-18_15-55-33

  • Better event notifications. Notifications can now be triggered from Sentry metrics aggregated per feature/group instead of from individual events. The best part is that each stage/group can have its own alerting rules, depending on the volume and severity of the events.

Screenshot_from_2021-01-18_15-54-24

  • Accurate auto-assignment. Previously, Sentry supported auto-assignment by transaction path, which is exceptionally hard to maintain at our scale. Recent versions of Sentry support auto-assignment by tag. This unlocks great potential for the development department.

Screenshot_from_2021-01-18_15-57-41

Data analytics

I wrote a script to pull and analyze data from the Sentry API. The data set is the 100_000 latest events; Sentry pagination returns 100 events per call, so crawling took 4 hours. For each event, I looked for an existing category, or analyzed the transaction (Web, Sidekiq, API, etc.) and translated it back to a feature category. Here are some findings:

  • Nearly all the events in Sentry can be categorized by feature category. Only 1.2% of the data is uncategorizable, and about half of that is mailers. I could spend more time classifying them further, but the data size is small enough.
  • 3.39% of events are marked as not_owned.
  • The distributions of events and issues are reasonable. No single group dominates the events.
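The classification step can be sketched roughly as follows. This is not the original script: the event shape, the `Sidekiq/` transaction prefix, and the two lookup tables are assumptions for illustration. For scale, 100_000 events at 100 per call is 1,000 calls, so a 4-hour crawl works out to roughly 14 seconds per call.

```python
import math

# Hypothetical lookup tables; in practice these would be derived from the
# feature_category declarations on controllers, API endpoints, and workers.
CONTROLLER_CATEGORIES = {"Projects::IssuesController#show": "issue_tracking"}
WORKER_CATEGORIES = {"PostReceive": "source_code_management"}

SIDEKIQ_PREFIX = "Sidekiq/"  # assumed naming of Sidekiq transactions


def classify(event):
    """Return the feature category of an event, or None if unknown."""
    existing = event.get("tags", {}).get("feature_category")
    if existing:
        return existing  # the event already carries a category
    transaction = event.get("transaction", "")
    if transaction.startswith(SIDEKIQ_PREFIX):
        return WORKER_CATEGORIES.get(transaction[len(SIDEKIQ_PREFIX):])
    return CONTROLLER_CATEGORIES.get(transaction)


def uncategorized_share(events):
    """Fraction of events that could not be mapped to a category."""
    if not events:
        return 0.0
    missing = sum(1 for event in events if classify(event) is None)
    return missing / len(events)


def pages_needed(total_events, per_page):
    """Number of paginated API calls needed to fetch total_events."""
    return math.ceil(total_events / per_page)
```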

events_by_group

issues_by_group

issues_by_category

events_by_category

Proposal

Thankfully, nearly all user-facing transactions handled by the application layer already have feature_category data attached.

  • Analyze Sentry data to see whether this issue is worth doing.
  • Add feature_category as a tag in Gitlab::ErrorTracking. This tag can be fetched from Labkit::Context.current.
    • For Web/API/GraphQL transactions, the feature_category in the current context is set in the callbacks (here and here)
    • For Sidekiq transactions, the feature category is set in a middleware
  • Upgrade GitLab Sentry to version 10.
  • Enable Sentry Discover feature
  • Enable Sentry metric alert feature
  • Add a script in the runbook to maintain the Saved Queries for each stage group
  • Tackle the dependencies (Gitaly, Workhorse, ...)
  • Documentation
  • Broadcast to the development departments
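
The tagging step in the proposal targets Ruby code (Gitlab::ErrorTracking reading from Labkit::Context.current). The sketch below only illustrates the logic, using Python: the context key `meta.feature_category` and the function name `error_tracking_tags` are assumptions, not the actual implementation.

```python
# Assumed name of the key under which the context stores the category.
FEATURE_CATEGORY_KEY = "meta.feature_category"


def error_tracking_tags(context, extra_tags=None):
    """Build the tags to attach to an error event from the current context."""
    tags = dict(extra_tags or {})  # copy so the caller's dict is not mutated
    category = context.get(FEATURE_CATEGORY_KEY)
    if category:
        # Only tag when a category is known; untagged events stay searchable.
        tags["feature_category"] = category
    return tags
```

Once every event carries this tag, dashboards, metric alerts, and auto-assignment rules can all filter on it.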
Edited by Quang-Minh Nguyen (Ex-GitLab)