Use confidence interval / margin-of-error to reduce low RPS SLI alerts

Related to https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4774

Currently, the Metrics-Catalog uses a low-SLI threshold to reduce noise from low-traffic endpoints.

While this has worked relatively well for GitLab.com, it doesn't work very well for single-tenant environments, such as GitLab Dedicated.

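One reason is that single-tenant environments see low-RPS periods far more often than GitLab.com, which has relatively high traffic volume even during off-peak hours. At low request rates, a handful of failed requests can push the measured success ratio below the SLO and trigger an alert, even though the sample is too small to support that conclusion.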

Sources

  1. The primary inspiration for an alternative approach is Dylan Zehr's talk at SLOconf 2021, "Using Binomial proportion confidence intervals to reduce false positives": https://www.youtube.com/watch?v=R4nCsgt1qEU

    1. Slides:
      1. screenshot-andrewn-2024-05-13T14h57Z_2x
      2. screenshot-andrewn-2024-05-13T14h55Z_2x
  2. A similar approach, using Wilson score intervals, is used by Reddit; see "How Reddit ranking algorithms work". As a side note, Randall Munroe, of xkcd fame, wrote a blog post describing Reddit's confidence sort: https://redditblog.blogspot.com/2009/10/reddits-new-comment-sorting-system.html

  3. How Not To Sort By Average Rating, by Evan Miller: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html

  4. Ranking Ratings: http://wordpress.mrreid.org/2014/05/20/ranking-ratings/
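
To make the idea in these sources concrete, here is a minimal Python sketch of a Wilson score interval applied to an SLI success ratio. It is an illustration only, not the Metrics-Catalog implementation: the function names `wilson_interval` and `should_alert`, the 95% confidence level (`z = 1.96`), and the rule "alert only when the upper bound is below the SLO" are all assumptions made for this example.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if total == 0:
        # No traffic at all: the true success ratio could be anything.
        return (0.0, 1.0)
    p = successes / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return (max(0.0, centre - half_width), min(1.0, centre + half_width))


def should_alert(successes: int, total: int, slo: float, z: float = 1.96) -> bool:
    """Hypothetical alert rule: fire only if even the optimistic (upper)
    bound of the interval is below the SLO."""
    _, upper = wilson_interval(successes, total, z)
    return upper < slo


if __name__ == "__main__":
    slo = 0.95
    # Same 90% observed success ratio, very different sample sizes.
    for successes, total in [(9, 10), (9000, 10000)]:
        low, high = wilson_interval(successes, total)
        print(f"{successes}/{total}: [{low:.3f}, {high:.3f}] alert={should_alert(successes, total, slo)}")
```

With 9 successes out of 10 requests the interval is wide (approximately 0.60 to 0.98), so a 95% SLO is still plausible and this rule stays quiet. With 9,000 out of 10,000 the interval tightens around 0.90 and the rule fires. The observed ratio is the same in both cases; only the amount of evidence differs.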

Edited by Andrew Newdigate