Use confidence interval / margin-of-error to reduce low RPS SLI alerts
Related to https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4774
Currently, the Metrics-Catalog uses a low-SLI threshold to reduce noise from low-traffic endpoints.
While this has worked relatively well for GitLab.com, it works poorly for single-tenant environments such as GitLab Dedicated.
One reason is that single-tenant environments encounter low-RPS situations far more often than GitLab.com, which has a relatively high traffic volume even in off-peak hours.
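To illustrate the confidence-interval idea named in the title, here is a minimal sketch using the Wilson score interval, one kind of binomial proportion confidence interval. It is not the Metrics-Catalog implementation: the function names `wilson_interval` and `should_alert`, the 5% error threshold, and the 95% confidence level (`z = 1.96`) are all assumptions chosen for illustration. The sketch alerts only when the lower bound of the interval for the error ratio exceeds the threshold, so a handful of errors on a low-traffic endpoint does not fire on its own.

```python
import math
from typing import Tuple


def wilson_interval(errors: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the true error ratio, given observed counts.

    z is the normal quantile for the chosen confidence level
    (1.96 corresponds to roughly 95%; an assumption for this sketch).
    """
    if total == 0:
        # No requests observed: the error ratio is unconstrained.
        return (0.0, 1.0)
    p = errors / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(
        p * (1 - p) / total + z * z / (4 * total * total)
    )
    return (max(0.0, centre - margin), min(1.0, centre + margin))


def should_alert(errors: int, total: int, error_threshold: float, z: float = 1.96) -> bool:
    """Hypothetical alert rule: fire only if even the lower bound of the
    interval is above the allowed error ratio."""
    lower, _ = wilson_interval(errors, total, z)
    return lower > error_threshold


# Same observed error ratio (10%), very different sample sizes.
print(should_alert(1, 10, 0.05))        # low traffic: lower bound is below 5%
print(should_alert(1000, 10000, 0.05))  # high traffic: lower bound is above 5%
```

With 1 error in 10 requests the interval is wide (its lower bound is about 1.8%), so the hypothetical rule stays quiet; with 1,000 errors in 10,000 requests the lower bound is about 9.4% and the rule fires. A naive ratio check would treat both cases identically.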
Sources
- The primary inspiration for an alternative approach is this talk by Dylan Zehr at SLOConf 2021: "SLOconf 2021: Using Binomial proportion confidence intervals to reduce false positives": https://www.youtube.com/watch?v=R4nCsgt1qEU
- Reddit uses a similar approach based on Wilson score intervals: How Reddit ranking algorithms work. Side note: Randall Munroe, of xkcd fame, wrote a blog post describing the Reddit confidence sort: https://redditblog.blogspot.com/2009/10/reddits-new-comment-sorting-system.html
- How Not To Sort By Average Rating, by Evan Miller: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
- Ranking Ratings: http://wordpress.mrreid.org/2014/05/20/ranking-ratings/