
feat: calculate Wilson Score Intervals for SLIs to handle low RPS SLIs

Andrew Newdigate requested to merge wilson-confidence-intervals into master

Part of #153 and https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4774

👉 Required for !7348 (closed)

👉 Required for !7349 (merged)


Adds support for Wilson Score Intervals in SLIs, as an alternative approach to handling low-RPS SLIs.

Preparation Reading/Watching

  1. The primary inspiration for an alternative approach is this talk by Dylan Zehr at SLOConf 2021: "SLOconf 2021: Using Binomial proportion confidence intervals to reduce false positives": https://www.youtube.com/watch?v=R4nCsgt1qEU (6 minutes)
  2. A similar approach, of using Wilson Score Intervals, is used by Reddit: How Reddit ranking algorithms work. Side note, Randall Munroe, of xkcd fame, wrote a blog post describing the Reddit confidence sort: https://redditblog.blogspot.com/2009/10/reddits-new-comment-sorting-system.html
  3. How Not To Sort By Average Rating, by Evan Miller: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
  4. Ranking Ratings: http://wordpress.mrreid.org/2014/05/20/ranking-ratings/
  5. Javascript implementation of Wilson Score Intervals (good explanation in README.md): https://github.com/msn0/wilson-score-interval
  6. Wikipedia Entry for Wilson Score Intervals https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval

What's in this change?

This change adds confidence interval recording rules for Apdex and Error Rates.

These confidence intervals, calculated as Wilson Score Intervals, are implemented as a pair of recording rules:

gitlab_component_apdex:confidence:ratio_1h{confidence="98%"}
gitlab_component_errors:confidence:ratio_1h{confidence="98%"}

(plus additional recording rules for other burn rate intervals: 5m, 30m, 6h).

Control of confidence intervals is through AggregationSet configuration

Controlling which burn rates / aggregation sets get the new recording rules is done through the aggregation set configuration:

    metricFormats: {
      // Confidence Interval Ratios
      apdexConfidenceRatio: 'gitlab_component_apdex:confidence:ratio_%s',
      errorConfidenceRatio: 'gitlab_component_errors:confidence:ratio_%s',
    }

If this configuration does not exist, then the confidence recording rules will not be added to the aggregation set.

The confidence interval is generated using a Wilson Score, which is an implementation of a Binomial proportion confidence interval. You can read about this on Wikipedia, here: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Wilson_score_interval

The pseudocode for calculating the Wilson Score is as follows:

// phatRatio is the observed ratio, between 0 and 1
// total is the number of samples
// z is a z-score that reflects the confidence
// level to be calculated
function wilsonScoreInterval(phatRatio, total, z) {
    // with no samples, there is no information: return a degenerate interval
    if (total === 0) {
        return {
            lower: 0,
            upper: 0
        };
    }

    const a = phatRatio + z * z / (2 * total);
    const b = z * Math.sqrt((phatRatio * (1 - phatRatio) + z * z / (4 * total)) / total);
    const c = 1 + z * z / total;

    return {
        lower: (a - b) / c,
        upper: (a + b) / c
    };
}

In other words, the Wilson Score is a function of the observed ratio (phatRatio), the number of samples, and the desired confidence level (via the z-score).

Adapting Wilson Score values for SLIs

The lower the number of samples, the wider the confidence interval. Likewise, the higher the confidence level, the wider the interval.

For example, at 80% confidence we'll have a narrower interval than at 99% confidence.

Likewise, for 100,000 samples we'll have a narrower interval than for 100 samples.
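These effects can be checked numerically. Here's a self-contained sketch using the same Wilson Score calculation as the pseudocode above (z ≈ 2.054 is an illustrative z-score for the 98% confidence level used in the recording rules):

```javascript
// Wilson Score Interval, same calculation as the pseudocode above
function wilsonScoreInterval(phatRatio, total, z) {
  if (total === 0) return { lower: 0, upper: 0 };
  const a = phatRatio + (z * z) / (2 * total);
  const b = z * Math.sqrt((phatRatio * (1 - phatRatio) + (z * z) / (4 * total)) / total);
  const c = 1 + (z * z) / total;
  return { lower: (a - b) / c, upper: (a + b) / c };
}

const z98 = 2.054; // illustrative z-score for a 98% confidence level

// Same observed ratio (0.9), very different sample counts:
const few = wilsonScoreInterval(0.9, 10, z98);
const many = wilsonScoreInterval(0.9, 10000, z98);

// With few samples, the interval is much wider
console.log(few);  // roughly { lower: 0.58, upper: 0.98 }
console.log(many); // roughly { lower: 0.894, upper: 0.906 }
```

With only 10 samples, the true ratio could plausibly be anywhere from ~0.58 to ~0.98; with 10,000 samples it is pinned close to the observed 0.9.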

In order to calculate the number of samples, we use the RPS for the SLI and multiply it by the window duration in seconds.

For example, for a 0.1RPS over an hour (3600 seconds), we know that there have been 0.1 * 3600 = 360 samples.

This is plugged into the Wilson Score as the total.

The phatRatio is either the apdex ratio or the error ratio.

Interval Boundaries

The Wilson Score interval function produces two values: an upper and lower boundary. However, for evaluating the confidence interval against the SLO, we will only use the upper value for Apdex and the lower value for Error Ratio.

This is because Apdex is a success/total SLI, whereas Error Ratio is an error/total value. Since we're using the optimistic confidence boundary, we need to use the lower one for errors and the higher one for Apdex.

Since there is a cost to calculating these intervals as Prometheus Recording Rules, we skip the unused boundaries, calculating one per SLI, not both.
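Putting the pieces together, here's a hedged sketch of how an adjusted SLI value could be derived from an observed ratio, the RPS, and the window duration. The helper names (`adjustedApdex`, `adjustedErrorRatio`) are illustrative, not the names used in the actual recording rules:

```javascript
// Wilson Score Interval, same calculation as the pseudocode above
function wilsonScoreInterval(phatRatio, total, z) {
  if (total === 0) return { lower: 0, upper: 0 };
  const a = phatRatio + (z * z) / (2 * total);
  const b = z * Math.sqrt((phatRatio * (1 - phatRatio) + (z * z) / (4 * total)) / total);
  const c = 1 + (z * z) / total;
  return { lower: (a - b) / c, upper: (a + b) / c };
}

const z98 = 2.054; // illustrative z-score for a 98% confidence level

// total samples = RPS * window duration in seconds
function sampleCount(rps, windowSeconds) {
  return rps * windowSeconds;
}

// Apdex is a success/total ratio: the optimistic boundary is the upper one
function adjustedApdex(apdexRatio, rps, windowSeconds) {
  return wilsonScoreInterval(apdexRatio, sampleCount(rps, windowSeconds), z98).upper;
}

// Error Ratio is an error/total ratio: the optimistic boundary is the lower one
function adjustedErrorRatio(errorRatio, rps, windowSeconds) {
  return wilsonScoreInterval(errorRatio, sampleCount(rps, windowSeconds), z98).lower;
}

// 0.1 RPS over a 1h window => 0.1 * 3600 = 360 samples
console.log(adjustedApdex(0.9, 0.1, 3600));      // roughly 0.928
console.log(adjustedErrorRatio(0.1, 0.1, 3600)); // roughly 0.072
```

Note how the adjustment is optimistic in both directions: the adjusted apdex (~0.928) is above the observed 0.9, and the adjusted error ratio (~0.072) is below the observed 0.1.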

How RPS affects the confidence level for different error rates

First off, a thought experiment

Imagine you had no way of knowing whether your server application was functioning correctly or broken.

Instead, you could contact each user who had recently interacted with your server and ask them to vote with a 👍 or 👎 to rate their interaction with your server.

If you asked 10000 users, and 9990 of them gave a 👍, you would be fairly confident that everything is working as expected.

But if you asked 10 users, and 9 of them gave a 👍, would you be as confident?

If you understand statistics and chance, you would probably understand (possibly intuitively) that asking 10000 users is better than asking 10 users. It might just be the case that all 10 users happened to perform really simple operations which didn't test the server very thoroughly.

The more votes you get, the more confident you can be that the result reflects the internal state of the server. Likewise, the fewer votes you have, the higher the likelihood that the poll results change as you ask more people.

The same is true for elections: a large swing is more likely when polling opens, and pundits become more confident in the result as the number of votes counted increases.

The same is also true for online reviews: most people would trust 8000 👍 for a product out of 10000 reviews more than they would trust 9 👍 out of 10 reviews (even though the latter ratio is higher).

Going back to the server, each SLI result could be considered to be a vote on the current health of a server, but during low RPS periods, there aren't a lot of samples to count.

For this reason, we should treat the signal that we receive from an SLI as a confidence boundary: the more samples, the more confidence in the signal, the narrower the margin-of-error.

[Screenshot: illustration of the margin-of-error narrowing as the sample count increases]

In statistics, there's a method of evaluating these confidence intervals: Binomial proportion confidence intervals. This change relies on that approach.

Back to regular programming...

These graphs attempt to convey the effect that low RPS will have on the confidence interval for an SLI.

It plots the adjusted 98% confidence interval for a set of error ratios, over various RPS values.

As the plot shows, the lower the RPS, the lower the adjusted error ratio. As traffic builds up, the adjusted value tends towards the actual value.

Calculated values as per https://docs.google.com/spreadsheets/d/16r_fO71VxaY3RhBWWcZ0KCoQTRTxZ_IhBAnZ92pQYa0/edit#gid=1595720895

[Screenshot: plot of adjusted 98% confidence error ratios at various RPS values]

How RPS affects the confidence level for different apdex values

For Apdex, a similar, but reversed trend is seen. For low RPS, the Apdex remains near 100%. As the RPS increases, the confidence value tends towards the actual apdex signal.

Calculated values as per https://docs.google.com/spreadsheets/d/16r_fO71VxaY3RhBWWcZ0KCoQTRTxZ_IhBAnZ92pQYa0/edit#gid=1595720895

[Screenshot: plot of adjusted 98% confidence apdex values at various RPS values]

Next steps

This is a first step and these confidence intervals have not yet been plugged into alerting.

In follow-on MRs, we'll visualize the confidence interval and start using it in alerting.

Edited by Craig Miskell
