Create new apdex method to reach four nines accuracy.
Problem
Prometheus exporters output metrics across the entire GitLab fleet, scraped at regular but not consistent intervals.
We calculate apdex with two separate recording rules: one stores the result of a query for the apdex numerator (the number of requests considered successful), and the other stores the result of a query for the apdex denominator (the total number of requests). We then divide the numerator by the denominator to get the apdex.
This works acceptably well for low-volume services, because we can assume that the numerator and denominator queries return similar results. For high-volume services, however, it's almost guaranteed that the numerator and denominator will be evaluated at different enough times that the number of successful events is sometimes higher than the total number of events, and sometimes arbitrarily low.
This leads to availability numbers that are not as accurate as we need them to be, occasionally skewing too high (over 100% availability) and often skewing too low (lower availability than we are actually getting).
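To illustrate the skew, here is a minimal Python sketch. The traffic shape (1000 req/s doubling to 2000 req/s), the 60-second window, and the one-second evaluation skew are all made-up assumptions for illustration, not measurements from our fleet. The true success ratio is 99.9% throughout; only the evaluation times differ.

```python
def window_sum(per_second, end, width):
    """Sum per-second counts over the half-open window [end - width, end)."""
    return sum(per_second[end - width:end])


# Hypothetical traffic: 1000 req/s for 60 s, then 2000 req/s for 60 s.
total_per_s = [1000] * 60 + [2000] * 60
# Exactly 99.9% of requests succeed in every second.
success_per_s = [n * 999 // 1000 for n in total_per_s]

WIDTH = 60  # window width in seconds (assumed)

# Numerator and denominator evaluated over the same window.
consistent = window_sum(success_per_s, 90, WIDTH) / window_sum(total_per_s, 90, WIDTH)

# Numerator evaluated one second later than the denominator: ratio above 1.0.
numerator_late = window_sum(success_per_s, 91, WIDTH) / window_sum(total_per_s, 90, WIDTH)

# Numerator evaluated one second earlier than the denominator: ratio too low.
numerator_early = window_sum(success_per_s, 89, WIDTH) / window_sum(total_per_s, 90, WIDTH)

print(f"consistent:      {consistent:.4%}")
print(f"numerator late:  {numerator_late:.4%}")
print(f"numerator early: {numerator_early:.4%}")
```

With traffic that is changing, even a one-second mismatch between the two windows moves the ratio by about a percentage point in either direction, which matches the over-100% and too-low readings described above.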
Here is the service availability for Gitaly that we're calculating for the past two days.
Our highest availability reported was 100.41% at 21:40:11 on August 3rd, and our lowest was 99.63% at 4:57:49 on August 4th.
Here's the same time period but calculated in a single query (both numerator and denominator gathered from raw metrics).
We never go above 100% availability, and our lowest availability was 99.39% at 15:50:00 on August 3rd. If we look more closely at the recording rules for the time period of lowest availability, we see that they report 99.35% as our lowest availability during a similar time period.
This means that our availability metrics are currently only accurate to a total of 3 digits (99.39% and 99.35% agree only as far as 99.3), but we're measuring to four digits.
We see a similar spread for the higher cardinality metrics like web as well.
Apdex calculated from raw metrics has a high of 99.874% and a low of 99.819%.
Current calculated apdex has a high of 99.880% and a low of 99.817%.
This makes web generally accurate to 3 digits as well (99.874% vs 99.880% agree only as far as 99.8).
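The digit counts above can be checked with a small Python sketch. The helper `matching_digits` is a hypothetical name introduced here for illustration; it simply counts how many leading digits two percentages share when printed to the same number of decimal places.

```python
def matching_digits(a: float, b: float, decimals: int) -> int:
    """Count leading digits shared by a and b when printed with `decimals` places."""
    text_a = f"{a:.{decimals}f}"
    text_b = f"{b:.{decimals}f}"
    count = 0
    for char_a, char_b in zip(text_a, text_b):
        if char_a != char_b:
            break  # first disagreement ends the shared prefix
        if char_a.isdigit():
            count += 1  # the decimal point is skipped, only digits count
    return count


# Gitaly lows: raw metrics vs recording rules.
print(matching_digits(99.39, 99.35, 2))
# Web highs and lows: raw metrics vs recording rules.
print(matching_digits(99.874, 99.880, 3))
print(matching_digits(99.819, 99.817, 3))
```

The Gitaly lows and the web highs agree to three digits; the web lows happen to agree to four, which is why web is described as only generally accurate to three.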
Solution
After a large number of experiments, we have found a method that gets us to four to five nines' worth of accuracy.
Requirements:
- Recording from source metrics without intermediate recording rules.
- Adding a brief offset to the queries to make sure that we are querying a consistent time period that has already been scraped across all Prometheus hosts.
- Creating an all-or-nothing apdex query that computes the numerator and denominator together, so that both parts must succeed before we calculate a value and the result is always based on accurate data.
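The requirements above can be sketched in Python as follows. This is an illustration of the idea only, not the recording rules themselves: `apdex_at`, the sample dictionaries, and the 10-second offset are hypothetical. In PromQL the corresponding pieces would be the `offset` modifier and a single division expression, which only yields a result where both sides have a matching sample.

```python
from typing import Mapping, Optional


def apdex_at(
    success: Mapping[int, float],
    total: Mapping[int, float],
    now: int,
    offset: int,
) -> Optional[float]:
    """All-or-nothing apdex for the instant `now - offset`.

    Both values are read at the same shifted timestamp. If either one is
    missing (or the total is zero), no value is produced at all rather
    than a ratio built from mismatched data.
    """
    instant = now - offset
    good = success.get(instant)
    everything = total.get(instant)
    if good is None or everything is None or everything == 0:
        return None
    return good / everything


# Hypothetical samples keyed by timestamp in seconds. The success series
# has not yet been scraped at t=120, but the total series has.
success_samples = {100: 999.0, 110: 1998.0}
total_samples = {100: 1000.0, 110: 2000.0, 120: 2100.0}

# Without an offset, the newest instant is incomplete, so nothing is recorded.
print(apdex_at(success_samples, total_samples, now=120, offset=0))

# With a brief offset, both series are present for the same instant.
print(apdex_at(success_samples, total_samples, now=120, offset=10))
```

Returning nothing for an incomplete instant leaves a gap instead of a skewed value, and the offset keeps such gaps rare by looking only at instants that should already be fully scraped.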
Example of the rules:
Previous investigation issues
#2319 (closed) discusses what I'm calling the 'numerator/denominator' problem, where gathering the numerator and denominator at different times can occasionally result in more than 100% availability, and definitely results in less accurate apdex numbers than we need.
#2303 (closed) discusses delays caused by multiple different intermediate recording rules.
#2341 (closed) briefly touches on a few changes that were made (min vs avg), but these are mostly related to recording rule gaps and are outside the scope of this issue.