perf(registry): adjust SLI for registry manifest routes
Closes #16781
🌱 Context
Recent improvements to the GitLab container registry have made it respond to requests and carry out certain operations faster, resulting in a near-perfect Apdex score for its SLIs and lower error budget spend. With this MR we are tightening the Apdex thresholds to better reflect the current (much faster) state of the registry manifest read and write routes.
🔮 Approach
The new thresholds were selected by assessing the current (incident-free) trends in manifest read and manifest write operations over the last week (i.e. the week ending on the 15th of November) and noting that:
- For read operations, 99.7% of manifest read requests completed in under 0.1s, with the caveat of a spike that went above 0.1s and reached up to 0.18s for 2 hours during the week
- For write operations, 99.7% of manifest write requests completed in under 1s, with the caveat of a spike that went above 1s and reached up to 2.8s within a 2-hour span during the week
Because of this we've chosen the new thresholds as follows:
server_route_manifest_reads:
satisfiedThreshold: 0.1,
toleratedThreshold: 0.25,
which happens to be 5x lower than the previous threshold
server_route_manifest_writes:
satisfiedThreshold: 1,
toleratedThreshold: 2.5,
which happens to be 10x lower than the previous threshold
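For reviewers less familiar with how the two thresholds feed into the score, here is a minimal sketch of the standard Apdex calculation: requests at or below satisfiedThreshold count fully, requests between the two thresholds count half, and anything slower counts as zero. The `apdex` helper and the sample durations are hypothetical and for illustration only; the real SLI is computed from Prometheus histogram buckets, not from raw samples.

```python
def apdex(durations, satisfied, tolerated):
    """Standard Apdex: (satisfied + tolerating / 2) / total."""
    if not durations:
        raise ValueError("at least one sample is required")
    n_satisfied = sum(1 for d in durations if d <= satisfied)
    n_tolerating = sum(1 for d in durations if satisfied < d <= tolerated)
    return (n_satisfied + n_tolerating / 2) / len(durations)


# Hypothetical manifest read durations in seconds (not real registry data).
samples = [0.02, 0.05, 0.09, 0.18, 0.30]

# Thresholds proposed for server_route_manifest_reads in this MR.
reads_score = apdex(samples, satisfied=0.1, tolerated=0.25)
print(f"reads apdex: {reads_score:.2f}")
```

With these made-up samples, three requests are satisfied, one (0.18s) is tolerated, and one (0.30s) is neither, so tightening the thresholds makes slow outliers visible in the score instead of being absorbed by a generous limit.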
The newly proposed values are also already available among the current bucket boundaries for registry_http_request_duration_seconds_bucket
(i.e. {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 25, 60}).
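The alignment matters because a Prometheus histogram only exposes counts at its `le` bucket boundaries, so a threshold that falls between two boundaries cannot be measured exactly. A small sketch of that sanity check, using the values from this description; the `misaligned` helper is hypothetical and not part of the runbooks tooling:

```python
# Bucket boundaries listed for registry_http_request_duration_seconds_bucket.
BUCKETS = (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 60)

# Thresholds proposed in this MR.
THRESHOLDS = {
    "server_route_manifest_reads": {
        "satisfiedThreshold": 0.1,
        "toleratedThreshold": 0.25,
    },
    "server_route_manifest_writes": {
        "satisfiedThreshold": 1,
        "toleratedThreshold": 2.5,
    },
}


def misaligned(thresholds, buckets):
    """Return (route, name, value) for every threshold that is not a bucket boundary."""
    bucket_set = set(buckets)
    return [
        (route, name, value)
        for route, config in thresholds.items()
        for name, value in config.items()
        if value not in bucket_set
    ]


problems = misaligned(THRESHOLDS, BUCKETS)
print(problems or "all thresholds match a bucket boundary")
```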