
perf(registry): adjust SLI for registry manifest routes

Closes #16781

Context 🌱

Recent improvements to the GitLab container registry have made it respond to requests and carry out certain operations faster, resulting in a near-perfect Apdex score for its SLIs and a lower error budget spend. With this MR we are tightening the Apdex thresholds to better reflect the current (and much faster) state of the registry manifest read and write routes.

Approach 🔮

The approach used to select the new thresholds is the result of assessing the current (incident-free) trends in manifest read operations and manifest write operations over the last week (i.e. the week ending on the 15th of November) and noting that:

  • For read operations, 99.7% of manifest read requests fell below 0.1s, with the caveat of a spike that went above 0.1s and reached up to 0.18s for 2hrs during the week
  • For write operations, 99.7% of manifest write requests fell below 1s, with the caveat of a spike that went above 1s and reached up to 2.8s within a span of 2hrs during the week

Because of this we've chosen the new thresholds as follows:

server_route_manifest_reads:
    satisfiedThreshold: 0.1,
    toleratedThreshold: 0.25,

which is 5x lower than the previous threshold

server_route_manifest_writes:
    satisfiedThreshold: 1,
    toleratedThreshold: 2.5,

which is 10x lower than the previous threshold
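For readers less familiar with how the two thresholds feed into the score, here is a minimal sketch of the standard Apdex formula. It is not taken from the runbooks and the actual SLI implementation may differ. The function name `apdex` and the request counts are hypothetical, chosen only for illustration.

```python
def apdex(satisfied: int, tolerated: int, total: int) -> float:
    """Standard Apdex score.

    satisfied: requests at or below satisfiedThreshold (count fully).
    tolerated: requests above satisfiedThreshold but at or below
               toleratedThreshold (count half).
    total:     all requests, including those above toleratedThreshold.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return (satisfied + tolerated / 2) / total


# Hypothetical counts (not measured data): 997 satisfied, 2 tolerated,
# 1 slower than toleratedThreshold.
score = apdex(satisfied=997, tolerated=2, total=1000)
```

Under this formula, tightening `satisfiedThreshold` moves borderline requests from the "satisfied" to the "tolerated" group, where they only count half, so the score becomes more sensitive to latency regressions.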

The newly proposed values above are also already available in the current bucket choices for registry_http_request_duration_seconds_bucket (i.e. {.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10, 25, 60})
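The bucket alignment matters because a histogram can only report counts at its bucket boundaries. A quick sanity check along these lines could confirm that each proposed threshold matches a boundary. This is an illustrative sketch only: the helper `misaligned_thresholds` and the dictionary layout are assumptions, while the bucket list and threshold values are copied from this description.

```python
# Bucket boundaries (seconds) as listed for
# registry_http_request_duration_seconds_bucket in this MR.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 60]

# Proposed thresholds from this MR; the dict layout is illustrative.
PROPOSED = {
    "server_route_manifest_reads": {
        "satisfiedThreshold": 0.1,
        "toleratedThreshold": 0.25,
    },
    "server_route_manifest_writes": {
        "satisfiedThreshold": 1,
        "toleratedThreshold": 2.5,
    },
}


def misaligned_thresholds(thresholds, buckets):
    """Return (sli, key, value) for every threshold that is not a bucket boundary."""
    bucket_set = set(buckets)
    return [
        (sli, key, value)
        for sli, cfg in thresholds.items()
        for key, value in cfg.items()
        if value not in bucket_set
    ]


# An empty list means every threshold lines up with a bucket boundary.
problems = misaligned_thresholds(PROPOSED, BUCKETS)
```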

Edited by Suleimi Ahmed
