
Define request apdex counters instead

Bob Van Landuyt requested to merge bvl-apdex-sli-counters into master

What does this MR do?

With this, we'll emit 2 new counters from web processes that can be used to monitor apdex.

The gitlab_sli:rails_request_apdex:total counter is incremented for every successful request (one that did not result in a 500) that is not to a health endpoint.

The gitlab_sli:rails_request_apdex:success_total counter is incremented when the request took less than 1 second. We intend to customize this threshold per endpoint in the future.

Both these counters are labelled with feature_category and endpoint_id from the context.
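As a rough illustration of the rules above, here is a minimal in-memory sketch. It is not the code in this MR: `SketchCounter`, `record_request`, and its keyword arguments are hypothetical names, and treating every 5xx status as a failure is an assumption (the description only mentions 500).

```ruby
# Hypothetical in-memory stand-in for a labelled Prometheus counter.
class SketchCounter
  def initialize
    @values = Hash.new(0)
  end

  def increment(labels)
    @values[labels] += 1
  end

  def get(labels)
    @values[labels]
  end
end

# 1 second, as stated in the description; planned to become per-endpoint.
APDEX_THRESHOLD_SECONDS = 1

TOTAL = SketchCounter.new   # stands in for gitlab_sli:rails_request_apdex:total
SUCCESS = SketchCounter.new # stands in for gitlab_sli:rails_request_apdex:success_total

# All keyword arguments are assumed inputs for illustration only.
def record_request(status:, duration_s:, health_endpoint:, feature_category:, endpoint_id:)
  return if health_endpoint # health endpoints are not counted
  return if status >= 500   # assumption: any 5xx counts as unsuccessful

  labels = { feature_category: feature_category, endpoint_id: endpoint_id }
  TOTAL.increment(labels)
  SUCCESS.increment(labels) if duration_s < APDEX_THRESHOLD_SECONDS
end
```

With this shape, a slow but successful request raises only the total, so the ratio of the two counters drops.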

The metrics would also be initialized on the first scrape. This means that a 0 would be available for every set of labels, avoiding bugs in calculations with these metrics.
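To illustrate what initializing to 0 means, here is a small sketch using a plain hash in place of the metrics registry. The label values and the `initialize_series` helper are hypothetical; the MR derives the real label sets from the application's endpoints.

```ruby
# Hypothetical label sets for illustration only.
KNOWN_LABEL_SETS = [
  { feature_category: 'example_category_a', endpoint_id: 'ExampleController#show' },
  { feature_category: 'example_category_b', endpoint_id: 'ExampleController#index' }
].freeze

# Returns a hash with an explicit 0 entry for every known label set,
# so each series exists before the first request is recorded.
def initialize_series(label_sets)
  label_sets.each_with_object({}) { |labels, series| series[labels] = 0 }
end

series = initialize_series(KNOWN_LABEL_SETS)

# Every known label set is present, even without traffic.
KNOWN_LABEL_SETS.each { |labels| puts "#{labels[:endpoint_id]} => #{series.fetch(labels)}" }
```

Without the explicit 0, a series would first appear with a value of 1, and a query that measures the increase of the counter could miss that first request.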

To get all of the feature_category and endpoint_id values for the initialization, we had to move some code that iterates over all endpoints, and that was previously only used in tests, into the application.

We know this would initialize about 2 * 2500 metrics per pod running a web server, so we'd like to roll this out in a controlled fashion to make sure it doesn't impact our monitoring. That is why this is behind a feature flag.

This also limits the initialization of these metrics to web processes, so they don't get generated for consoles or runner processes.
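For a sense of scale, the estimate above can be written out as a small calculation. The `pods:` argument is a hypothetical input; the description only gives the per-pod figure.

```ruby
COUNTERS_PER_LABEL_SET = 2 # total and success_total
APPROX_LABEL_SETS = 2500   # approximate figure from the description

# Estimated number of initialized series across a given number of web pods.
def initialized_series(pods:)
  COUNTERS_PER_LABEL_SET * APPROX_LABEL_SETS * pods
end

puts initialized_series(pods: 1) # prints 5000
```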

This also includes a developer-api to define SLIs and encourages initializing them with the known label sets.
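The sketch below shows one possible shape for such a developer API. It is an assumption for illustration, not the interface added in this MR: `SketchSli` and its methods `initialize_counters`, `increment`, and `apdex` are hypothetical names, and the counters are kept in plain hashes rather than in a metrics registry.

```ruby
# Hypothetical SLI helper that pairs a total counter with a success counter.
class SketchSli
  attr_reader :name, :totals, :successes

  def initialize(name)
    @name = name
    @totals = Hash.new(0)
    @successes = Hash.new(0)
  end

  # Creates an explicit 0 entry in both counters for every known label set.
  def initialize_counters(label_sets)
    label_sets.each do |labels|
      @totals[labels] += 0
      @successes[labels] += 0
    end
    self
  end

  # Records one measurement; success is decided by the caller.
  def increment(labels:, success:)
    @totals[labels] += 1
    @successes[labels] += 1 if success
  end

  # Ratio of successful to total measurements, or nil without any traffic.
  def apdex(labels)
    total = @totals[labels]
    return nil if total.zero?

    @successes[labels].to_f / total
  end
end

labels = { feature_category: 'example_category', endpoint_id: 'ExampleController#show' }
sli = SketchSli.new(:rails_request_apdex).initialize_counters([labels])
sli.increment(labels: labels, success: true)
sli.increment(labels: labels, success: false)
puts sli.apdex(labels) # prints 0.5
```

Taking the label sets as an argument to the initialization step nudges callers to list their known labels up front, which is the behaviour the description wants to encourage.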

For gitlab-com/gl-infra/scalability#1099 (closed)

Screenshots or Screencasts (strongly suggested)

A local instance that did not receive any requests looks like this:

image

When I start hitting it:

image

As we can see, the metrics start at 0 before any of them receive traffic.


Edited by Bob Van Landuyt
