Transactionally record error- and apdex rates for aggregation sets (!6243) · Merge requests · GitLab.com / Runbooks

Bob Van Landuyt requested to merge bvl/transactional-rate-recordings into master Sep 01, 2023

Record transactional rates from source metrics

This allows specifying errorRates or apdexRates on an aggregation-set that will be used for aggregating from source metrics.

Doing this records both sides of what will become a ratio in a single recording rule. A recorded_rate label, added using label_replace, signifies which rate is being recorded.

The recording rules for error rates look as follows:

record: 'source_error:rates_5m',
expr: |
  label_replace(
    sum by (a,b) (
      rate(some_error_total_count{}[5m] offset 2s)
    )
    or
    (
      0 * sum by (a,b) (
        rate(some_total_count{}[5m] offset 2s)
      )
    ),
    'recorded_rate', 'error_rate' , '', ''
  )
  or
  label_replace(
    sum by (a,b) (
      rate(some_total_count{}[5m] offset 2s)
    ),
    'recorded_rate', 'ops_rate' , '', ''
  )

Note the fallback to 0 * ops_rate in the error_rate portion. This makes sure that we record an error rate of 0 when the error rate is missing, as long as the operation rate is present. We do this because, more often than not, our error rates are not properly initialized. Without this fallback, we wouldn't be able to record an error ratio until we've seen an error.
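To make the effect of the fallback concrete, here is a minimal Python sketch that models the rule on toy data. It is not the runbooks implementation: the dict-based vector model, the helper names (vector_or, with_label, record_error_rates) and the sample values are all assumptions made for illustration, and with_label is a simplified stand-in for label_replace with a constant value.

```python
# Toy model of the error-rate recording rule; not the runbooks implementation.
# An instant vector is a dict mapping a tuple of (label, value) pairs to a sample.


def vector_or(left, right):
    """Model PromQL `or`: keep all of `left`, add `right` series whose label set is new."""
    return {**right, **left}


def with_label(vector, name, value):
    """Simplified label_replace(v, name, value, '', ''): set a constant label on every series."""
    return {labels + ((name, value),): sample for labels, sample in vector.items()}


def record_error_rates(error_rate, ops_rate):
    """Combine both sides of the future ratio, falling back to 0 * ops_rate for errors."""
    zero_fallback = {labels: 0 * sample for labels, sample in ops_rate.items()}
    errors = vector_or(error_rate, zero_fallback)
    return vector_or(
        with_label(errors, "recorded_rate", "error_rate"),
        with_label(ops_rate, "recorded_rate", "ops_rate"),
    )


# Hypothetical data: two series have operations, only one has ever seen an error.
ops = {(("a", "x"), ("b", "y")): 12.0, (("a", "x"), ("b", "z")): 3.0}
seen_errors = {(("a", "x"), ("b", "y")): 0.5}

recorded = record_error_rates(seen_errors, ops)
for labels, sample in sorted(recorded.items()):
    print(dict(labels), sample)
```

In this model, the series with b="z" gets an error_rate sample of 0 even though no error counter exists for it, so both sides of the ratio are present from the first evaluation.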

The recording rule for apdex ratios looks as follows:

record: 'source_apdex:rates_5m',
expr: |
  label_replace(
    sum by (a,b) (
      rate(some_apdex_success_total_count{}[5m] offset 2s)
    ),
    'recorded_rate', 'success_rate' , '', ''
  )
  or
  label_replace(
    sum by (a,b) (
      rate(some_apdex_total_count{}[5m] offset 2s)
    ),
    'recorded_rate', 'apdex_weight' , '', ''
  )

Here we aren't using the fallback because we expect most of our apdex measurements to be successful, meaning the metric would not often be missing.
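This merge request does not yet record ratios, but the following Python sketch shows how the two sides recorded in one rule could later be divided. The dict-based vector model, the function name ratio_from_recorded and the sample values are assumptions for illustration only. Unlike PromQL, this sketch simply skips series whose denominator is zero.

```python
# Toy model of deriving a ratio from a transactional recording; illustrative only.
# A vector is a dict mapping a tuple of (label, value) pairs to a sample.


def ratio_from_recorded(recorded, numerator, denominator):
    """Divide the `numerator` side by the `denominator` side per remaining label set."""

    def side(name):
        # Select one side of the ratio and drop the recorded_rate label so both sides match.
        return {
            tuple(pair for pair in labels if pair[0] != "recorded_rate"): sample
            for labels, sample in recorded.items()
            if ("recorded_rate", name) in labels
        }

    num, den = side(numerator), side(denominator)
    return {
        labels: num[labels] / den[labels]
        for labels in num
        if labels in den and den[labels] != 0
    }


# Hypothetical samples as they might come out of source_apdex:rates_5m.
recorded_apdex = {
    (("a", "x"), ("b", "y"), ("recorded_rate", "success_rate")): 7.5,
    (("a", "x"), ("b", "y"), ("recorded_rate", "apdex_weight")): 10.0,
}

apdex_ratio = ratio_from_recorded(recorded_apdex, "success_rate", "apdex_weight")
print(apdex_ratio)
```

Because both sides come from the same rule evaluation, the numerator and denominator in such a ratio always describe the same time window.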

Allow generating rule files in nested directories

This allows generating rules in subdirectories for different environments.

This is not yet supported by our current Thanos and Prometheus setup, which receives its rules through the syncinator. The new Thanos setup, already in use for thanos-staging, does support it.

It also updates the make generate script to delete rule files in subdirectories so they don't get left behind when regenerating.
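The cleanup step can be sketched in Python as a recursive delete of generated files. This is only an illustration of the idea, not the actual make generate script: the function name clean_generated_rules, the "*.yml" pattern and the directory layout are assumptions.

```python
from pathlib import Path


def clean_generated_rules(rules_dir: Path, pattern: str = "*.yml") -> list[Path]:
    """Delete files matching `pattern` in `rules_dir` and all nested directories.

    Returns the removed paths so a caller can log what was cleaned up.
    """
    removed = []
    # sorted() materializes the recursive listing before anything is deleted.
    for path in sorted(rules_dir.rglob(pattern)):
        if path.is_file():
            path.unlink()
            removed.append(path)
    return removed
```

Running a step like this before regenerating means a rule file that is no longer produced, for example for a removed environment, does not linger in a subdirectory.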

[Thanos-staging] Transactional rates from source metrics

This adds experimental recording rules for recording aggregation-sets from source metrics in thanos staging.

The experimental aggregation set includes the new transactional rates alongside the old rates we currently record in Prometheus. To start, this is done for 4 services.

This does not yet allow recording ratios from the source metrics or in transformations. This will be done in gitlab-com/gl-infra/scalability#2475 (closed) later when we get rid of the intermediate recording rules in Prometheus.

[Thanos-staging] Record global aggregations with transactional rates

This takes the transactional rates from the source aggregation we added in the previous commit and transforms them into the service aggregation in thanos-staging.

This does not yet include using these transactional rates in the ratio recordings.

Edited Sep 18, 2023 by Bob Van Landuyt
