Use intermediate recording rules in the metrics catalog

This change allows the metrics catalog to use and manage its own recording rule evaluations.

The goal is to reduce the amount of work Prometheus needs to do to evaluate SLIs, particularly on very high cardinality series that are used repeatedly for some SLI recording rules.

What's Happening?

Currently the metrics catalog generates recording rules directly from 'raw' underlying metrics:

`raw metric` -> `(aggregation)` -> `aggregated component level aggregation #1`
`raw metric` -> `(aggregation)` -> `aggregated component level aggregation #2`
`raw metric` -> `(aggregation)` -> `aggregated component level aggregation #3`

This change allows intermediate recording rules, which should greatly improve the performance of the recording rules evaluated through the metrics catalog:

`raw metric` -> `(intermediate aggregation)`
                `(intermediate aggregation)` -> `aggregated component level aggregation #1`
                `(intermediate aggregation)` -> `aggregated component level aggregation #2`
                `(intermediate aggregation)` -> `aggregated component level aggregation #3`

Since the intermediate aggregations have much lower cardinality than the raw metrics, and they are reused in multiple recording rules, pre-aggregating them allows for far fewer series to be evaluated on each cycle.

How it works

To use intermediate recording rules, list the metrics they should apply to in the metrics catalog entry for a service, using the recordingRuleMetrics attribute. For example:

  // Use recordingRuleMetrics to specify a set of metrics with known high
  // cardinality. The metrics catalog will generate recording rules with
  // the appropriate aggregations based on this set.
  // Use sparingly.
  recordingRuleMetrics: [
    'sidekiq_jobs_completion_seconds_bucket',
    'sidekiq_jobs_queue_duration_seconds_bucket',
    'sidekiq_jobs_failed_total',
  ],

Step 1: Collecting labels

The metrics catalog will inspect all of its metric definitions to obtain the list of labels that the intermediate recording rules will aggregate over.

This uses:

  1. The standard GitLab.com label taxonomy: environment, tier, type, shard, stage
  2. The labels for any selectors used in SLIs in the metrics catalog
  3. Any significantLabels

For example, given the following (semi-hypothetical) component definition from the metrics catalog:

sidekiq_urgent: {
  apdex: histogramApdex(
    histogram='sidekiq_jobs_completion_seconds_bucket',
    selector={ urgency: 'high' },
    satisfiedThreshold=10,
  ),
        
  requestRate: rateMetric(
    counter='sidekiq_jobs_completion_seconds_bucket',
    selector={ urgency: 'high', le: '+Inf' },
  ),

  errorRate: rateMetric(
    counter='sidekiq_jobs_failed_total',
    selector={ urgency: 'high' },
  ),

  significantLabels: ['feature_category'],
}

The following labels will be used in the recording rules:

  • sidekiq_jobs_completion_seconds_bucket -> environment, tier, type, shard, stage (standard labels), urgency, le (from selector), feature_category (from significantLabels)
  • sidekiq_jobs_failed_total -> environment, tier, type, shard, stage (standard labels), urgency (from selector), feature_category (from significantLabels)

These labels are automatically gleaned from the metrics-catalog through reflection. If the definitions change, then the recording rule will change too.
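The label collection described above can be sketched in Python. This is only an illustration: the real metrics catalog is written in jsonnet, and the function name `collect_labels`, the `slis` list, and the dictionary layout are assumptions made for this example.

```python
# Hypothetical sketch: names and data shapes are assumptions, not the
# metrics catalog's actual (jsonnet) implementation.
STANDARD_LABELS = ["environment", "tier", "type", "shard", "stage"]


def collect_labels(components, recording_rule_metrics):
    """Map each recording-rule metric to the labels its rule must keep."""
    labels = {m: list(STANDARD_LABELS) for m in recording_rule_metrics}

    def add(metric, new_labels):
        # Append labels in first-seen order, skipping duplicates.
        for label in new_labels:
            if label not in labels[metric]:
                labels[metric].append(label)

    for component in components.values():
        significant = component.get("significantLabels", [])
        for sli in component["slis"]:
            metric = sli["metric"]
            if metric not in labels:
                continue  # metric is not listed in recordingRuleMetrics
            add(metric, sli["selector"].keys())
            add(metric, significant)
    return labels


# The sidekiq_urgent component from the text, flattened into plain data.
components = {
    "sidekiq_urgent": {
        "slis": [
            {"metric": "sidekiq_jobs_completion_seconds_bucket",
             "selector": {"urgency": "high"}},
            {"metric": "sidekiq_jobs_completion_seconds_bucket",
             "selector": {"urgency": "high", "le": "+Inf"}},
            {"metric": "sidekiq_jobs_failed_total",
             "selector": {"urgency": "high"}},
        ],
        "significantLabels": ["feature_category"],
    },
}

result = collect_labels(
    components,
    ["sidekiq_jobs_completion_seconds_bucket", "sidekiq_jobs_failed_total"],
)
```

With this input, `result` holds the same label sets as the two bullet points above: the bucket metric keeps `le`, while `sidekiq_jobs_failed_total` does not.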

Step 2: Substitution

When a PromQL expression is being generated, the metrics catalog will determine whether the expression can be substituted for a recording rule. If it can be, the expression will be replaced.

The conditions for substitution are:

  1. The aggregation labels of the expression are a subset of the labels aggregated by the intermediate recording rule generated in step 1.
  2. The selector labels of the expression are a subset of the labels aggregated by the intermediate recording rule generated in step 1.
  3. The range selector is one for which a recording rule is being generated. Currently these are: 1m, 5m, 30m, 1h and 6h.
  4. In the special case of the range selector being $__interval, the range selector of 5m is used. This is because variable ranges cannot be used with recording rules.
  5. Currently, the evaluator will only work with sum(rate()) expressions, but this could be extended in future.
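The conditions above can be expressed as a small check. The following Python sketch is hypothetical: the function `substitute` and its parameters are invented for illustration, it only handles the `sum(rate())` shape, and the recorded series name follows the `sli_aggregations:<metric>_rate<range>` pattern seen in the examples in this description.

```python
# Hypothetical sketch of the substitution check; the real logic lives in
# the jsonnet metrics catalog and may differ in detail.
RECORDED_RANGES = {"1m", "5m", "30m", "1h", "6h"}


def substitute(metric, selector, aggregation_labels, range_, rule_labels):
    """Build sum by (...) (rate(metric{selector}[range])), substituting the
    intermediate recording rule when all conditions hold.

    rule_labels maps each metric to the labels kept by its intermediate rule.
    """
    # Condition 4: $__interval falls back to the 5m recording.
    effective_range = "5m" if range_ == "$__interval" else range_
    matchers = ",".join(f'{k}="{v}"' for k, v in sorted(selector.items()))
    by = ",".join(aggregation_labels)
    kept = set(rule_labels.get(metric, ()))

    can_substitute = (
        metric in rule_labels
        and set(aggregation_labels) <= kept    # condition 1
        and set(selector) <= kept              # condition 2
        and effective_range in RECORDED_RANGES  # condition 3
    )
    if can_substitute:
        inner = f"sli_aggregations:{metric}_rate{effective_range}{{{matchers}}}"
    else:
        # Fall back to the raw metric with the original range selector.
        inner = f"rate({metric}{{{matchers}}}[{range_}])"
    return f"sum by ({by}) ({inner})"


rule_labels = {
    "sidekiq_jobs_completion_seconds_bucket": [
        "environment", "tier", "type", "shard", "stage",
        "urgency", "le", "feature_category",
    ],
}

substituted = substitute(
    "sidekiq_jobs_completion_seconds_bucket",
    {"le": "+Inf", "shard": "urgent-cpu-bound"},
    ["environment", "stage"],
    "6h",
    rule_labels,
)
# "queue" is not kept by the intermediate rule, so the raw metric is used.
not_substituted = substitute(
    "sidekiq_jobs_completion_seconds_bucket",
    {"queue": "default"},
    ["environment", "stage"],
    "6h",
    rule_labels,
)
```

Falling back to the raw expression when any condition fails keeps the generated output correct for metrics and label combinations that have no intermediate rule.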

Examples

Operation Rate pre-substitution:

   - record: gitlab_component_ops:rate_6h
     labels:
       component: shard_urgent_cpu_bound
       tier: sv
       type: sidekiq
     expr: |
       sum by (environment,stage) (
         rate(sidekiq_jobs_completion_seconds_bucket{le="+Inf",shard="urgent-cpu-bound"}[6h])
       )

Operation Rate post-substitution:

   - record: gitlab_component_ops:rate_6h
     labels:
       component: shard_urgent_cpu_bound
       tier: sv
       type: sidekiq
     expr: |
       sum by (environment,stage) (
         sli_aggregations:sidekiq_jobs_completion_seconds_bucket_rate6h{le="+Inf",shard="urgent-cpu-bound"}
       )

These substitutions are applied in all places that the expression is generated, including Alert Annotations, Recording Rules and Grafana expressions.
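The post-substitution example reads from a series named `sli_aggregations:sidekiq_jobs_completion_seconds_bucket_rate6h`. As a rough sketch of how the intermediate rules behind such series might be generated, the Python below emits one rule per supported range. The `sum by (...) (rate(...))` form of the intermediate expression and the helper name `intermediate_rules` are assumptions; this description does not show the generated intermediate rules themselves.

```python
# Hypothetical sketch: the expression shape is an assumption based on the
# sum(rate()) restriction and the series name in the example above.
RANGES = ["1m", "5m", "30m", "1h", "6h"]


def intermediate_rules(metric, labels):
    """Return one intermediate recording rule per supported range."""
    by = ",".join(labels)
    return [
        {
            "record": f"sli_aggregations:{metric}_rate{r}",
            "expr": f"sum by ({by}) (rate({metric}[{r}]))",
        }
        for r in RANGES
    ]


rules = intermediate_rules(
    "sidekiq_jobs_failed_total",
    ["environment", "tier", "type", "shard", "stage",
     "urgency", "feature_category"],
)
```

Each raw metric is then read once per range by its intermediate rule, and the component-level rules aggregate the lower-cardinality output.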

Reviewing this MR

This MR consists of two commits:

  • Allow the use of recording rules in the metrics catalog - this adds the functionality without adding any new recording rules. This is effectively a no-op: the output is the same as before, except for some label selector order changes and whitespace changes.
  • Use recording rules for the sidekiq service - this adds recording rules for the sidekiq service and regenerates the recording rules appropriately
Edited by Andrew Newdigate
