feat: service alert dependencies (!4710) · Merge requests · GitLab.com / Runbooks

What

Introduce a new field in the SLI definition, dependsOn, to specify a hard dependency on another component. When a component depends on another component, we create inhibit_rules that prevent the upstream (dependent) component from alerting if the downstream component it depends on is already firing.

There are some validation rules for dependsOn:

The type exists.
The component belongs to the specified type.
An SLI can't depend on an SLI of the same type. This can change in the future, but for now we prevent it until we better understand the usage of inhibit_rules.
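The three rules above can be sketched as follows. This is a minimal Python illustration, not the code from this merge request: the `CATALOG` mapping and the `validate_depends_on` function are hypothetical names, and the catalog only contains the types and components that appear in the example below.

```python
# Hypothetical, simplified stand-in for the metrics catalog:
# service type -> set of component (SLI) names.
CATALOG = {
    "web": {"workhorse", "imagescaler"},
    "patroni": {"rails_primary_sql"},
}


def validate_depends_on(sli_type, depends_on, catalog=CATALOG):
    """Return a list of validation errors; an empty list means dependsOn is valid."""
    errors = []
    for dep in depends_on:
        dep_type = dep["type"]
        dep_component = dep["component"]

        # Rule 1: the type exists.
        if dep_type not in catalog:
            errors.append(f"type '{dep_type}' does not exist")
            continue

        # Rule 2: the component belongs to the specified type.
        if dep_component not in catalog[dep_type]:
            errors.append(
                f"component '{dep_component}' does not belong to type '{dep_type}'"
            )

        # Rule 3: an SLI can't depend on an SLI of the same type.
        if dep_type == sli_type:
            errors.append(f"SLI of type '{sli_type}' can't depend on the same type")
    return errors


ok = validate_depends_on("web", [{"component": "rails_primary_sql", "type": "patroni"}])
bad_type = validate_depends_on("web", [{"component": "rails_primary_sql", "type": "nope"}])
same_type = validate_depends_on("web", [{"component": "imagescaler", "type": "web"}])
```

Here `ok` is empty, while `bad_type` and `same_type` each contain one error.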

For example, if we apply the patch below:

diff --git a/metrics-catalog/services/web.jsonnet b/metrics-catalog/services/web.jsonnet
index a3edcb3f..f7a09765 100644
--- a/metrics-catalog/services/web.jsonnet
+++ b/metrics-catalog/services/web.jsonnet
@@ -138,6 +138,13 @@ metricsCatalog.serviceDefinition({
         toolingLinks.sentry(slug='gitlab/gitlab-workhorse-gitlabcom'),
         toolingLinks.kibana(title='Workhorse', index='workhorse', type='web', slowRequestSeconds=10),
       ],
+
+      dependsOn: [
+        {
+          component: 'rails_primary_sql',
+          type: 'patroni',
+        },
+      ],
     },

     imagescaler: {

We'll get the following inhibit_rules:

inhibit_rules:
- equal:
  - env
  - environment
  - pager
  source_matchers:
  - component="rails_primary_sql"
  - type="patroni"
  target_matchers:
  - component="workhorse"
  - type="web"

When rails_primary_sql is firing, the workhorse SLO alert is inhibited for the same env, environment, and pager labels, so it will not page.
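To make the mapping from dependsOn to inhibit_rules concrete, here is a rough Python sketch. It is an assumption-laden illustration rather than the generator used in this merge request: `inhibit_rules_for`, `to_matchers`, and `is_inhibited` are hypothetical helpers, the `env` value `"prod"` is made up, and the matching logic only models simple equality matchers plus Alertmanager's `equal` semantics (a missing label is treated the same as an empty one).

```python
EQUAL_LABELS = ["env", "environment", "pager"]


def inhibit_rules_for(sli_type, sli_component, depends_on):
    """Build one inhibit rule per dependsOn entry.

    The dependency is the source (the alert that inhibits);
    the depending SLI is the target (the alert that gets muted).
    """
    return [
        {
            "source": {"component": dep["component"], "type": dep["type"]},
            "target": {"component": sli_component, "type": sli_type},
            "equal": EQUAL_LABELS,
        }
        for dep in depends_on
    ]


def to_matchers(labels):
    """Render a label dict as matcher strings, as in the YAML above."""
    return [f'{name}="{value}"' for name, value in labels.items()]


def matches(alert, wanted):
    """True if the alert carries every wanted label with the wanted value."""
    return all(alert.get(name) == value for name, value in wanted.items())


def is_inhibited(target_alert, firing_alerts, rules):
    """True if any firing alert inhibits target_alert under the given rules."""
    for rule in rules:
        if not matches(target_alert, rule["target"]):
            continue
        for source in firing_alerts:
            same_equal_labels = all(
                source.get(label, "") == target_alert.get(label, "")
                for label in rule["equal"]
            )
            if matches(source, rule["source"]) and same_equal_labels:
                return True
    return False


rules = inhibit_rules_for(
    "web", "workhorse", [{"component": "rails_primary_sql", "type": "patroni"}]
)
# "prod" is a made-up env value for illustration.
patroni_alert = {"component": "rails_primary_sql", "type": "patroni", "env": "prod"}
workhorse_alert = {"component": "workhorse", "type": "web", "env": "prod"}

muted = is_inhibited(workhorse_alert, [patroni_alert], rules)
not_muted = is_inhibited(workhorse_alert, [], rules)
```

With the patroni alert firing in the same `env`, `muted` is true; with no firing source alert, `not_muted` is false. A patroni alert in a different `env` would not inhibit the workhorse alert, because the `equal` labels would differ.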

Why

When a service like patroni violates its SLO, upstream services that depend on it, like web, api, and git, also end up violating their SLOs. It doesn't make sense to page or alert on the web service if the patroni service is already firing.

Benchmarks

This adds some validation logic that looks up each service referenced by dependsOn. With the patch above, we see a negligible performance hit when generating the Alertmanager configuration: the mean run time goes from 4.230 s to 4.277 s, roughly 1%. These benchmarks were run using hyperfine.

Before:

$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.230 s ±  0.029 s    [User: 5.632 s, System: 0.772 s]
  Range (min … max):    4.205 s …  4.305 s    10 runs

After:

$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.277 s ±  0.038 s    [User: 5.741 s, System: 0.732 s]
  Range (min … max):    4.212 s …  4.324 s    10 runs
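As a quick sanity check on the two hyperfine results, the overhead can be computed directly from the reported means. This is only arithmetic on the numbers above; the variable names are illustrative.

```python
# Mean and standard deviation (seconds) as reported by hyperfine above.
before_mean, before_sd = 4.230, 0.029
after_mean, after_sd = 4.277, 0.038

delta = after_mean - before_mean            # absolute slowdown in seconds
relative_pct = 100 * delta / before_mean    # slowdown relative to the baseline

print(f"delta = {delta * 1000:.0f} ms ({relative_pct:.1f}% slower)")
# prints: delta = 47 ms (1.1% slower)
```

The 47 ms difference is only slightly larger than the standard deviation of either run (29 ms and 38 ms), which is consistent with calling the hit negligible.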

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15766

Edited Jun 21, 2022 by Steve Xuereb
