feat: service alert dependencies · 46ed3118
Steve Xuereb authored Jun 16, 2022


What
---
Introduce a new `dependsOn` field in the SLI definition to specify a hard
dependency on another component. When a component depends on another
component, we generate [inhibit_rules](https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition)
that prevent the upstream (dependent) component from alerting while the downstream component it depends on is already firing.

There are some validation rules for `dependsOn`:
- The `type` exists.
- The `component` belongs to the specified `type`.
- An SLI can't depend on an SLI of the same `type`. This can change in
  the future, but for now we prevent it so we can better understand the
  usage of `inhibit_rules`.
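
The three rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the Jsonnet validation added in this commit: `CATALOG` and `validate_depends_on` are hypothetical names, and the catalog contains only the services mentioned in this message.

```python
from typing import Dict, List, Set

# Hypothetical, simplified catalog: service type -> SLI component names.
# The real catalog is defined in Jsonnet; these entries are assumptions
# taken from the example in this commit message.
CATALOG: Dict[str, Set[str]] = {
    "web": {"workhorse", "imagescaler"},
    "patroni": {"rails_primary_sql"},
}


def validate_depends_on(
    service_type: str,
    depends_on: List[Dict[str, str]],
    catalog: Dict[str, Set[str]],
) -> List[str]:
    """Return validation errors for a dependsOn list (empty if valid)."""
    errors: List[str] = []
    for dep in depends_on:
        dep_type, dep_component = dep["type"], dep["component"]
        if dep_type not in catalog:
            # Rule 1: the type must exist.
            errors.append(f"type '{dep_type}' does not exist")
        elif dep_component not in catalog[dep_type]:
            # Rule 2: the component must belong to the specified type.
            errors.append(
                f"component '{dep_component}' does not belong to type '{dep_type}'"
            )
        if dep_type == service_type:
            # Rule 3: an SLI can't depend on an SLI of the same type.
            errors.append(f"SLI of type '{service_type}' can't depend on the same type")
    return errors
```

With this sketch, the `web` dependency on `patroni`/`rails_primary_sql` shown below would pass, while a `web` SLI depending on `web`/`imagescaler` would be rejected by the third rule.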

For example, if we apply the patch below:
```diff
diff --git a/metrics-catalog/services/web.jsonnet b/metrics-catalog/services/web.jsonnet
index a3edcb3f..f7a09765 100644
--- a/metrics-catalog/services/web.jsonnet
+++ b/metrics-catalog/services/web.jsonnet
@@ -138,6 +138,13 @@ metricsCatalog.serviceDefinition({
         toolingLinks.sentry(slug='gitlab/gitlab-workhorse-gitlabcom'),
         toolingLinks.kibana(title='Workhorse', index='workhorse', type='web', slowRequestSeconds=10),
       ],
+
+      dependsOn: [
+        {
+          component: 'rails_primary_sql',
+          type: 'patroni',
+        },
+      ],
     },

     imagescaler: {
```

We'll get the following `inhibit_rules`:
```
inhibit_rules:
- equal:
  - env
  - environment
  - pager
  source_matchers:
  - component="rails_primary_sql"
  - type="patroni"
  target_matchers:
  - component="workhorse"
  - type="web"
```

While `rails_primary_sql` is firing, the `workhorse` SLO alert is inhibited, provided both alerts share the same `env`, `environment`, and `pager` label values.
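
To make the mapping from `dependsOn` to the rule above concrete, here is a minimal Python sketch. It is not the Jsonnet generator from this commit: `inhibit_rules_for` is a hypothetical helper, and the `equal` labels are simply copied from the generated output shown above.

```python
from typing import Dict, List


def inhibit_rules_for(
    service_type: str,
    component: str,
    depends_on: List[Dict[str, str]],
) -> List[dict]:
    """Build one inhibit rule per dependency of the given SLI."""
    return [
        {
            # Inhibition only applies when these labels match on both alerts.
            "equal": ["env", "environment", "pager"],
            # Source: the component we depend on (the one already firing).
            "source_matchers": [
                f'component="{dep["component"]}"',
                f'type="{dep["type"]}"',
            ],
            # Target: the dependent component whose alert gets inhibited.
            "target_matchers": [
                f'component="{component}"',
                f'type="{service_type}"',
            ],
        }
        for dep in depends_on
    ]


rules = inhibit_rules_for(
    "web",
    "workhorse",
    [{"component": "rails_primary_sql", "type": "patroni"}],
)
```

Here the dependency becomes the source of the rule and the dependent SLI becomes the target, which is why `workhorse` is silenced rather than `rails_primary_sql`.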

Why
---
When a service like `patroni` violates its SLO, upstream services that
depend on it, like `web`, `api`, and `git`, also end up violating theirs.
It doesn't make sense to page or alert on the `web` service if the
`patroni` service is already firing.

Benchmarks
---
This adds some validation logic that looks up each service referenced in
`dependsOn`. With the patch above we see a negligible performance hit when
generating the Alertmanager configuration (mean run time goes from 4.230 s
to 4.277 s). These benchmarks were run using
[hyperfine](https://github.com/sharkdp/hyperfine).

Before:
```
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.230 s ±  0.029 s    [User: 5.632 s, System: 0.772 s]
  Range (min … max):    4.205 s …  4.305 s    10 runs
```

After:
```
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.277 s ±  0.038 s    [User: 5.741 s, System: 0.732 s]
  Range (min … max):    4.212 s …  4.324 s    10 runs
```
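
As a quick sanity check on these numbers, the following Python sketch computes the difference between the two reported means and compares it with the reported standard deviations. The `overhead` helper is hypothetical; the inputs are copied from the hyperfine output above.

```python
from typing import Tuple


def overhead(before_s: float, after_s: float) -> Tuple[float, float]:
    """Return (absolute slowdown in seconds, slowdown relative to before)."""
    delta = after_s - before_s
    return delta, delta / before_s


# Means and standard deviations reported by hyperfine above.
before_mean_s, before_sigma_s = 4.230, 0.029
after_mean_s, after_sigma_s = 4.277, 0.038

delta_s, relative = overhead(before_mean_s, after_mean_s)
print(f"{delta_s * 1000:.0f} ms slower ({relative:.1%})")  # 47 ms slower (1.1%)
```

The difference is of the same order as the run-to-run standard deviation (0.029 s to 0.038 s), so it is close to measurement noise.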

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15766
Signed-off-by: Steve Azzopardi <sazzopardi@gitlab.com>