Steve Xuereb authored
What
---

Introduce a new field on the SLI definition, `dependsOn`, to specify a hard dependency on another component. When a component depends on another component, we create [inhibit_rules](https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition) that prevent the upstream (dependent) component from alerting if the downstream component it depends on is already firing.

There are some validation rules for `dependsOn`:

- The `type` exists.
- The `component` belongs to the specified `type`.
- The SLI can't depend on an SLI of the same `type`. This can change in the future, but for now let's prevent it to better understand the usage of `inhibit_rules`.

For example, if we apply the patch below:

```diff
diff --git a/metrics-catalog/services/web.jsonnet b/metrics-catalog/services/web.jsonnet
index a3edcb3f..f7a09765 100644
--- a/metrics-catalog/services/web.jsonnet
+++ b/metrics-catalog/services/web.jsonnet
@@ -138,6 +138,13 @@ metricsCatalog.serviceDefinition({
         toolingLinks.sentry(slug='gitlab/gitlab-workhorse-gitlabcom'),
         toolingLinks.kibana(title='Workhorse', index='workhorse', type='web', slowRequestSeconds=10),
       ],
+
+      dependsOn: [
+        {
+          component: 'rails_primary_sql',
+          type: 'patroni',
+        },
+      ],
     },

     imagescaler: {
```

We'll get the following `inhibit_rules`:

```
inhibit_rules:
- equal:
  - env
  - environment
  - pager
  source_matchers:
  - component="rails_primary_sql"
  - type="patroni"
  target_matchers:
  - component="workhorse"
  - type="web"
```

When `rails_primary_sql` is firing, the `workhorse` SLO alert will not fire.

Why
---

When a service like `patroni` violates its SLO, upstream services that depend on it, like `web`, `api`, and `git`, also end up violating their SLOs. It doesn't make sense to page/alert on the `web` service if the `patroni` service is already firing.

Benchmarks
---

This adds some validation logic that looks up the service referenced by each `dependsOn` entry. With the patch above we see a negligible performance hit when generating the alertmanager configuration (roughly 1% on the mean run time).
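To make the extra work being measured concrete, here is a minimal Python sketch of the lookup-and-validate step and the rule it produces. It is an illustration only, not the code from this change: the `CATALOG` mapping and the function names `validate_depends_on` and `inhibit_rules_for` are hypothetical, and the catalog contents are reduced to the components named in the example above.

```python
# Hypothetical, simplified catalog: service type -> SLI component names.
CATALOG = {
    "web": {"workhorse", "imagescaler"},
    "patroni": {"rails_primary_sql"},
}


def validate_depends_on(sli_type, depends_on, catalog):
    """Raise ValueError if a dependsOn entry breaks one of the three rules."""
    for dep in depends_on:
        dep_type = dep["type"]
        dep_component = dep["component"]
        # Rule 1: the type exists.
        if dep_type not in catalog:
            raise ValueError("type %r does not exist" % dep_type)
        # Rule 2: the component belongs to the specified type.
        if dep_component not in catalog[dep_type]:
            raise ValueError(
                "component %r does not belong to type %r" % (dep_component, dep_type)
            )
        # Rule 3: an SLI can't depend on an SLI of the same type.
        if dep_type == sli_type:
            raise ValueError("an SLI can't depend on an SLI of the same type")


def inhibit_rules_for(sli_type, sli_component, depends_on, catalog):
    """Build one inhibit rule per validated dependency."""
    validate_depends_on(sli_type, depends_on, catalog)
    rules = []
    for dep in depends_on:
        rules.append({
            "equal": ["env", "environment", "pager"],
            # The dependency is the source: while it fires, the target is muted.
            "source_matchers": [
                'component="%s"' % dep["component"],
                'type="%s"' % dep["type"],
            ],
            "target_matchers": [
                'component="%s"' % sli_component,
                'type="%s"' % sli_type,
            ],
        })
    return rules


rules = inhibit_rules_for(
    "web",
    "workhorse",
    [{"component": "rails_primary_sql", "type": "patroni"}],
    CATALOG,
)
```

Each dependency costs one dictionary lookup and one set membership check per entry, which is consistent with the small difference in the timings.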
These benchmarks were run using [hyperfine](https://github.com/sharkdp/hyperfine).

Before:

```
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.230 s ±  0.029 s    [User: 5.632 s, System: 0.772 s]
  Range (min … max):    4.205 s …  4.305 s    10 runs
```

After:

```
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
  Time (mean ± σ):      4.277 s ±  0.038 s    [User: 5.741 s, System: 0.732 s]
  Range (min … max):    4.212 s …  4.324 s    10 runs
```

Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15766

Signed-off-by: Steve Azzopardi <sazzopardi@gitlab.com>