feat: service alert dependencies
What
Introduce a new field in the SLI definition, dependsOn, to specify a hard dependency on another component. When a component depends on another component, we create inhibit_rules, which prevent the upstream component from alerting if the downstream component is already firing.
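To make the inhibition behaviour concrete, here is a minimal Python sketch of how an inhibit rule mutes a target alert. It is a simplified model that only handles equality matchers, not Alertmanager's actual code. The function names and the env values are hypothetical, and the real change in this MR is written in Jsonnet.

```python
# Simplified model of an Alertmanager inhibit rule (equality matchers only).
# All names and label values here are illustrative assumptions.

def matches(labels, matchers):
    """True if every matcher label is present on the alert with the same value."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """Return True if `target` is muted by some firing alert under `rule`."""
    if not matches(target, rule["target_matchers"]):
        return False
    for source in firing:
        if not matches(source, rule["source_matchers"]):
            continue
        # Source and target must agree on every label listed in `equal`;
        # a missing label is treated like an empty value.
        if all(source.get(l, "") == target.get(l, "") for l in rule["equal"]):
            return True
    return False


rule = {
    "source_matchers": {"type": "patroni", "component": "rails_primary_sql"},
    "target_matchers": {"type": "web", "component": "workhorse"},
    "equal": ["env"],
}
patroni_alert = {"type": "patroni", "component": "rails_primary_sql", "env": "prod"}
workhorse_alert = {"type": "web", "component": "workhorse", "env": "prod"}
workhorse_staging = {"type": "web", "component": "workhorse", "env": "staging"}
```

With these inputs, `is_inhibited(workhorse_alert, [patroni_alert], rule)` is true, while the staging alert is not muted because its env label differs from the firing source alert.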
There are some validation rules for dependsOn:
- The type exists.
- The component belongs to the specified type.
- The SLI can't depend on an SLI of the same type. This can change in the future, but for now let's prevent it to better understand the usage of inhibit_rules.
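The three rules above can be sketched roughly as follows. This is a Python illustration of the checks, not the Jsonnet validation added in this MR; the function name, the shape of the catalog mapping, and the error messages are assumptions made for the example.

```python
def validate_depends_on(catalog, sli_type, depends_on):
    """Return a list of error strings; an empty list means all entries are valid.

    `catalog` is a hypothetical mapping of service type -> set of component names.
    `sli_type` is the type of the service that owns the SLI being validated.
    """
    errors = []
    for dep in depends_on:
        dep_type, component = dep["type"], dep["component"]
        if dep_type not in catalog:
            # Rule 1: the type exists.
            errors.append(f"type {dep_type!r} does not exist")
        elif component not in catalog[dep_type]:
            # Rule 2: the component belongs to the specified type.
            errors.append(f"component {component!r} does not belong to type {dep_type!r}")
        elif dep_type == sli_type:
            # Rule 3: no dependency on an SLI of the same type (for now).
            errors.append(f"SLI of type {sli_type!r} cannot depend on the same type")
    return errors


# Hypothetical, heavily trimmed catalog for illustration.
catalog = {
    "patroni": {"rails_primary_sql"},
    "web": {"workhorse", "imagescaler"},
}
```

For instance, `validate_depends_on(catalog, "web", [{"type": "patroni", "component": "rails_primary_sql"}])` returns an empty list, while pointing the same SLI at `{"type": "web", "component": "imagescaler"}` yields a same-type error.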
For example, if we apply the patch below:
diff --git a/metrics-catalog/services/web.jsonnet b/metrics-catalog/services/web.jsonnet
index a3edcb3f..f7a09765 100644
--- a/metrics-catalog/services/web.jsonnet
+++ b/metrics-catalog/services/web.jsonnet
@@ -138,6 +138,13 @@ metricsCatalog.serviceDefinition({
toolingLinks.sentry(slug='gitlab/gitlab-workhorse-gitlabcom'),
toolingLinks.kibana(title='Workhorse', index='workhorse', type='web', slowRequestSeconds=10),
],
+
+ dependsOn: [
+ {
+ component: 'rails_primary_sql',
+ type: 'patroni',
+ },
+ ],
},
imagescaler: {
We'll get the following inhibit_rules:
inhibit_rules:
- equal:
  - env
  - environment
  - pager
  source_matchers:
  - component="rails_primary_sql"
  - type="patroni"
  target_matchers:
  - component="workhorse"
  - type="web"
When rails_primary_sql is firing, the workhorse SLO alert will not fire.
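The mapping from a dependsOn entry to an inhibit rule can be sketched like this. It is a Python illustration of the transformation under the assumption that each dependency yields exactly one rule; the function name and parameters are hypothetical, and the actual generator in this MR is Jsonnet.

```python
def inhibit_rules_for(sli_type, sli_component, depends_on,
                      equal=("env", "environment", "pager")):
    """Build one inhibit rule per dependsOn entry.

    The dependency (downstream) becomes the source of the rule and the
    dependent SLI (upstream) becomes the target that gets muted.
    """
    rules = []
    for dep in depends_on:
        rules.append({
            "equal": list(equal),
            "source_matchers": [
                f'component="{dep["component"]}"',
                f'type="{dep["type"]}"',
            ],
            "target_matchers": [
                f'component="{sli_component}"',
                f'type="{sli_type}"',
            ],
        })
    return rules


rules = inhibit_rules_for(
    sli_type="web",
    sli_component="workhorse",
    depends_on=[{"component": "rails_primary_sql", "type": "patroni"}],
)
```

Serializing `rules` under an `inhibit_rules` key would give a structure equivalent to the YAML shown for the workhorse example.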
Why
When a service like patroni violates its SLO, upstream services that depend on it, like web, api, and git, also end up violating theirs. It doesn't make sense to page/alert on the web service if the patroni service is already firing.
Benchmarks
This adds some validation logic that looks up each service referenced by dependsOn. With the patch above we see a negligible performance hit on generating the Alertmanager configuration: the mean run time goes from 4.230 s to 4.277 s, roughly 1% slower. These benchmarks were run using hyperfine.
Before:
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
Time (mean ± σ): 4.230 s ± 0.029 s [User: 5.632 s, System: 0.772 s]
Range (min … max): 4.205 s … 4.305 s 10 runs
After:
$ hyperfine --warmup 3 './alertmanager/generate.sh'
Benchmark 1: ./alertmanager/generate.sh
Time (mean ± σ): 4.277 s ± 0.038 s [User: 5.741 s, System: 0.732 s]
Range (min … max): 4.212 s … 4.324 s 10 runs
Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15766