    feat: service alert dependencies · 46ed3118
    Steve Xuereb authored
    
    
    What
    ---
    Introduce a new field in the SLI definition, `dependsOn`, to specify a
    hard dependency on another component. When a component depends on another
    component, we create [inhibit_rules](https://prometheus.io/docs/alerting/latest/alertmanager/#inhibition)
    that prevent the upstream (dependent) component from alerting if the
    downstream component it depends on is already firing.
    
    There are some validation rules for `dependsOn`:
    - The `type` exists.
    - The `component` belongs to the specified `type`.
    - An SLI can't depend on an SLI of the same `type`. This can change in
      the future, but for now let's prevent it to better understand the usage
      of `inhibit_rules`.
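
The three validation rules can be sketched as follows. This is a hypothetical Python illustration, not the repository's actual implementation: the `catalog` shape (a service type mapped to the set of its component names) and the `validate_depends_on` helper are assumptions made only for this example.

```python
# Hypothetical sketch of the three `dependsOn` validation rules.
# `catalog` maps a service type to the set of its component (SLI) names;
# this data shape is an assumption made for illustration only.


def validate_depends_on(catalog, sli_type, depends_on):
    """Return a list of error strings; an empty list means every entry is valid."""
    errors = []
    for dep in depends_on:
        dep_type, dep_component = dep["type"], dep["component"]
        if dep_type not in catalog:
            # Rule 1: the `type` must exist.
            errors.append(f"type '{dep_type}' does not exist")
        elif dep_component not in catalog[dep_type]:
            # Rule 2: the `component` must belong to the specified `type`.
            errors.append(
                f"component '{dep_component}' does not belong to type '{dep_type}'"
            )
        elif dep_type == sli_type:
            # Rule 3: an SLI can't depend on an SLI of the same `type`.
            errors.append(f"an SLI of type '{sli_type}' cannot depend on the same type")
    return errors


catalog = {
    "web": {"workhorse", "imagescaler"},
    "patroni": {"rails_primary_sql"},
}

# Valid: `web` depending on a `patroni` component that exists.
print(validate_depends_on(catalog, "web", [{"component": "rails_primary_sql", "type": "patroni"}]))

# Invalid: `web` depending on another `web` component.
print(validate_depends_on(catalog, "web", [{"component": "imagescaler", "type": "web"}]))
```

Returning a list of errors rather than failing on the first one is a design choice for this sketch only; it lets a single run report every invalid `dependsOn` entry.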
    
    For example, if we apply the patch below:
    ```diff
    diff --git a/metrics-catalog/services/web.jsonnet b/metrics-catalog/services/web.jsonnet
    index a3edcb3f..f7a09765 100644
    --- a/metrics-catalog/services/web.jsonnet
    +++ b/metrics-catalog/services/web.jsonnet
    @@ -138,6 +138,13 @@ metricsCatalog.serviceDefinition({
             toolingLinks.sentry(slug='gitlab/gitlab-workhorse-gitlabcom'),
             toolingLinks.kibana(title='Workhorse', index='workhorse', type='web', slowRequestSeconds=10),
           ],
    +
    +      dependsOn: [
    +        {
    +          component: 'rails_primary_sql',
    +          type: 'patroni',
    +        },
    +      ],
         },
    
         imagescaler: {
    ```
    
    We'll get the following `inhibit_rules`:
    ```yaml
    inhibit_rules:
    - equal:
      - env
      - environment
      - pager
      source_matchers:
      - component="rails_primary_sql"
      - type="patroni"
      target_matchers:
      - component="workhorse"
      - type="web"
    ```
    
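    When a `rails_primary_sql` alert is firing, Alertmanager suppresses
    notifications for the `workhorse` SLO alert that carries the same `env`,
    `environment`, and `pager` label values.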
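
The mapping from a `dependsOn` entry to an inhibit rule can be sketched as follows. This is a hypothetical Python illustration rather than the repository's actual generator: the `inhibit_rules_for` function and its input shape are assumptions. The dependency becomes the source (the alert that inhibits) and the SLI declaring `dependsOn` becomes the target (the alert whose notifications are muted).

```python
# Hypothetical sketch: derive one Alertmanager inhibit rule per dependency.
# The function name and input shape are assumptions made for illustration.


def inhibit_rules_for(sli_type, component, depends_on):
    """Build `inhibit_rules` entries for an SLI with the given `dependsOn` list."""
    return [
        {
            # Label values that must be equal on the source and target alerts.
            "equal": ["env", "environment", "pager"],
            # The dependency: when it fires, it acts as the inhibiting source.
            "source_matchers": [
                f'component="{dep["component"]}"',
                f'type="{dep["type"]}"',
            ],
            # The dependent SLI: its notifications are suppressed.
            "target_matchers": [
                f'component="{component}"',
                f'type="{sli_type}"',
            ],
        }
        for dep in depends_on
    ]


rules = inhibit_rules_for(
    "web",
    "workhorse",
    [{"component": "rails_primary_sql", "type": "patroni"}],
)
print(rules)
```

For the `workhorse` example this produces a single rule with the same `equal`, `source_matchers`, and `target_matchers` values as the YAML shown in the commit message.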
    
    Why
    ---
    When a service like `patroni` violates its SLO, upstream services that
    depend on it, like `web`, `api`, and `git`, also end up violating theirs.
    It doesn't make sense to page/alert on the `web` service if the
    `patroni` service is already firing.
    
    Benchmarks
    ---
    This adds some validation logic that looks up each service for
    `dependsOn`. With the patch above, we see a negligible performance hit
    when generating the Alertmanager configuration: the mean goes from
    4.230 s to 4.277 s, roughly 1% slower. These benchmarks were run using
    [hyperfine](https://github.com/sharkdp/hyperfine).
    
    Before:
    ```
    $ hyperfine --warmup 3 './alertmanager/generate.sh'
    Benchmark 1: ./alertmanager/generate.sh
      Time (mean ± σ):      4.230 s ±  0.029 s    [User: 5.632 s, System: 0.772 s]
      Range (min … max):    4.205 s …  4.305 s    10 runs
    ```
    
    After:
    ```
    $ hyperfine --warmup 3 './alertmanager/generate.sh'
    Benchmark 1: ./alertmanager/generate.sh
      Time (mean ± σ):      4.277 s ±  0.038 s    [User: 5.741 s, System: 0.732 s]
      Range (min … max):    4.212 s …  4.324 s    10 runs
    ```
    
    Reference: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15766
    Signed-off-by: Steve Azzopardi <sazzopardi@gitlab.com>