Automate alert manager configuration from service catalog

This MR is a precursor to gitlab-com/gl-infra/scalability#214

Closes gitlab-com/gl-infra/scalability#344

What does this do?

Following on from gitlab-com/gl-infra&234 (closed) and !2379 (merged), we now have a single alertmanager configuration, generated from jsonnet.

At present, however, the routing rules are not very consistent.

In !2546 (merged), we added tests to verify our alertmanager configuration. This was done without changing the configuration, but writing the tests made it clear how inconsistent that configuration was, as documented in comments such as this one: !2546 (diffs)

This MR attempts to add some consistency to our routing configuration:

  1. channel configurations are replaced with team configurations. Teams are configured in the service catalog.
    1. To add a slack channel for a team, add the slack_alerts_channel attribute to the team in https://gitlab.com/gitlab-com/runbooks/blob/master/services/service-catalog.yml
    2. Any alerts matching on the team label (eg team=gitaly) will then have duplicates sent to the respective team's alerting channel.
  2. Alerts which are configured with teams are validated to ensure that the team exists in the service catalog. At present, many of the channels configured for alerts are non-existent.
  3. Slack receivers in the routing configuration are automatically generated from service catalog teams.
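
The three points above can be sketched in Python. This is only an illustration of the idea, not the jsonnet in this MR: the shape of the TEAMS entries, the "package" team, the helper names, and the assumption that slack_alerts_channel is stored without the leading "#" are all hypothetical. The receiver naming (team_<name>_alerts_channel) and the route shape (env=gprd, team=<name>, continue: true) are taken from the generated configuration shown further down.

```python
# Hypothetical, trimmed-down stand-in for the teams section of the service catalog.
TEAMS = [
    {"name": "gitaly", "slack_alerts_channel": "gitaly-alerts"},
    {"name": "verify", "slack_alerts_channel": "alerts-ci-cd"},
    {"name": "package"},  # hypothetical team without a channel: nothing is generated
]


def receiver_name(team):
    """Receiver naming convention seen in the generated config."""
    return f"team_{team['name']}_alerts_channel"


def team_receivers(teams):
    """One Slack receiver per team that declares slack_alerts_channel."""
    return [
        {
            "name": receiver_name(team),
            "slack_configs": [{"channel": "#" + team["slack_alerts_channel"]}],
        }
        for team in teams
        if "slack_alerts_channel" in team
    ]


def team_routes(teams):
    """One route per team channel; continue=True lets later routes also match."""
    return [
        {
            "match": {"env": "gprd", "team": team["name"]},
            "receiver": receiver_name(team),
            "continue": True,
        }
        for team in teams
        if "slack_alerts_channel" in team
    ]


def unknown_teams(alert_rules, teams):
    """Team labels used by alert rules that are missing from the catalog."""
    known = {team["name"] for team in teams}
    used = {
        rule["labels"]["team"]
        for rule in alert_rules
        if "team" in rule.get("labels", {})
    }
    return sorted(used - known)
```

A validation step could then fail the build whenever unknown_teams(...) returns a non-empty list, which is one way to catch alerts that point at teams that do not exist.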

Routing tree

The routing tree now looks as follows:

Routing tree:
.
└── default-route  receiver: prod_alerts_slack_channel
    ├── {alertname="SnitchHeartBeat",env="ops"}  receiver: dead_mans_snitch_ops
    ├── {alertname="SnitchHeartBeat",env="gprd"}  receiver: dead_mans_snitch_gprd
    ├── {alertname="SnitchHeartBeat",env="gstg"}  receiver: dead_mans_snitch_gstg
    ├── {alertname="SnitchHeartBeat",env="pre"}  receiver: dead_mans_snitch_pre
    ├── {alertname="SnitchHeartBeat",env="testbed"}  receiver: dead_mans_snitch_testbed
    ├── {env="gprd",pager="issue",project="gitlab.com/gitlab-com/gl-infra/infrastructure"}  receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
    ├── {env="gprd",pager="issue",project="gitlab.com/gitlab-com/gl-infra/production"}  receiver: issue:gitlab.com/gitlab-com/gl-infra/production
    ├── {pager="pagerduty"}  continue: true  receiver: prod_pagerduty
    │   ├── {env="gstg"}  receiver: non_prod_pagerduty
    │   ├── {env="dr"}  receiver: non_prod_pagerduty
    │   ├── {env="pre"}  receiver: non_prod_pagerduty
    │   ├── {env="gprd",slo_alert="yes",stage="cny"}  receiver: slo_gprd_cny
    │   ├── {env="gprd",slo_alert="yes",stage="main"}  receiver: slo_gprd_main
    │   └── {env="gprd",slo_alert="yes",stage="main"}  receiver: slo_gprd_main
    ├── {rules_domain="general"}  continue: true  receiver: slack_bridge-nonprod
    │   └── {env="gprd"}  receiver: slack_bridge-prod
    ├── {env="gprd",team="gitaly"}  continue: true  receiver: team_gitaly_alerts_channel
    ├── {env="gprd",team="verify"}  continue: true  receiver: team_verify_alerts_channel
    ├── {env="pre"}  receiver: nonprod_alerts_slack_channel
    ├── {env="dr"}  receiver: nonprod_alerts_slack_channel
    ├── {env="gstg"}  receiver: nonprod_alerts_slack_channel
    └── {pager="pagerduty"}  receiver: production_slack_channel


Routing tree explained

The new routing tree works like this:

  1. If nothing else matches, all alerts will go to prod_alerts_slack_channel (#alerts)
    1. SnitchHeartBeat alerts are routed to the dead_mans_snitch receiver for their environment and routing terminates.
    2. Issue alerts are routed to GitLab issues and routing terminates.
    3. Pagerduty alerts are routed to Pagerduty and routing continues.
    4. Slackline alerts are routed to Slackline and routing continues.
    5. Team alerts are routed to team slack channels and routing continues.
    6. Non-production alerts are routed to #alerts-nonprod and routing terminates. Note that at present these alerts are being sent to #alerts.
    7. Pagerduty alerts (second match) are routed to the #production slack channel and routing terminates.
    8. Anything that remains will be routed to the #alerts slack channel and routing terminates.
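
To make the "continue" and "terminates" wording above concrete, here is a small Python model of how an alert walks a routing tree. It is a simplified sketch based on my understanding of Alertmanager's route matching (equality matchers only, no match_re or grouping), not Alertmanager's own code, and the function names are made up for this example. ROOT is an abbreviated copy of the tree in this MR with the snitch, issue and Slackline routes left out.

```python
def matches(route, labels):
    """True when every `match` entry equals the alert's label value.

    A label missing from the alert is treated as the empty string.
    """
    return all(labels.get(k, "") == v for k, v in route.get("match", {}).items())


def resolve(route, labels):
    """Return the receiver names an alert reaches, in routing order."""
    if not matches(route, labels):
        return []
    receivers = []
    for child in route.get("routes", []):
        found = resolve(child, labels)
        receivers.extend(found)
        # A matching child without `continue: true` stops the sibling scan.
        if found and not child.get("continue", False):
            break
    # No child matched: the alert is delivered to this node's own receiver.
    return receivers or [route["receiver"]]


# Abbreviated copy of the routing tree (snitch, issue and Slackline routes omitted).
ROOT = {
    "receiver": "prod_alerts_slack_channel",
    "routes": [
        {
            "match": {"pager": "pagerduty"},
            "continue": True,
            "receiver": "prod_pagerduty",
            "routes": [
                {"match": {"env": "gstg"}, "receiver": "non_prod_pagerduty"},
            ],
        },
        {
            "match": {"env": "gprd", "team": "gitaly"},
            "continue": True,
            "receiver": "team_gitaly_alerts_channel",
        },
        {"match": {"env": "gstg"}, "receiver": "nonprod_alerts_slack_channel"},
        {"match": {"pager": "pagerduty"}, "receiver": "production_slack_channel"},
    ],
}

paging_gitaly = resolve(ROOT, {"env": "gprd", "pager": "pagerduty", "team": "gitaly"})
paging_staging = resolve(ROOT, {"env": "gstg", "pager": "pagerduty"})
unlabelled = resolve(ROOT, {"env": "gprd"})
```

Under this model, paging_gitaly is prod_pagerduty, team_gitaly_alerts_channel and production_slack_channel, paging_staging is non_prod_pagerduty and nonprod_alerts_slack_channel, and unlabelled falls through to prod_alerts_slack_channel.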

How can I be sure it works?

We have tests for our alertmanager config now. Review the tests as a way of ensuring the rules are doing what you would expect.

The tests are in alertmanager/routing-tests.jsonnet

Configuration changes from master

--- alertmanager-master.yml	2020-07-20 19:40:49.000000000 +0200
+++ ./alertmanager.yml	2020-07-20 19:40:49.000000000 +0200
@@ -56,7 +56,7 @@
       note: '{{ template "slack.text" . }}'
     send_resolved: true
     service_key: secret
-- name: main_alerts_channel
+- name: prod_alerts_slack_channel
   slack_configs:
   - channel: "#alerts"
     color: '{{ template "slack.color" . }}'
@@ -65,7 +65,7 @@
     text: '{{ template "slack.text" . }}'
     title: '{{ template "slack.title" . }}'
     title_link: '{{ template "slack.link" . }}'
-- name: pager_alerts_channel
+- name: production_slack_channel
   slack_configs:
   - channel: "#production"
     color: '{{ template "slack.color" . }}'
@@ -74,52 +74,16 @@
     text: '{{ template "slack.text" . }}'
     title: '{{ template "slack.title" . }}'
     title_link: '{{ template "slack.link" . }}'
-- name: ci-cd_alerts_channel
+- name: nonprod_alerts_slack_channel
   slack_configs:
-  - channel: "#alerts-ci-cd"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: ci-cd_low_priority_alerts_channel
-  slack_configs:
-  - channel: "#alerts-ci-cd"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: database_alerts_channel
-  slack_configs:
-  - channel: "#database"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: database_low_priority_alerts_channel
-  slack_configs:
-  - channel: "#database"
+  - channel: "#alerts-nonprod"
     color: '{{ template "slack.color" . }}'
     icon_emoji: '{{ template "slack.icon" . }}'
     send_resolved: true
     text: '{{ template "slack.text" . }}'
     title: '{{ template "slack.title" . }}'
     title_link: '{{ template "slack.link" . }}'
-- name: gitaly_alerts_channel
-  slack_configs:
-  - channel: "#g_gitaly"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: gitaly_low_priority_alerts_channel
+- name: team_gitaly_alerts_channel
   slack_configs:
   - channel: "#gitaly-alerts"
     color: '{{ template "slack.color" . }}'
@@ -128,27 +92,9 @@
     text: '{{ template "slack.text" . }}'
     title: '{{ template "slack.title" . }}'
     title_link: '{{ template "slack.link" . }}'
-- name: observability_alerts_channel
+- name: team_verify_alerts_channel
   slack_configs:
-  - channel: "#observability"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: observability_low_priority_alerts_channel
-  slack_configs:
-  - channel: "#observability"
-    color: '{{ template "slack.color" . }}'
-    icon_emoji: '{{ template "slack.icon" . }}'
-    send_resolved: true
-    text: '{{ template "slack.text" . }}'
-    title: '{{ template "slack.title" . }}'
-    title_link: '{{ template "slack.link" . }}'
-- name: slack_alerts_general
-  slack_configs:
-  - channel: "#alerts-gen-svc-test"
+  - channel: "#alerts-ci-cd"
     color: '{{ template "slack.color" . }}'
     icon_emoji: '{{ template "slack.icon" . }}'
     send_resolved: true
@@ -205,13 +151,7 @@
     send_resolved: true
     url: https://gitlab.com/gitlab-com/gl-infra/production/prometheus/alerts/notify.json
 route:
-  group_by:
-  - env
-  - tier
-  - type
-  - alertname
-  - stage
-  receiver: main_alerts_channel
+  receiver: prod_alerts_slack_channel
   repeat_interval: 8h
   routes:
   - continue: false
@@ -254,200 +194,93 @@
       env: testbed
     receiver: dead_mans_snitch_testbed
     repeat_interval: 5m
-  - continue: true
-    group_by:
-    - env
-    - tier
-    - type
-    - alertname
-    - stage
-    match:
-      rules_domain: general
-    receiver: slack_bridge-nonprod
-    routes:
-    - continue: true
-      match:
-        env: gprd
-      receiver: slack_bridge-prod
   - continue: false
-    group_by:
-    - env
-    - tier
-    - type
-    - alertname
-    - stage
-    match:
-      pager: ''
-      rules_domain: general
-    receiver: slack_alerts_general
-  - continue: true
-    group_by:
-    - env
-    - tier
-    - type
-    - alertname
-    - stage
-    match:
-      pager: pagerduty
-      rules_domain: general
-    receiver: slack_alerts_general
-  - continue: true
-    group_by:
-    - env
-    - alertname
-    - instance
-    - job
-    - stage
-    match:
-      channel: ci-cd
-    receiver: ci-cd_alerts_channel
-    routes:
-    - continue: false
-      match:
-        severity: warn
-      receiver: ci-cd_low_priority_alerts_channel
-    - continue: false
-      match:
-        severity: error
-      receiver: ci-cd_alerts_channel
-  - continue: true
-    group_by:
-    - env
-    - alertname
-    - instance
-    - job
-    - stage
-    match:
-      channel: database
-    receiver: database_alerts_channel
-    routes:
-    - continue: false
-      match:
-        severity: warn
-      receiver: database_low_priority_alerts_channel
-    - continue: false
-      match:
-        severity: error
-      receiver: database_alerts_channel
-  - continue: true
-    group_by:
-    - env
-    - alertname
-    - instance
-    - job
-    - stage
-    match:
-      channel: gitaly
-    receiver: gitaly_alerts_channel
-    routes:
-    - continue: false
-      match:
-        severity: warn
-      receiver: gitaly_low_priority_alerts_channel
-    - continue: false
-      match:
-        severity: error
-      receiver: gitaly_alerts_channel
-  - continue: true
-    group_by:
-    - env
-    - alertname
-    - instance
-    - job
-    - stage
-    match:
-      channel: observability
-    receiver: observability_alerts_channel
-    routes:
-    - continue: false
-      match:
-        severity: warn
-      receiver: observability_low_priority_alerts_channel
-    - continue: false
-      match:
-        severity: error
-      receiver: observability_alerts_channel
-  - continue: true
-    group_by:
-    - env
-    - alertname
-    - stage
     group_interval: 1h
     group_wait: 10m
     match:
+      env: gprd
       pager: issue
+      project: gitlab.com/gitlab-com/gl-infra/infrastructure
+    receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
     repeat_interval: 3d
-    routes:
-    - match:
-        project: gitlab.com/gitlab-com/gl-infra/infrastructure
-      receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
-    - match:
-        project: gitlab.com/gitlab-com/gl-infra/production
-      receiver: issue:gitlab.com/gitlab-com/gl-infra/production
   - continue: false
+    group_interval: 1h
+    group_wait: 10m
+    match:
+      env: gprd
+      pager: issue
+      project: gitlab.com/gitlab-com/gl-infra/production
+    receiver: issue:gitlab.com/gitlab-com/gl-infra/production
+    repeat_interval: 3d
+  - continue: true
     match:
       pager: pagerduty
+    receiver: prod_pagerduty
     routes:
     - continue: false
-      group_by:
-      - env
-      - alertname
-      - stage
-      match_re:
-        env: gstg|dr|pre
+      match:
+        env: gstg
       receiver: non_prod_pagerduty
     - continue: false
       match:
         env: dr
-        slo_alert: 'yes'
-      receiver: slo_dr
-    - continue: true
+      receiver: non_prod_pagerduty
+    - continue: false
+      match:
+        env: pre
+      receiver: non_prod_pagerduty
+    - continue: false
       match:
         env: gprd
         slo_alert: 'yes'
         stage: cny
       receiver: slo_gprd_cny
-    - continue: true
+    - continue: false
       match:
         env: gprd
         slo_alert: 'yes'
         stage: main
       receiver: slo_gprd_main
     - continue: false
-      group_by:
-      - env
-      - tier
-      - type
-      - alertname
-      - stage
       match:
+        env: gprd
         slo_alert: 'yes'
-      receiver: pager_alerts_channel
+        stage: main
+      receiver: slo_gprd_main
+  - continue: true
+    match:
+      rules_domain: general
+    receiver: slack_bridge-nonprod
+    routes:
     - continue: false
       match:
-        slo_alert: 'yes'
-      receiver: slo_non_prod
-    - continue: true
-      group_by:
-      - env
-      - alertname
-      - stage
-      receiver: prod_pagerduty
-    - continue: true
-      group_by:
-      - env
-      - tier
-      - type
-      - alertname
-      - stage
-      receiver: pager_alerts_channel
-  - continue: false
-    group_by:
-    - env
-    - tier
-    - type
-    - alertname
-    - stage
-    receiver: main_alerts_channel
+        env: gprd
+      receiver: slack_bridge-prod
+  - continue: true
+    match:
+      env: gprd
+      team: gitaly
+    receiver: team_gitaly_alerts_channel
+  - continue: true
+    match:
+      env: gprd
+      team: verify
+    receiver: team_verify_alerts_channel
+  - continue: false
+    match:
+      env: pre
+    receiver: nonprod_alerts_slack_channel
+  - continue: false
+    match:
+      env: dr
+    receiver: nonprod_alerts_slack_channel
+  - continue: false
+    match:
+      env: gstg
+    receiver: nonprod_alerts_slack_channel
+  - continue: false
+    match:
+      pager: pagerduty
+    receiver: production_slack_channel
 templates:
 - "/etc/alertmanager/templates/*.tmpl"

cc @brentnewton

Edited by Andrew Newdigate
