Automate alert manager configuration from service catalog
This MR is a precursor to gitlab-com/gl-infra/scalability#214
Closes gitlab-com/gl-infra/scalability#344
What does this do?
Following on from gitlab-com/gl-infra&234 (closed) and !2379 (merged), we now have a single alertmanager configuration, generated from jsonnet.
At present, however, the routing rules are not very consistent.
In !2546 (merged), we added tests to verify our alertmanager configuration. That was done without changing the configuration, but it made clear how little consistency the configuration has, as documented in comments such as this one: !2546 (diffs)
This MR attempts to add some consistency to our routing configuration.
- `channel` configurations are replaced with `team` configurations. Teams are configured in the service catalog.
  - To add a slack channel for a team, add the `slack_alerts_channel` attribute to the team in https://gitlab.com/gitlab-com/runbooks/blob/master/services/service-catalog.yml
  - Any teams matching on the team label (eg `team=gitaly`) will then have duplicate alerts sent to their respective team alerting channels.
- Alerts which are configured with teams are validated to ensure that the team exists in the service catalog. At present, many of the channels configured for alerts are non-existent.
- Slack receivers in the routing configuration are automatically generated from service catalog teams.
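
As a rough illustration of the three points above, here is a minimal Python sketch. The real configuration is generated from jsonnet, so this is not the MR's implementation. The team entries, the helper names, and the assumption that `slack_alerts_channel` is stored without a leading `#` are all hypothetical; only the `team_<name>_alerts_channel` receiver names and the `env=gprd` team routes are taken from the generated output shown further down.

```python
# Illustrative sketch only: the real configuration is generated from jsonnet.
# TEAMS stands in for the team list in the service catalog (hypothetical data).
TEAMS = [
    {"name": "gitaly", "slack_alerts_channel": "gitaly-alerts"},
    {"name": "verify", "slack_alerts_channel": "alerts-ci-cd"},
    {"name": "no_channel_team"},  # a team without a Slack alerts channel
]


def team_receiver_name(team_name):
    """Receiver naming scheme seen in the generated config."""
    return f"team_{team_name}_alerts_channel"


def slack_receivers(teams):
    """One Slack receiver per team that declares slack_alerts_channel."""
    return [
        {
            "name": team_receiver_name(t["name"]),
            # Assumption: the catalog stores the channel without the '#'.
            "slack_configs": [{"channel": "#" + t["slack_alerts_channel"]}],
        }
        for t in teams
        if "slack_alerts_channel" in t
    ]


def team_routes(teams):
    """One route per such team; continue=True so later routes still match."""
    return [
        {
            "match": {"env": "gprd", "team": t["name"]},
            "receiver": team_receiver_name(t["name"]),
            "continue": True,
        }
        for t in teams
        if "slack_alerts_channel" in t
    ]


def validate_alert_team(alert_labels, teams):
    """Raise if an alert's team label names a team missing from the catalog."""
    team = alert_labels.get("team")
    known = {t["name"] for t in teams}
    if team is not None and team not in known:
        raise ValueError(f"unknown team: {team}")
```

In this sketch a team without `slack_alerts_channel` simply gets no receiver and no route, while an alert that names an unknown team fails validation instead of being routed to a channel that does not exist.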
Routing tree
The routing tree now looks as follows:
Routing tree:
.
└── default-route receiver: prod_alerts_slack_channel
    ├── {alertname="SnitchHeartBeat",env="ops"} receiver: dead_mans_snitch_ops
    ├── {alertname="SnitchHeartBeat",env="gprd"} receiver: dead_mans_snitch_gprd
    ├── {alertname="SnitchHeartBeat",env="gstg"} receiver: dead_mans_snitch_gstg
    ├── {alertname="SnitchHeartBeat",env="pre"} receiver: dead_mans_snitch_pre
    ├── {alertname="SnitchHeartBeat",env="testbed"} receiver: dead_mans_snitch_testbed
    ├── {env="gprd",pager="issue",project="gitlab.com/gitlab-com/gl-infra/infrastructure"} receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
    ├── {env="gprd",pager="issue",project="gitlab.com/gitlab-com/gl-infra/production"} receiver: issue:gitlab.com/gitlab-com/gl-infra/production
    ├── {pager="pagerduty"} continue: true receiver: prod_pagerduty
    │   ├── {env="gstg"} receiver: non_prod_pagerduty
    │   ├── {env="dr"} receiver: non_prod_pagerduty
    │   ├── {env="pre"} receiver: non_prod_pagerduty
    │   ├── {env="gprd",slo_alert="yes",stage="cny"} receiver: slo_gprd_cny
    │   ├── {env="gprd",slo_alert="yes",stage="main"} receiver: slo_gprd_main
    │   └── {env="gprd",slo_alert="yes",stage="main"} receiver: slo_gprd_main
    ├── {rules_domain="general"} continue: true receiver: slack_bridge-nonprod
    │   └── {env="gprd"} receiver: slack_bridge-prod
    ├── {env="gprd",team="gitaly"} continue: true receiver: team_gitaly_alerts_channel
    ├── {env="gprd",team="verify"} continue: true receiver: team_verify_alerts_channel
    ├── {env="pre"} receiver: nonprod_alerts_slack_channel
    ├── {env="dr"} receiver: nonprod_alerts_slack_channel
    ├── {env="gstg"} receiver: nonprod_alerts_slack_channel
    └── {pager="pagerduty"} receiver: production_slack_channel
Routing tree explained
The new routing tree works like this:
- If nothing else matches, all alerts will go to prod_alerts_slack_channel (`#alerts`)
  - `SnitchHeartBeat` alerts are routed to dead_mans_snitch and routing terminates.
  - Issue alerts are routed to GitLab issues and routing terminates.
  - Pagerduty alerts are routed to Pagerduty and continue
  - Slackline alerts are routed to Slackline and continue
  - Team alerts are routed to team slack channels and continue
  - Non production alerts are routed to (`#alerts-nonprod`) and routing terminates. Note that at present these alerts are being sent to `#alerts`
  - Pagerduty alerts (second match) are routed to the `#production` slack channel and routing terminates
  - Anything that remains will be routed to the `#alerts` slack channel and routing terminates
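
The "terminates" versus "continue" behaviour described above can be sketched in a few lines of Python. This is a simplified model of how Alertmanager walks a routing tree (equality matchers only, no `match_re`), not code from this MR. `TREE` is a hand-trimmed subset of the routing tree shown earlier, and `match_routes` and `receivers_for` are hypothetical helper names.

```python
# Simplified model of Alertmanager route matching (equality matchers only).
# TREE is a trimmed, hand-written subset of the routing tree shown earlier.
TREE = {
    "receiver": "prod_alerts_slack_channel",
    "routes": [
        {"match": {"alertname": "SnitchHeartBeat", "env": "gprd"},
         "receiver": "dead_mans_snitch_gprd"},
        {"match": {"pager": "pagerduty"}, "continue": True,
         "receiver": "prod_pagerduty",
         "routes": [{"match": {"env": "gstg"}, "receiver": "non_prod_pagerduty"}]},
        {"match": {"rules_domain": "general"}, "continue": True,
         "receiver": "slack_bridge-nonprod",
         "routes": [{"match": {"env": "gprd"}, "receiver": "slack_bridge-prod"}]},
        {"match": {"env": "gprd", "team": "gitaly"}, "continue": True,
         "receiver": "team_gitaly_alerts_channel"},
        {"match": {"env": "gstg"}, "receiver": "nonprod_alerts_slack_channel"},
        {"match": {"pager": "pagerduty"}, "receiver": "production_slack_channel"},
    ],
}


def match_routes(route, labels):
    """Return the routes that would receive an alert with these labels."""
    # A missing label is treated as the empty string.
    for key, value in route.get("match", {}).items():
        if labels.get(key, "") != value:
            return []
    matched = []
    for child in route.get("routes", []):
        child_matches = match_routes(child, labels)
        matched.extend(child_matches)
        # Stop at the first matching child unless it sets continue: true.
        if child_matches and not child.get("continue", False):
            break
    # If no child matched, the current route itself is the match.
    return matched or [route]


def receivers_for(labels):
    """Receiver names for an alert, in the order the routes matched."""
    return [r["receiver"] for r in match_routes(TREE, labels)]
```

With this trimmed tree, `receivers_for({"env": "gprd", "team": "gitaly", "pager": "pagerduty"})` yields `prod_pagerduty`, `team_gitaly_alerts_channel` and `production_slack_channel`, in that order: the first two routes continue, and the second Pagerduty match terminates routing. An alert with only `env=gprd` matches no child route and falls back to `prod_alerts_slack_channel`.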
How can I be sure it works?
We now have tests for our alertmanager config. Reviewing them is a good way to confirm that the rules do what you would expect.
The tests are in alertmanager/routing-tests.jsonnet
Configuration changes from master
--- alertmanager-master.yml 2020-07-20 19:40:49.000000000 +0200
+++ ./alertmanager.yml 2020-07-20 19:40:49.000000000 +0200
@@ -56,7 +56,7 @@
note: '{{ template "slack.text" . }}'
send_resolved: true
service_key: secret
-- name: main_alerts_channel
+- name: prod_alerts_slack_channel
slack_configs:
- channel: "#alerts"
color: '{{ template "slack.color" . }}'
@@ -65,7 +65,7 @@
text: '{{ template "slack.text" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
-- name: pager_alerts_channel
+- name: production_slack_channel
slack_configs:
- channel: "#production"
color: '{{ template "slack.color" . }}'
@@ -74,52 +74,16 @@
text: '{{ template "slack.text" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
-- name: ci-cd_alerts_channel
+- name: nonprod_alerts_slack_channel
slack_configs:
- - channel: "#alerts-ci-cd"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: ci-cd_low_priority_alerts_channel
- slack_configs:
- - channel: "#alerts-ci-cd"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: database_alerts_channel
- slack_configs:
- - channel: "#database"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: database_low_priority_alerts_channel
- slack_configs:
- - channel: "#database"
+ - channel: "#alerts-nonprod"
color: '{{ template "slack.color" . }}'
icon_emoji: '{{ template "slack.icon" . }}'
send_resolved: true
text: '{{ template "slack.text" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
-- name: gitaly_alerts_channel
- slack_configs:
- - channel: "#g_gitaly"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: gitaly_low_priority_alerts_channel
+- name: team_gitaly_alerts_channel
slack_configs:
- channel: "#gitaly-alerts"
color: '{{ template "slack.color" . }}'
@@ -128,27 +92,9 @@
text: '{{ template "slack.text" . }}'
title: '{{ template "slack.title" . }}'
title_link: '{{ template "slack.link" . }}'
-- name: observability_alerts_channel
+- name: team_verify_alerts_channel
slack_configs:
- - channel: "#observability"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: observability_low_priority_alerts_channel
- slack_configs:
- - channel: "#observability"
- color: '{{ template "slack.color" . }}'
- icon_emoji: '{{ template "slack.icon" . }}'
- send_resolved: true
- text: '{{ template "slack.text" . }}'
- title: '{{ template "slack.title" . }}'
- title_link: '{{ template "slack.link" . }}'
-- name: slack_alerts_general
- slack_configs:
- - channel: "#alerts-gen-svc-test"
+ - channel: "#alerts-ci-cd"
color: '{{ template "slack.color" . }}'
icon_emoji: '{{ template "slack.icon" . }}'
send_resolved: true
@@ -205,13 +151,7 @@
send_resolved: true
url: https://gitlab.com/gitlab-com/gl-infra/production/prometheus/alerts/notify.json
route:
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- receiver: main_alerts_channel
+ receiver: prod_alerts_slack_channel
repeat_interval: 8h
routes:
- continue: false
@@ -254,200 +194,93 @@
env: testbed
receiver: dead_mans_snitch_testbed
repeat_interval: 5m
- - continue: true
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- match:
- rules_domain: general
- receiver: slack_bridge-nonprod
- routes:
- - continue: true
- match:
- env: gprd
- receiver: slack_bridge-prod
- continue: false
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- match:
- pager: ''
- rules_domain: general
- receiver: slack_alerts_general
- - continue: true
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- match:
- pager: pagerduty
- rules_domain: general
- receiver: slack_alerts_general
- - continue: true
- group_by:
- - env
- - alertname
- - instance
- - job
- - stage
- match:
- channel: ci-cd
- receiver: ci-cd_alerts_channel
- routes:
- - continue: false
- match:
- severity: warn
- receiver: ci-cd_low_priority_alerts_channel
- - continue: false
- match:
- severity: error
- receiver: ci-cd_alerts_channel
- - continue: true
- group_by:
- - env
- - alertname
- - instance
- - job
- - stage
- match:
- channel: database
- receiver: database_alerts_channel
- routes:
- - continue: false
- match:
- severity: warn
- receiver: database_low_priority_alerts_channel
- - continue: false
- match:
- severity: error
- receiver: database_alerts_channel
- - continue: true
- group_by:
- - env
- - alertname
- - instance
- - job
- - stage
- match:
- channel: gitaly
- receiver: gitaly_alerts_channel
- routes:
- - continue: false
- match:
- severity: warn
- receiver: gitaly_low_priority_alerts_channel
- - continue: false
- match:
- severity: error
- receiver: gitaly_alerts_channel
- - continue: true
- group_by:
- - env
- - alertname
- - instance
- - job
- - stage
- match:
- channel: observability
- receiver: observability_alerts_channel
- routes:
- - continue: false
- match:
- severity: warn
- receiver: observability_low_priority_alerts_channel
- - continue: false
- match:
- severity: error
- receiver: observability_alerts_channel
- - continue: true
- group_by:
- - env
- - alertname
- - stage
group_interval: 1h
group_wait: 10m
match:
+ env: gprd
pager: issue
+ project: gitlab.com/gitlab-com/gl-infra/infrastructure
+ receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
repeat_interval: 3d
- routes:
- - match:
- project: gitlab.com/gitlab-com/gl-infra/infrastructure
- receiver: issue:gitlab.com/gitlab-com/gl-infra/infrastructure
- - match:
- project: gitlab.com/gitlab-com/gl-infra/production
- receiver: issue:gitlab.com/gitlab-com/gl-infra/production
- continue: false
+ group_interval: 1h
+ group_wait: 10m
+ match:
+ env: gprd
+ pager: issue
+ project: gitlab.com/gitlab-com/gl-infra/production
+ receiver: issue:gitlab.com/gitlab-com/gl-infra/production
+ repeat_interval: 3d
+ - continue: true
match:
pager: pagerduty
+ receiver: prod_pagerduty
routes:
- continue: false
- group_by:
- - env
- - alertname
- - stage
- match_re:
- env: gstg|dr|pre
+ match:
+ env: gstg
receiver: non_prod_pagerduty
- continue: false
match:
env: dr
- slo_alert: 'yes'
- receiver: slo_dr
- - continue: true
+ receiver: non_prod_pagerduty
+ - continue: false
+ match:
+ env: pre
+ receiver: non_prod_pagerduty
+ - continue: false
match:
env: gprd
slo_alert: 'yes'
stage: cny
receiver: slo_gprd_cny
- - continue: true
+ - continue: false
match:
env: gprd
slo_alert: 'yes'
stage: main
receiver: slo_gprd_main
- continue: false
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
match:
+ env: gprd
slo_alert: 'yes'
- receiver: pager_alerts_channel
+ stage: main
+ receiver: slo_gprd_main
+ - continue: true
+ match:
+ rules_domain: general
+ receiver: slack_bridge-nonprod
+ routes:
- continue: false
match:
- slo_alert: 'yes'
- receiver: slo_non_prod
- - continue: true
- group_by:
- - env
- - alertname
- - stage
- receiver: prod_pagerduty
- - continue: true
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- receiver: pager_alerts_channel
- - continue: false
- group_by:
- - env
- - tier
- - type
- - alertname
- - stage
- receiver: main_alerts_channel
+ env: gprd
+ receiver: slack_bridge-prod
+ - continue: true
+ match:
+ env: gprd
+ team: gitaly
+ receiver: team_gitaly_alerts_channel
+ - continue: true
+ match:
+ env: gprd
+ team: verify
+ receiver: team_verify_alerts_channel
+ - continue: false
+ match:
+ env: pre
+ receiver: nonprod_alerts_slack_channel
+ - continue: false
+ match:
+ env: dr
+ receiver: nonprod_alerts_slack_channel
+ - continue: false
+ match:
+ env: gstg
+ receiver: nonprod_alerts_slack_channel
+ - continue: false
+ match:
+ pager: pagerduty
+ receiver: production_slack_channel
templates:
- "/etc/alertmanager/templates/*.tmpl"
cc @brentnewton
