Alert Management
Summary
We currently rely on an incident workflow that combines Slack, Prometheus Alertmanager(s), PagerDuty, Slackline, and GitLab. Although each of these tools works well on its own, together they make it difficult to determine the status of incidents: when alert volume is high, identifying and isolating an incident from the noise is a challenge.
Single Source of Truth
In keeping with our values, GitLab issues should be the single source of truth for incident status. In the event GitLab itself is unavailable, we require backup options.
Unified Interface
Ideally, there is a single interface in Slack that allows us to communicate back into the incident issue.
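As a starting point, here is a minimal sketch of what such a bridge could look like, assuming a Slack slash command posting into GitLab via the notes API. The command name, environment variables, and message format are illustrative assumptions, not a settled design:

```python
# Sketch of a Slack -> GitLab bridge: a slash command (e.g. /incident-note)
# that appends a comment to an incident issue. Names below are illustrative.
import os

import requests
from flask import Flask, request

app = Flask(__name__)

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = os.environ["INCIDENT_PROJECT_ID"]  # hypothetical env var
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]       # token with `api` scope


@app.route("/slack/incident-note", methods=["POST"])
def incident_note():
    # Slack delivers slash-command payloads as form-encoded POSTs.
    # Assumed text convention (ours, not Slack's): "<issue_iid> <comment>"
    issue_iid, _, comment = request.form["text"].partition(" ")
    user = request.form["user_name"]

    # Post the comment back into the incident issue via the GitLab notes API.
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT_ID}/issues/{issue_iid}/notes",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        data={"body": f"(from Slack, by @{user}) {comment}"},
    )
    resp.raise_for_status()
    return f"Added note to issue #{issue_iid}", 200
```

Keeping the write path pointed at the GitLab issue preserves it as the single source of truth; Slack remains a front end rather than a second record of the incident.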
Action Items
- outline the use cases for each Slack channel
- remove the PagerDuty app from posting in the #production channel
- consolidate the Alertmanager and Slackline posts into #alerts-general
- create an issue for automating the communications workflow (possibly by augmenting Slackline, see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5359)
- consolidate any duplicate "howto" runbooks in the gitlab-com/runbooks repository
- consolidate handbook pages
- set an SLO for communications updates in various channels
- automate the creation of an RCA issue when an incident is closed (see the sketch after this list)
- set up tools with a flag for dry-running the workflow in Slack, as in the sketch below (Do we need a test Slack account?)
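For the last two items, a minimal sketch of what the automation could look like, assuming a GitLab issue-event webhook drives it; the webhook path, label names, and DRY_RUN switch are illustrative assumptions:

```python
# Sketch of RCA-issue automation with a dry-run flag, driven by a GitLab
# issue webhook. Paths, labels, and env vars below are illustrative.
import os

import requests
from flask import Flask, request

app = Flask(__name__)

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = os.environ["INCIDENT_PROJECT_ID"]  # hypothetical env var
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]
DRY_RUN = os.environ.get("DRY_RUN") == "1"      # exercise the flow without side effects


@app.route("/webhooks/gitlab", methods=["POST"])
def on_issue_event():
    event = request.get_json()
    attrs = event.get("object_attributes", {})

    # GitLab issue webhooks report state changes via object_attributes.action.
    if event.get("object_kind") == "issue" and attrs.get("action") == "close":
        title = f"RCA: {attrs['title']}"
        if DRY_RUN:
            app.logger.info("DRY RUN: would create issue %r", title)
        else:
            requests.post(
                f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
                headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
                data={
                    "title": title,
                    "labels": "RCA",
                    "description": f"RCA for incident #{attrs['iid']}.",
                },
            ).raise_for_status()
    return "", 204
```

Driving this from GitLab's own webhook keeps the issue as the source of truth: the RCA issue is created by the same event that closes the incident, and the dry-run flag lets us rehearse the workflow end to end without creating issues or posting to Slack.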