Alert Management
Summary
We currently rely on an incident workflow that combines Slack, Prometheus Alertmanager(s), PagerDuty, Slackline, and GitLab. Although each of these tools works well on its own, together they make it difficult to determine the status of incidents: when alert volume is high, identifying and isolating an incident from the noise is a challenge.
Single Source of Truth
In keeping with our values, GitLab issues should be the single source of truth for incident status. In the event GitLab itself is unavailable, we require backup options.
Unified Interface
Ideally, there is a single interface in Slack that allows us to communicate back into the incident issue.
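As a starting point, here is a minimal sketch of what such a bridge could look like, assuming a Slack slash command posting into GitLab via the notes API. The command name, environment variables, and message format are illustrative assumptions, not a settled design:

```python
# Sketch of a Slack -> GitLab bridge: a slash command (e.g. /incident-note)
# that appends a comment to an incident issue. Names below are illustrative.
import os

import requests
from flask import Flask, request

app = Flask(__name__)

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = os.environ["INCIDENT_PROJECT_ID"]  # hypothetical env var
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]       # token with `api` scope


@app.route("/slack/incident-note", methods=["POST"])
def incident_note():
    # Slack delivers slash-command payloads as form-encoded POSTs.
    # Assumed text convention (ours, not Slack's): "<issue_iid> <comment>"
    issue_iid, _, comment = request.form["text"].partition(" ")
    user = request.form["user_name"]

    # Post the comment back into the incident issue via the GitLab notes API.
    resp = requests.post(
        f"{GITLAB_API}/projects/{PROJECT_ID}/issues/{issue_iid}/notes",
        headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
        data={"body": f"(from Slack, by @{user}) {comment}"},
    )
    resp.raise_for_status()
    return f"Added note to issue #{issue_iid}", 200
```

Keeping the write path pointed at the GitLab issue preserves it as the single source of truth; Slack remains a front end rather than a second record of the incident.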
Action Items
- outline the use cases for each Slack channel
- remove the PagerDuty app from posting in the #production channel
- consolidate the Alertmanager and Slackline posts into #alerts-general
- create an issue for automating the communications workflow (possibly by augmenting Slackline, see https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5359)
- consolidate any duplicate "howto" runbooks in the gitlab-com/runbooks repository
- consolidate handbook pages
- set an SLO for communications updates in various channels
- automate the creation of an RCA issue when an incident is closed (see the sketch after this list)
- set up tools with a flag for dry-running the workflow in Slack, as in the sketch below (Do we need a test Slack account?)
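For the last two items, a minimal sketch of what the automation could look like, assuming a GitLab issue-event webhook drives it; the webhook path, label names, and DRY_RUN switch are illustrative assumptions:

```python
# Sketch of RCA-issue automation with a dry-run flag, driven by a GitLab
# issue webhook. Paths, labels, and env vars below are illustrative.
import os

import requests
from flask import Flask, request

app = Flask(__name__)

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT_ID = os.environ["INCIDENT_PROJECT_ID"]  # hypothetical env var
GITLAB_TOKEN = os.environ["GITLAB_TOKEN"]
DRY_RUN = os.environ.get("DRY_RUN") == "1"      # exercise the flow without side effects


@app.route("/webhooks/gitlab", methods=["POST"])
def on_issue_event():
    event = request.get_json()
    attrs = event.get("object_attributes", {})

    # GitLab issue webhooks report state changes via object_attributes.action.
    if event.get("object_kind") == "issue" and attrs.get("action") == "close":
        title = f"RCA: {attrs['title']}"
        if DRY_RUN:
            app.logger.info("DRY RUN: would create issue %r", title)
        else:
            requests.post(
                f"{GITLAB_API}/projects/{PROJECT_ID}/issues",
                headers={"PRIVATE-TOKEN": GITLAB_TOKEN},
                data={
                    "title": title,
                    "labels": "RCA",
                    "description": f"RCA for incident #{attrs['iid']}.",
                },
            ).raise_for_status()
    return "", 204
```

Driving this from GitLab's own webhook keeps the issue as the source of truth: the RCA issue is created by the same event that closes the incident, and the dry-run flag lets us rehearse the workflow end to end without creating issues or posting to Slack.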