Notification mechanism for observability issue alerts

Split out of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11571.

From that issue:

Nice To Have: each of the Incident issues is assigned to @sre-observability

Rather than jump straight to a solution, let's consider the problem:

  1. Alertmanager currently opens issues in https://gitlab.com/gitlab-com/gl-infra/production/issues for SLI breaches on services that the service catalog attributes to the observability team.
  2. Non-generic alerts (those defined outside the metrics-catalog) can also open issues in the production or infrastructure issue trackers, if their incident_project label selects one of them.
  3. This is orthogonal to whether an alert pages: most SLI breaches will also page the on-call, and any non-generic alert can be made paging.
  4. Today we have one on-call who handles the alerts we've deemed urgent (paging), even if they end up assigning the consequential work to a team. Customarily, the on-call opens an issue with a Slack command, which creates a duplicate if Alertmanager has already opened one.
  5. Non-urgent alerts that open incident issues still need to be attended to at some point. Today, I doubt any team member subscribes to the production issue tracker: the signal:noise ratio of doing so would be far too low.

Solution 1: Observability incident issue tracker

Create a new GitLab project for observability incidents, and change our team's issue tracker in the service catalog to this.

  1. SREs in the o11y team could subscribe to this project. The whole team would then get an async, non-paging notification that there is work to be done, and someone can assign themselves (or the engineering manager can assign someone).
  2. We would still maintain a separate issue tracker for project work, even if one day we split out of the infra tracker. We need to keep the signal:noise ratio high in any tracker the whole team is expected to subscribe to, and getting notifications for every piece of project work would not help that.

Solution 2: Auto-assign o11y incident issues to the whole team

  1. Would likely require extending the GitLab Alertmanager webhook receiver to support assignment.
  2. It's not clear that Alertmanager's webhook configuration is expressive enough to pass the necessary information to a receiver.
  3. Assignment to everyone can devolve into assignment to no-one.
  4. To avoid duplicating work, someone would have to express their "claiming" of the issue by unassigning everyone else.
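
To make the first two points concrete, here is a minimal sketch of what assignment support in a receiver might look like. It is not the existing receiver's code. It assumes the alert carries a `team` label and that the receiver holds a hypothetical `TEAM_ASSIGNEES` map of placeholder user IDs; neither is confirmed here. `commonLabels` is a field of the standard Alertmanager webhook payload, and `assignee_ids` is a parameter of the GitLab issue-creation API.

```python
from __future__ import annotations

from typing import Any

# Hypothetical mapping from a `team` label value to GitLab user IDs.
# The label name and the IDs are placeholders, not real configuration.
TEAM_ASSIGNEES: dict[str, list[int]] = {
    "sre-observability": [101, 102, 103],
}


def issue_params(payload: dict[str, Any]) -> dict[str, Any]:
    """Build issue-creation parameters from an Alertmanager webhook payload.

    Only builds the parameter dict; sending it to GitLab is left out.
    """
    labels = payload.get("commonLabels", {})
    params: dict[str, Any] = {
        "title": labels.get("alertname", "Unknown alert"),
    }
    # Assign the whole team only when the alert names a team we know about.
    assignees = TEAM_ASSIGNEES.get(labels.get("team", ""), [])
    if assignees:
        params["assignee_ids"] = assignees
    return params
```

Note that the team-to-user mapping has to live in the receiver (or be looked up from group membership), because the webhook payload only carries labels and annotations. That is the expressiveness concern in point 2.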

Comparing the two solutions, I favor solution 1. Discussion welcome!

Paging incident issues

Additionally, we could add a link to the incident issue in Slack alert notifications using the incident_project label. That would help the SRE on-call find issues that have already been created for some alerts, and avoid duplication.
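
As a rough sketch of how such a link could be derived: the notification does not know the exact issue URL, so this example links to a search of open issues in the relevant tracker instead. It assumes the incident_project label value is a full project path (for example `gitlab-com/gl-infra/production`) and that incident issue titles contain the alert name. Both are assumptions, and `incident_search_url` is a hypothetical helper.

```python
from __future__ import annotations

from typing import Optional
from urllib.parse import quote, urlencode

GITLAB_HOST = "https://gitlab.com"


def incident_search_url(labels: dict[str, str]) -> Optional[str]:
    """Return a link to open issues matching the alert, or None.

    Assumes `incident_project` holds a full project path; this is a
    guess about the label's format, not confirmed behaviour.
    """
    project = labels.get("incident_project")
    alertname = labels.get("alertname")
    if not project or not alertname:
        return None
    # Search open issues in the tracker for the alert name.
    query = urlencode({"state": "opened", "search": alertname})
    return f"{GITLAB_HOST}/{quote(project)}/-/issues?{query}"
```

In practice the same URL could probably be built directly in the Slack notification template; the Python version just makes the logic easy to read and test.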


@gitlab-com/gl-infra/sre-observability wdyt?