Notification mechanism for observability issue alerts
Split out of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11571.
From that issue:
Nice To Have: each of the Incident issues is assigned to @sre-observability
Rather than jump straight to a solution, let's consider the problem:
- Alertmanager will currently open issues in https://gitlab.com/gitlab-com/gl-infra/production/issues for SLI breaches that are related to the observability team in the service catalog.
- Non-generic alerts (outside of the metrics-catalog) can also open issues in the production or infrastructure issue trackers, if they are labelled as such by incident_project.
- This is orthogonal to paging-ness: most SLI breaches will also page the on-call, and any non-generic alert can be made paging.
- Today, we have one on-call who handles the alerts we've deemed urgent (paging), even if they end up assigning the consequential work to a team. Customarily, the on-call opens an issue with a Slack command, which creates a duplicate if Alertmanager has already opened one.
- Non-urgent alerts that open incident issues still need to be attended to at some point. Today, I doubt any team member subscribes to the production issue tracker. The signal:noise of such a move would be far too low.
Solution 1: Observability incident issue tracker
Create a new GitLab project for observability incidents, and change our team's issue tracker in the service catalog to this.
- SREs in the o11y team could subscribe to this project. This causes the whole team to get an async, non-paging notification that there is work to be done, and someone can assign themselves (or the engineering manager can).
- We would still maintain a separate issue tracker for project work, even if one day we split out of the infra tracker. We need to keep the signal:noise high in any tracker the whole team is expected to subscribe to, and getting notifications for every piece of project work would not help that.
Solution 2: Auto-assign o11y incident issues to the whole team
- Would likely require extending the GitLab alertmanager webhook receiver to support assignment.
- It's not clear that Alertmanager's webhook configuration is expressive enough to pass the necessary information to a receiver.
- Assignment to everyone can devolve into assignment to no-one.
- To avoid duplicating work, someone would have to express their "claiming" of the issue by unassigning everyone else.
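To make the "pass the necessary information" question concrete, here is a minimal sketch of one way a receiver could learn who to assign. It assumes the assignees are carried in an alert annotation, which Alertmanager forwards in the `commonAnnotations` field of its webhook payload. The annotation name `incident_assignees` and the function name are hypothetical; nothing here reflects how the existing receiver works.

```python
from typing import Any, Dict, List, Optional


def assignees_from_payload(
    payload: Dict[str, Any], default: Optional[List[str]] = None
) -> List[str]:
    """Return the usernames an incident issue could be assigned to.

    Reads a comma-separated list from the hypothetical
    `incident_assignees` annotation shared by all alerts in the group.
    """
    annotations = payload.get("commonAnnotations") or {}
    raw = annotations.get("incident_assignees", "")
    # Tolerate "@alice, bob" style values: trim spaces and leading "@".
    names = [part.strip().lstrip("@") for part in raw.split(",") if part.strip()]
    # Fall back to the caller's default when the annotation is absent or empty.
    return names or list(default or [])
```

Even if this works, it only solves the plumbing; the "assigned to everyone means assigned to no-one" concern still applies.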
Comparing the 2 solutions, I favor solution 1. Discussion welcome!
Paging incident issues
Additionally, we could add a link to the incident issue in Slack alert notifications, using the incident_project label. That would help the SRE on-call find issues that have already been created for an alert, and avoid duplication.
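As a rough illustration of that link, here is a sketch that builds a URL from an alert's labels. It rests on two assumptions: that incident_project holds a host-plus-path value such as `gitlab.com/gitlab-com/gl-infra/production`, and that a search link into the tracker is an acceptable stand-in, since the notification would not know the exact issue URL. The function name is hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def incident_search_url(labels: Dict[str, str]) -> Optional[str]:
    """Build a link to open issues matching the alert name.

    Returns None when the alert has no incident_project label, i.e. when
    no incident issue would have been opened for it.
    """
    project = labels.get("incident_project")
    if not project:
        return None
    # Assumed label format: "<host>/<group>/<project>".
    query = urlencode({"state": "opened", "search": labels.get("alertname", "")})
    return f"https://{quote(project.strip('/'))}/-/issues?{query}"
```

For an alert labelled `alertname=HighErrorRate` and the production tracker above, this yields a link to the open issues in that project that match `HighErrorRate`.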
@gitlab-com/gl-infra/sre-observability wdyt?