Proposal that all automatically created first-class citizen incidents also be automatically marked as resolved whenever possible
## Summary statement

All automatically created GitLab first-class citizen `incidents` should also be automatically marked as resolved as early as possible, based on the resolution of the triggering metric criteria.

## Terminology

Here is an example of what I am referring to as a GitLab first-class citizen `incident` page: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4437

I consider the `Summary`, `Metrics`, and `Alert details` tabs to be the UI that distinguishes it from a normal GitLab `issue`.

Further examples:

![Screen_Shot_2021-05-06_at_9.00.29_AM](/uploads/a079852cabdc0d620735409716b72176/Screen_Shot_2021-05-06_at_9.00.29_AM.png)

Source: https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name%5B%5D=Incident%3A%3AActive

## Problem to solve

In the above screen capture, 8 out of 10 GitLab `incidents` were created automatically. It may very well be the case that all 8 auto-opened `incidents` are in fact still active and require a human to triage. However, it is my understanding that such `incidents` will be left open until a human has had a chance to examine the `incident` details, decide whether or not to close it, and then manually mark it with `/label ~"Incident::Resolved"` and also manually close the issue.

This can leave such `incidents` open for days longer than they are relevant, creating very long, very cluttered SRE on-call handover issues, and making it difficult to get a quick sense of which incidents are actually open and relevant.

## Proposed solution

This issue proposes that:

- Given that there are metrics which trigger the creation of such an `incident`,
- It seems reasonable that there should be metrics which would automatically add a `/label ~"Incident::Resolved"` comment, or even close the incident.

## Further considerations

One caveat to this proposal is that an incident should not be automatically resolved until a human has at least *looked* at the issue.
This requires some way of marking an automatically created `incident` as having been read. This should be easy to do by either:

- automatically marking the `incident` as `read` whenever the incident is requested by a logged-in user, or even a specific user, such as the EOC; or
- requiring that a human manually mark the incident as read, either by adding a `/label ~"Incident::Acknowledged"` comment, or by clicking an `Acknowledge` button on the first-class citizen `incident` page.

Consider this UI component of the GitLab `incident` page:

![Screen_Shot_2021-05-06_at_9.13.19_AM](/uploads/12bbd80fbcd86a5a0b64fb14ba00f535/Screen_Shot_2021-05-06_at_9.13.19_AM.png)

Right next to the `Close incident` button seems like a good place to put an `Acknowledge` button.

I'd also like to see all such first-class GitLab `incidents` announced in the `#production` Slack channel, just like pages from PagerDuty, so that the EOC can easily click an `Acknowledge` button on the Slack post itself, instead of having to click through to the GitLab `incident` page.

## Acceptance criteria

This proposal can be considered accepted if it drives to completion and delivery additional features in the GitLab `incident` user interface and backend such that:

1. The specified metrics that trigger an `incident` creation event also trigger a closing or resolving event when the relevant metric "un-crosses" the threshold that originally triggered the creation event.
1. If necessary, an `Acknowledge` button is available for a human to press, which applies a label or other state mechanism.
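
To make the two criteria above concrete, here is a minimal sketch of the decision logic I have in mind. It is not an existing GitLab API: the `Incident` class, `should_auto_resolve`, `evaluate`, and the `close_on_resolve` flag are all hypothetical names, and it assumes the alert fires when the metric is at or above its threshold (so "un-crossing" means dropping below it).

```python
from dataclasses import dataclass, field

# Label names taken from the quick actions mentioned in this issue.
RESOLVED_LABEL = "Incident::Resolved"
ACKNOWLEDGED_LABEL = "Incident::Acknowledged"


@dataclass
class Incident:
    """Hypothetical, simplified stand-in for an auto-created incident."""
    threshold: float                          # value that triggered creation
    labels: set = field(default_factory=set)  # labels currently applied
    state: str = "opened"                     # "opened" or "closed"


def should_auto_resolve(incident: Incident, current_value: float) -> bool:
    """True only if the incident is open, a human has acknowledged it,
    and the metric is back below the triggering threshold.

    Assumption: the alert fires when the metric is at or above the threshold.
    """
    acknowledged = ACKNOWLEDGED_LABEL in incident.labels
    recovered = current_value < incident.threshold
    return incident.state == "opened" and acknowledged and recovered


def evaluate(incident: Incident, current_value: float,
             close_on_resolve: bool = False) -> Incident:
    """Apply the resolved label (and optionally close) when appropriate."""
    if should_auto_resolve(incident, current_value):
        incident.labels.add(RESOLVED_LABEL)
        if close_on_resolve:
            incident.state = "closed"
    return incident
```

With this shape, an unacknowledged incident is never touched, which covers the caveat under "Further considerations". Whether to only label the incident or also close it stays a configuration choice (`close_on_resolve` here).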