Proposal that all automatically created first-class citizen incidents also be automatically marked as resolved whenever possible

Summary statement

Consider that all automatically created GitLab first-class citizen incidents ought to also be automatically marked as resolved as early as possible, based on the resolution of the triggering metric criteria.

Terminology

Here is an example of what I am referring to as a GitLab first-class citizen incident page:

production#4437 (closed)

I consider the Summary, Metrics, and Alert details tabs to be distinguishing UI compared to a normal GitLab issue.

Further examples:

Screen_Shot_2021-05-06_at_9.00.29_AM

Source: https://gitlab.com/gitlab-com/gl-infra/production/-/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name%5B%5D=Incident%3A%3AActive

Problem to solve

In the above screen capture, 8 out of 10 GitLab incidents were created automatically.

It may very well be the case that all 8 auto-opened incidents are in fact still active and require a human to triage. However, it is my understanding that such incidents are left open until a human has had a chance to examine the incident details, decide whether or not to close the incident, manually apply /label ~"Incident::Resolved", and also manually close the issue. This can leave such incidents open for days longer than they are relevant, which creates very long, very cluttered SRE on-call handover issues and makes it difficult to get a quick sense of which incidents are actually open and relevant.

Proposed solution

This issue proposes that:

  • Given that there are metrics which trigger the creation of such an incident,
  • It seems reasonable that there should be metrics which would automatically add a /label ~"Incident::Resolved" comment, or even close the incident.
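As a rough illustration of the second bullet, here is a minimal sketch in Python of what an automated resolver might post once the triggering alert has recovered. The function name `resolution_comment` and its parameters are hypothetical and not part of any existing GitLab component; the label is the one named in this issue, and `/close` is the standard GitLab quick action for closing an issue.

```python
from typing import Optional

RESOLVED_LABEL = "Incident::Resolved"  # label named in this proposal


def resolution_comment(alert_recovered: bool, also_close: bool = False) -> Optional[str]:
    """Return the quick-action comment to post on an incident, or None.

    Hypothetical helper: it only builds the comment text and does not
    talk to GitLab.
    """
    if not alert_recovered:
        # The triggering criteria still hold, so leave the incident alone.
        return None
    lines = [f'/label ~"{RESOLVED_LABEL}"']
    if also_close:
        # Optionally close the issue as well as labelling it.
        lines.append("/close")
    return "\n".join(lines)
```

Whether the resolver should only label the incident or also close it is left as a flag here, since the proposal leaves that choice open.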

Further considerations

One caveat to this proposal is that an incident should not be automatically resolved until a human has at least looked at the issue. This requires some way of marking an automatically created incident as having been read. This should be easy to do in one of two ways: either automatically mark the incident as read whenever the incident page is requested by a logged-in user (or by a specific user, such as the EOC), or require that a human manually mark the incident as read, by adding a /label ~"Incident::Acknowledged" comment or by clicking an Acknowledge button on the first-class citizen incident page.
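The acknowledgement gate described above could be sketched as follows. This is only an illustration under stated assumptions: the `Incident` class, its `viewed_by` field, and the function names are hypothetical, and the `Incident::Acknowledged` label is the one suggested in this issue rather than an existing label.

```python
from dataclasses import dataclass, field
from typing import Optional

ACK_LABEL = "Incident::Acknowledged"  # label suggested in this proposal


@dataclass
class Incident:
    """Hypothetical, simplified view of an incident issue."""
    auto_created: bool
    labels: set = field(default_factory=set)
    viewed_by: set = field(default_factory=set)  # usernames that opened the page


def is_acknowledged(incident: Incident, eoc: Optional[str] = None) -> bool:
    """True if a human has marked or viewed the incident."""
    if ACK_LABEL in incident.labels:
        return True  # manual acknowledgement via label or button
    if eoc is not None:
        return eoc in incident.viewed_by  # only a view by the EOC counts
    return bool(incident.viewed_by)  # any logged-in viewer counts


def may_auto_resolve(incident: Incident, alert_recovered: bool,
                     eoc: Optional[str] = None) -> bool:
    """Auto-resolve only auto-created, acknowledged, recovered incidents."""
    return (incident.auto_created
            and alert_recovered
            and is_acknowledged(incident, eoc))
```

Passing `eoc` corresponds to the "specific user" variant; omitting it corresponds to the "any logged-in user" variant.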

Consider this UI component of the GitLab incident page:

Screen_Shot_2021-05-06_at_9.13.19_AM

Right next to the Close incident button seems like a good place to put an Acknowledge button.

I'd also like to see all such first-class GitLab incidents announced in the #production Slack channel, just like pages from PagerDuty, so that the EOC may easily click an Acknowledge button on the Slack post itself instead of having to click through to the GitLab incident page.

Acceptance criteria

This proposal can be considered accepted if it leads to the delivery of additional features in the GitLab incident user interface and backend such that:

  1. The specified metrics that trigger an incident creation event also trigger a closing or resolving event when the relevant metric "un-crosses" the threshold that originally triggered the creation event.
  2. If necessary, an Acknowledge button is available for a human to press, which applies a label or other state mechanism.
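
The "un-crossing" behaviour in the first criterion can be illustrated with a small sketch. It assumes a single numeric metric compared against a fixed threshold and sampled at regular intervals; the function name `incident_events` and the event names are hypothetical. A real alerting rule would likely also require the condition to hold for some duration before firing or resolving, which is omitted here.

```python
def incident_events(samples, threshold):
    """Yield (index, event) pairs as a metric crosses and un-crosses a threshold.

    "create" is emitted when the metric first exceeds the threshold,
    and "resolve" when it next drops back to or below it.
    """
    firing = False
    for i, value in enumerate(samples):
        breached = value > threshold
        if breached and not firing:
            firing = True
            yield i, "create"
        elif firing and not breached:
            firing = False
            yield i, "resolve"


# Example: an error ratio sampled six times against a 0.5 threshold.
events = list(incident_events([0.1, 0.6, 0.7, 0.4, 0.2, 0.9], 0.5))
# events == [(1, "create"), (3, "resolve"), (5, "create")]
```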
Edited by Nels Nelson