
Incident issue is in closed state even though the incident is active, potentially causing production checks to be wrong

Two bots, incident.io and @gitlab-bot, fight over the Incident:: labels on incident issues once an incident is closed.

When the label Incident::Resolved is applied by incident.io, @gitlab-bot will close the incident issue. It further expects that this incident will not go back to the Incident::Active state. However, within incident.io, it is possible to go back to an active state!

When an incident goes from Resolved => Active, the following dance ensues:

  • incident.io applies Incident::Active on a closed issue
  • @gitlab-bot applies Incident::Resolved on the same issue immediately after 😅
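The sequence above can be replayed with a small model. This is only a sketch of the observed behaviour, not the actual Triage Ops rule or the incident.io integration: the `Issue` class and both functions are hypothetical names, and the bot logic is inferred from what happens in the issue timeline.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Hypothetical stand-in for a GitLab incident issue."""
    state: str = "opened"
    label: str = "Incident::Active"
    log: list = field(default_factory=list)


def incident_io_sets_label(issue, label):
    # Assumption: incident.io only swaps the scoped label and never
    # changes whether the issue is opened or closed.
    issue.label = label
    issue.log.append(("incident.io", issue.label, issue.state))


def gitlab_bot_runs(issue):
    # Assumption: models the observed @gitlab-bot behaviour, not the
    # real rule text in production/auto-close.yml.
    if issue.label == "Incident::Resolved" and issue.state == "opened":
        issue.state = "closed"
        issue.log.append(("@gitlab-bot", issue.label, issue.state))
    elif issue.state == "closed" and issue.label != "Incident::Resolved":
        issue.label = "Incident::Resolved"
        issue.log.append(("@gitlab-bot", issue.label, issue.state))


issue = Issue()
incident_io_sets_label(issue, "Incident::Resolved")  # incident resolved
gitlab_bot_runs(issue)                               # bot closes the issue
incident_io_sets_label(issue, "Incident::Active")    # Resolved => Active
gitlab_bot_runs(issue)                               # bot reverts the label

for actor, label, state in issue.log:
    print(f"{actor:12} label={label:20} state={state}")
```

In this model the issue ends up closed and labelled Incident::Resolved even though the last thing incident.io did was mark the incident active.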


@gitlab-bot is configured using Triage Ops, and the relevant rules are defined in production/auto-close.yml.

This has the potential to cause a real problem for release/tools because our production checks code checks whether there is an active incident that blocks deployments. This check only looks for "open" incidents:

https://gitlab.com/gitlab-org/release-tools/blob/fe06ca090a54ae76a74745a6ab2b046ca03d20e8/lib/release_tools/promotion/checks/production_issue_tracker.rb#L20-39

So the incident lead can set the incident back to an active state in incident.io (for example, Investigating) and select the incident.io property to block deployments, yet the incident will not be treated as a production check blocker because the GitLab incident issue is closed.
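The gap can be illustrated with a simplified model of such a check. This is not the release-tools code linked above; `blocking_incidents`, the dictionary fields, and the sample data are all hypothetical, and the only behaviour taken from this issue is that the check considers open issues only.

```python
def blocking_incidents(issues):
    """Hypothetical check: only issues in the opened state can block."""
    return [issue for issue in issues if issue["state"] == "opened"]


# Made-up data: the incident is active again in incident.io and flagged to
# block deployments, but the GitLab issue was closed by the bot.
issues = [
    {
        "iid": 1,
        "state": "closed",
        "labels": ["Incident::Resolved"],
        "incident_io_status": "Investigating",
        "blocks_deployments": True,
    },
]

blockers = blocking_incidents(issues)
print("deployment blocked" if blockers else "deployment allowed")
```

Because the filter runs on issue state before anything else, the active incident never reaches the check, and the deployment is allowed.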

Take the following incident as an example:

https://app.incident.io/gitlab/incidents/518 | production#19691 (closed)


Acceptance criteria

  • Only one bot should manage the Incident:: labels on GitLab incident issues
  • This bot should re-open incident issues in GitLab when an incident goes from Resolved => Active state
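One possible shape for these criteria is a single reconciler that derives both the label and the issue state from the incident.io status. This is a sketch only: `desired_issue` and `reconcile` are hypothetical names, and the mapping of every non-resolved status to Incident::Active is an assumption, not something either bot is known to do today.

```python
def desired_issue(incident_status):
    """Map an incident.io status to the label and state the issue should have.

    Assumption: any status other than "Resolved" counts as active.
    """
    resolved = incident_status == "Resolved"
    return {
        "label": "Incident::Resolved" if resolved else "Incident::Active",
        "state": "closed" if resolved else "opened",
    }


def reconcile(issue, incident_status):
    """Bring the issue in line with incident.io and return the actions taken."""
    want = desired_issue(incident_status)
    actions = []
    if issue["label"] != want["label"]:
        issue["label"] = want["label"]
        actions.append(f"set label {want['label']}")
    if issue["state"] != want["state"]:
        issue["state"] = want["state"]
        actions.append("reopen issue" if want["state"] == "opened" else "close issue")
    return actions


# Resolved => Active: the closed issue gets relabelled and reopened.
issue = {"label": "Incident::Resolved", "state": "closed"}
actions = reconcile(issue, "Investigating")
print(actions)
```

With one owner computing both fields from the same input, the label and the issue state cannot disagree, and running the reconciler again on an already consistent issue produces no actions.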
Edited by Siddharth Kannan