Reliability Managers automation - Bot to manage the Incident Board/Corrective Actions

Background

The Incident Management board maintained by the Reliability managers and director is usually cluttered, and bringing every incident issue to completion through the Incident Management process takes a good deal of follow-up and manual work. We have identified several simple actions that a GitLab bot could automate to keep the board tidy without manual intervention from the managers.

Initial ideas for the cleaning/automation activities we could implement:

  • Sev3s and Sev4s - move them to ~Incident::Review-completed or close them, unless a review is requested.
  • Sev2s - Remind the Reliability Mgr to ping our SREs to complete the write-up.
  • Remind the Reliability Mgr to root-cause the incident after a week.
  • Linked Corrective Actions: Assign severity, service and team.
    • Make sure the incident link is in the CA.
  • Fix titles not starting with a date ("YYYY-MM-DD").
  • More to come.

Possible tools to use:

Features:

1 - Move Sev3/4s to Resolved and then close

  • If no comments for 1 week: move to incident::Resolved.
  • If there are no comments and no Review-requested label for one additional week: close the incident.
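
The two rules above can be sketched as a small decision function. This is only an illustration, not the bot's actual code: the function name, its parameters, the exact spelling of the Review-requested label, and the idea of tracking when the incident was moved to Resolved (`resolved_at`) are all assumptions. It also assumes that a Review-requested label blocks both steps.

```python
from datetime import datetime, timedelta, timezone

QUIET_PERIOD = timedelta(weeks=1)
REVIEW_REQUESTED_LABEL = "Review-requested"  # assumed label spelling


def next_action(severity, labels, last_comment_at, now, resolved_at=None):
    """Return "resolve", "close" or "none" for a single incident.

    resolved_at is the (hypothetical) time the incident was moved to
    Resolved, or None if that has not happened yet.
    """
    # Only Sev3/Sev4 incidents are handled, and never when a review is requested.
    if severity not in (3, 4) or REVIEW_REQUESTED_LABEL in labels:
        return "none"
    if resolved_at is None:
        # Rule 1: one week without comments moves the incident to Resolved.
        return "resolve" if now - last_comment_at >= QUIET_PERIOD else "none"
    # Rule 2: one more quiet week, counted from the later of the move to
    # Resolved and the most recent comment, closes the incident.
    quiet_since = max(last_comment_at, resolved_at)
    return "close" if now - quiet_since >= QUIET_PERIOD else "none"
```

Counting the second week from `resolved_at` avoids closing an incident on the run immediately after it was moved to Resolved.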

STATUS: Done (see this limitation: triage-ops!59 (comment 512773971))


2 - Remind to write up Incident Review for Sev2s

  • Post a reminder on the incident issue, addressed to its IMOC, so they can chase the EOC to complete the write-up.
    • How do we get the Incident issue IMOC?
  • From there, the IMOC chases the EOC.

STATUS: Done, but unmerged (see this conversation ==> separate issue created to manage this feature, #54 (closed))


3 - Remind the IMOC to root-cause the incident, after a week

  • Write a report (a weekly issue) listing incidents with no root cause, no service, or no team attribution, grouped by IMOC.
  • Run these reports every Monday morning, assigning them to the corresponding IMOCs.
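
A minimal sketch of the grouping step, assuming each incident has already been fetched and flattened into a dictionary. The keys `imoc`, `title`, `root_cause`, `service` and `team` are hypothetical names chosen for illustration; how the bot actually reads these attributes is not described here.

```python
from collections import defaultdict

# Attributes an incident must have to stay out of the report (assumed keys).
REQUIRED_FIELDS = ("root_cause", "service", "team")


def build_reports(incidents):
    """Group incomplete incidents by IMOC.

    Returns {imoc: [(title, [missing fields]), ...]} and skips incidents
    that have every required field filled in.
    """
    reports = defaultdict(list)
    for incident in incidents:
        missing = [field for field in REQUIRED_FIELDS if not incident.get(field)]
        if missing:
            reports[incident["imoc"]].append((incident["title"], missing))
    return dict(reports)
```

Each key of the returned dictionary would then correspond to one weekly report issue assigned to that IMOC.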

STATUS: Done (only two minor corrections will come soon). Separate issue created to manage this, #53 (closed).


5 - (optional) Add dates to Incident Titles

  • Find titles that don't start with a date in "YYYY-MM-DD" format and prepend the date to them.
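
The title check could look roughly like the sketch below. Two details are assumptions, since the text above does not specify them: that the date to prepend is the incident's creation date, and that the date and the title are separated by ": ". The function name is hypothetical.

```python
import re
from datetime import date

# Matches a title that already starts with a YYYY-MM-DD date.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}\b")


def with_date_prefix(title, created_on):
    """Return the title unchanged if it starts with a date, else prepend one."""
    title = title.strip()
    if DATE_PREFIX.match(title):
        return title
    # Assumed convention: "<creation date>: <original title>".
    return f"{created_on.isoformat()}: {title}"
```

Because titles that already start with a date are returned as-is, running the bot repeatedly does not stack date prefixes.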

STATUS: Done


DEPRIORITIZED

4 - Complete linked Corrective Actions

  • Write a report (a weekly issue) listing CAs with no severity, no service, no team attribution, or no incident link in the issue, grouped by the IMOC of the related incident.
  • Run these reports twice a week, assigning them to the corresponding IMOCs: Friday 9am UTC (while they are still on call) and the following Monday 9am UTC (just after their on-call shift ends).
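
As an illustration of the twice-weekly schedule, the sketch below computes the next run time from a given moment. The function name is hypothetical; in practice the schedule would more likely be expressed as a cron-style pipeline schedule (for example `0 9 * * 1,5` evaluated in UTC) than computed in code.

```python
from datetime import datetime, timedelta, timezone

RUN_WEEKDAYS = (0, 4)  # Monday and Friday (datetime.weekday() numbering)
RUN_HOUR_UTC = 9


def next_run(now):
    """Return the first Monday or Friday 09:00 UTC strictly after `now`."""
    now = now.astimezone(timezone.utc)
    # Looking at today plus the next seven days always includes a future run.
    for offset in range(8):
        candidate = (now + timedelta(days=offset)).replace(
            hour=RUN_HOUR_UTC, minute=0, second=0, microsecond=0
        )
        if candidate.weekday() in RUN_WEEKDAYS and candidate > now:
            return candidate
    raise AssertionError("unreachable: a run always falls within eight days")
```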

STATUS: Separate issue created to manage this, #52 (closed).

Edited by Alberto Ramos