Reliability Managers automation - Bot to manage the Incident Board/Corrective Actions
Background
The Incident Management board that the Reliability managers and director maintain is usually cluttered, and bringing every incident issue to completion through the Incident Management process takes a lot of follow-up and manual work. We have identified several simple actions that a GitLab bot could automate to keep the board tidy without manual intervention from managers.
Initial ideas for the cleaning/automation activities we could implement:
- Sev3s and 4s - they can be moved to ~Incident::Review-completed or closed, unless review is requested.
- Sev2s - Remind the Reliability Mgr to ping our SREs to complete the write-up.
- Remind the Reliability Mgr to root-cause the incident after a week.
- Linked Corrective Actions: Assign severity, service and team.
- Make sure the incident link is in the CA.
- Fix titles not starting with a date ("YYYY-MM-DD").
- More to come.
Possible tools to use:
- Oncall-robot-assistant: https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant, with the schedules living here: https://ops.gitlab.net/gitlab-com/gl-infra/oncall-robot-assistant/-/pipeline_schedules.
- Triage-ops: https://gitlab.com/gitlab-com/gl-infra/triage-ops, another option and possibly a better fit.
- Helicopter: https://gitlab.com/gitlab-com/gl-infra/helicopter
- Account mgmt for all these bots: https://gitlab.com/ops-gitlab-net
Features:
1 - Move Sev3/4s to Resolved and then close
- If no comments for 1 week: move to incident::Resolved.
- If no comments/Review-requested label for an additional week: close the incident.
STATUS:
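The two-step rule for Sev3/4s can be sketched as a small decision function. This is only an illustration, not the bot's implementation: the `Incident` fields, the `next_action` name and the returned action strings are hypothetical, and the label strings are copied from this description rather than checked against the project's real labels. It assumes the second week is counted from whichever is later, the last comment or the moment the incident was marked resolved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Label names as written in this issue; the exact spelling is an assumption.
RESOLVED_LABEL = "incident::Resolved"
REVIEW_LABEL = "Review-requested"
STALE_AFTER = timedelta(weeks=1)


@dataclass
class Incident:
    """Hypothetical, simplified view of an incident issue."""
    severity: int
    labels: set
    last_comment_at: datetime
    resolved_at: Optional[datetime] = None  # when the resolved label was applied


def next_action(incident: Incident, now: datetime) -> Optional[str]:
    """Return "resolve", "close", or None if the bot should do nothing."""
    # Only Sev3/4s are handled, and a requested review blocks both steps.
    if incident.severity not in (3, 4) or REVIEW_LABEL in incident.labels:
        return None
    if RESOLVED_LABEL in incident.labels:
        # The extra week starts at the later of the last comment or the resolve.
        quiet_since = max(
            incident.last_comment_at,
            incident.resolved_at or incident.last_comment_at,
        )
        return "close" if now - quiet_since >= STALE_AFTER else None
    # Not yet resolved: one quiet week moves it to resolved.
    return "resolve" if now - incident.last_comment_at >= STALE_AFTER else None
```

Keeping the decision separate from any GitLab API calls would let the rule be tested without touching real issues.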
2 - Remind to write up Incident Review for Sev2s
- Post a reminder in the incident issue addressed to its IMOC, so they can chase the EOC to complete the write-up.
- How do we get the Incident issue IMOC?
- From there, the IMOC chases the EOC.
STATUS:
3 - Remind the IMOC to root-cause the incident, after a week
- Write a report (a weekly issue) listing incidents with no root cause, no service or no team attribution, grouped by IMOC.
- Run these reports every Monday morning, assigning them to the corresponding IMOCs.
STATUS:
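A minimal sketch of how the weekly report for this feature could be assembled, assuming the incidents have already been fetched and reduced to a simple record. The `IncidentRecord` fields and the function names are hypothetical, and how the IMOC is determined for each incident is left open, as in feature 2.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

CHECKED_FIELDS = ("root_cause", "service", "team")


@dataclass
class IncidentRecord:
    """Hypothetical record; real data would come from the incident issues."""
    title: str
    imoc: str
    root_cause: Optional[str] = None
    service: Optional[str] = None
    team: Optional[str] = None


def missing_fields(incident: IncidentRecord) -> list:
    """Names of the checked attributes that are empty or unset."""
    return [name for name in CHECKED_FIELDS if not getattr(incident, name)]


def weekly_report(incidents: list) -> dict:
    """Group incomplete incidents by IMOC, one report line per incident."""
    report = defaultdict(list)
    for incident in incidents:
        gaps = missing_fields(incident)
        if gaps:
            report[incident.imoc].append(
                f"{incident.title} (missing: {', '.join(gaps)})"
            )
    return dict(report)
```

Each key of the returned dictionary would map to one report section, or one issue, assigned to that IMOC.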
5 - (optional) Add dates to Incident Titles
- Find titles that don't follow the format "YYYY-MM-DD". Add that date string to the title.
STATUS: DEPRIORITIZED
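The title check could look like the sketch below. Two things are assumptions rather than decisions taken in this issue: the date used is the incident's creation date, and the date is joined to the existing title with ": ". The function name is hypothetical.

```python
import re
from datetime import date

# A title is considered compliant if it starts with a YYYY-MM-DD date.
DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}\b")


def ensure_date_prefix(title: str, created: date) -> str:
    """Return the title unchanged if it already starts with a date,
    otherwise prepend the creation date (assumed to be the right one)."""
    if DATE_PREFIX.match(title):
        return title
    return f"{created.isoformat()}: {title}"
```

For example, `ensure_date_prefix("DB outage", date(2024, 1, 15))` returns `"2024-01-15: DB outage"`, while a title that already starts with a date is left as it is.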
4 - Complete linked Corrective Actions
- Write a report (a weekly issue) listing CAs with no severity, no service, no team attribution or no incident link in the issue, grouped by the related incident's IMOC.
- Run these reports twice a week, assigning them to the corresponding IMOCs: Friday 9am UTC (they are still on call) and the following Monday 9am UTC (they have just finished their on-call shift).
STATUS: Separate issue created to manage this, #52 (closed).
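The completeness check for a Corrective Action could be sketched as follows. Everything named here is an assumption made for illustration: the scoped-label prefixes, the `INCIDENT_PROJECT` path (a placeholder, not the real incident tracker) and the idea that an incident link is a full issue URL in the CA description.

```python
import re

# Assumed scoped-label prefixes; the real label names may differ.
REQUIRED_LABEL_PREFIXES = {
    "severity": "severity::",
    "service": "Service::",
    "team": "team::",
}
# Placeholder project path, not the real incident tracker.
INCIDENT_PROJECT = "example-group/incidents"


def has_incident_link(description: str, project: str = INCIDENT_PROJECT) -> bool:
    """True if the description contains an issue URL in the incident project."""
    pattern = re.compile(
        rf"https://gitlab\.com/{re.escape(project)}/-/issues/\d+"
    )
    return bool(pattern.search(description))


def ca_gaps(labels: list, description: str) -> list:
    """List what a Corrective Action is still missing."""
    gaps = [
        name
        for name, prefix in REQUIRED_LABEL_PREFIXES.items()
        if not any(label.startswith(prefix) for label in labels)
    ]
    if not has_incident_link(description):
        gaps.append("incident link")
    return gaps
```

A CA with an empty result from `ca_gaps` would be left out of the report; the others would be listed under the IMOC of the related incident.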