Dogfooding: Use Incident Management in Broken Master Triage Flow

Problem to solve

We currently have a process for dealing with broken master pipelines that is owned by the Quality Department. The Engineering Productivity team is the triage DRI for monitoring, identifying, and communicating these issues.

See https://about.gitlab.com/handbook/engineering/workflow/#broken-master for more details.

What is the current process for the Broken Master Triage Flow?

graph TD
  A[Pipeline fails for master branch] -->|Using Slack Notifications Integration| B[#broken-master channel notification]
  B --> C[Engineering Productivity team reviews the failing pipeline]
  C --> D[Engineering Productivity team creates an issue with respective labels, identifies MR that introduced the failure]
  D --> E[Engineering Productivity team pings engineer to resolve]
  D --> F[Engineering Productivity team reverts the MR]
  D --> G[Engineering Productivity team creates a quick fix]
  E --> H[Pipeline returns to green]
  F --> H
  G --> H
  H --> I[Broken master issue is closed]

SLOs being monitored

Triage (Failure ==> Assigned; everything in the chart above): 4 hours
Resolution (Assigned ==> Closed): 4 hours
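
As a rough illustration of how these two SLOs could be checked for a single broken master issue, here is a minimal Python sketch. The function name, its parameters, and the choice to measure unfinished phases against the current time are assumptions made for this example, not part of the existing process.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TRIAGE_SLO = timedelta(hours=4)      # Failure ==> Assigned
RESOLUTION_SLO = timedelta(hours=4)  # Assigned ==> Closed


def slo_status(
    failed_at: datetime,
    assigned_at: Optional[datetime],
    closed_at: Optional[datetime],
    now: datetime,
) -> dict:
    """Report whether each SLO is met; open phases are measured against `now`."""
    triage_end = assigned_at or now
    result = {"triage_met": triage_end - failed_at <= TRIAGE_SLO}
    if assigned_at is None:
        # Resolution clock has not started yet.
        result["resolution_met"] = None
    else:
        resolution_end = closed_at or now
        result["resolution_met"] = resolution_end - assigned_at <= RESOLUTION_SLO
    return result


# Hypothetical timestamps: assigned after 3 hours, closed 5 hours after assignment.
t0 = datetime(2020, 9, 2, 0, 0, tzinfo=timezone.utc)
status = slo_status(t0, t0 + timedelta(hours=3), t0 + timedelta(hours=8), t0 + timedelta(hours=9))
print(status)
```

In this example the triage SLO is met (3 hours) and the resolution SLO is missed (5 hours).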

Proposal (what using incident management would look like)

graph TD
  A[Pipeline fails for master branch] --> AA[DIY serverless function relays the failure information]
  AA --> B[Generic alert endpoint configured for the GitLab project receives the alert]
  B -->|Triggers Slack Notification as new alert has been received| C[#broken-master channel notification]
  B -->|Automatically| D[Creates incident issue based on incoming alert with respective labels]
  C --> E[Engineering Productivity team reviews the failing pipeline and identifies MR that introduced the failure]
  E --> F[Engineering Productivity team pings engineer to resolve]
  E --> G[Engineering Productivity team reverts the MR]
  E --> H[Engineering Productivity team creates a quick fix]
  F --> I[Pipeline returns to green]
  G --> I
  H --> I
  I --> J[Incident issue is closed]

  style AA fill:#c3e6cd
  style D fill:#cbe2f9

Green: Something we need to build
Blue: Workflow that the team no longer needs to perform manually
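
The green box (the serverless relay) could look roughly like the Python sketch below. It assumes the function receives a GitLab pipeline webhook event and forwards a JSON alert to the project's generic alert endpoint using a bearer token. The function names, the `service` value, the fingerprint scheme, the pipeline URL path, and the endpoint URL and token passed in are all hypothetical placeholders; the real endpoint URL and authorization key come from the project's alert integration settings.

```python
import json
import urllib.request
from typing import Optional


def pipeline_event_to_alert(event: dict) -> Optional[dict]:
    """Map a pipeline webhook event to an alert payload, or None if it is not a failed master pipeline."""
    attrs = event.get("object_attributes", {})
    if event.get("object_kind") != "pipeline":
        return None
    if attrs.get("ref") != "master" or attrs.get("status") != "failed":
        return None
    pipeline_id = attrs.get("id")
    project_url = event.get("project", {}).get("web_url", "")
    return {
        "title": f"Broken master: pipeline {pipeline_id} failed",
        # Assumed URL layout for linking back to the pipeline.
        "description": f"Failing pipeline: {project_url}/-/pipelines/{pipeline_id}",
        "service": "master-pipeline",        # hypothetical service name
        "monitoring_tool": "GitLab CI",
        "severity": "high",
        # One fingerprint per pipeline so repeated events do not open duplicate alerts.
        "fingerprint": f"broken-master-{pipeline_id}",
    }


def build_alert_request(endpoint_url: str, auth_token: str, alert: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for the generic alert endpoint."""
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(alert).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

A serverless handler would call `pipeline_event_to_alert` on the incoming webhook body and, when the result is not `None`, pass the request from `build_alert_request` to `urllib.request.urlopen`. Keeping the mapping separate from the network call makes the filtering logic easy to test without a live endpoint.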

Benefits of using GitLab's incident management

No need to manually create an issue
The incident list can be used to prioritize and visualize whether broken master issues are meeting the SLO; no such list exists today (more info in gitlab-org/gitlab#241663 (closed))

Drawbacks of using GitLab's incident management

Changing the existing process
The incident UI will be slightly different from the issue UI
Need to build and host the serverless function (a DRI would need to be set; the Health team can build it, but long-term ownership is unclear)

Notable observations

Incidents use GitLab issues under the hood, so all existing issues API endpoints will still work for incidents
The Infrastructure team does not track incidents in the gitlab project (if gitlab.com goes down, so does the incident management tool, so they use ops.gitlab instead), so there are currently no collisions from using incidents in the gitlab project
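
Because incidents are issues underneath, existing tooling that queries the issues API could keep working. The Python sketch below builds a project issues API URL that lists open issues carrying an `incident` label. The helper name, the label value, and the example host are assumptions for illustration; the actual label depends on how the project's incident creation is configured.

```python
from urllib.parse import quote, urlencode


def incident_issues_url(
    base_url: str,
    project_path: str,
    label: str = "incident",  # assumed label applied to incident issues
    state: str = "opened",
) -> str:
    """Build the issues API URL for listing a project's incident issues."""
    # The project path must be URL-encoded when used as the project ID.
    project_id = quote(project_path, safe="")
    query = urlencode({"labels": label, "state": state})
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


# Hypothetical host; the project path is the one discussed above.
url = incident_issues_url("https://gitlab.example.com", "gitlab-org/gitlab")
print(url)
```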

Edited Sep 02, 2020 by Clement Ho