Dogfooding: Use Incident Management in Broken Master Triage Flow
Problem to solve
We currently have a process for dealing with broken master pipelines that is owned by the Quality Department. The Engineering Productivity team is the triage DRI for monitoring, identification and communication of these issues.
See https://about.gitlab.com/handbook/engineering/workflow/#broken-master for more details.
What is the current process for the Broken Master Triage Flow?
graph TD
A[Pipeline fails for master branch] -->|Using Slack Notifications Integration| B[#broken-master channel notification]
B --> C[Engineering Productivity team reviews the failing pipeline]
C --> D[Engineering Productivity team creates an issue with respective labels, identifies MR that introduced the failure]
D --> E[Engineering Productivity team pings engineer to resolve]
D --> F[Engineering Productivity team reverts the MR]
D --> G[Engineering Productivity team creates a quick fix]
E --> H[Pipeline returns back to green]
F --> H
G --> H
H --> I[Broken master issue is closed]
SLOs being monitored
- Triage (Failure ==> Assigned; everything in the chart above): 4 hours
- Resolution (Assigned ==> Closed): 4 hours
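To make the two SLO windows concrete, here is a minimal Python sketch of how they could be checked for a single broken master issue. The function name, its arguments, and treating "exactly 4 hours" as still within the SLO are assumptions for illustration; this is not an existing tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# SLO windows taken from the list above.
TRIAGE_SLO = timedelta(hours=4)      # Failure ==> Assigned
RESOLUTION_SLO = timedelta(hours=4)  # Assigned ==> Closed


def slo_status(
    failed_at: datetime,
    assigned_at: Optional[datetime],
    closed_at: Optional[datetime],
    now: datetime,
) -> dict:
    """Report whether each SLO is currently met (hypothetical helper).

    A stage that is still open is measured against `now`.
    `resolution` is None until the issue has been assigned.
    """
    triage_end = assigned_at or now
    triage_met = (triage_end - failed_at) <= TRIAGE_SLO

    if assigned_at is None:
        resolution_met = None
    else:
        resolution_end = closed_at or now
        resolution_met = (resolution_end - assigned_at) <= RESOLUTION_SLO

    return {"triage": triage_met, "resolution": resolution_met}
```

For example, an issue assigned one hour after the failure and closed five hours after assignment meets the triage SLO but misses the resolution SLO.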
Proposal (what using incident management would look like)
graph TD
A[Pipeline fails for master branch] --> AA[Using DIY serverless function, relay the information]
AA --> B[Generic alert endpoint configured to GitLab project receives the alert]
B -->|Triggers Slack Notification as new alert has been received| C[#broken-master channel notification]
B -->|Automatically| D[Creates incident issue based on incoming alert with respective labels]
C --> E[Engineering Productivity team reviews the failing pipeline and identifies MR that introduced the failure]
E --> F[Engineering Productivity team pings engineer to resolve]
E --> G[Engineering Productivity team reverts the MR]
E --> H[Engineering Productivity team creates a quick fix]
F --> I[Pipeline returns back to green]
G --> I
H --> I
I --> J[incident issue is closed]
style AA fill:#c3e6cd
style D fill:#cbe2f9
Green: Something we need to build
Blue: Workflow that the team no longer needs to perform manually
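The green box (the serverless relay) could look roughly like the Python sketch below. It assumes the function receives a GitLab pipeline webhook event and forwards failures on `master` to the project's generic alert endpoint. The payload field names (`title`, `description`, `service`, `monitoring_tool`, `severity`, `fingerprint`) and the bearer-token header follow my understanding of the generic alert integration and should be verified against the current documentation. The endpoint URL, the token, the function names, and the choice of `severity` and `fingerprint` values are placeholders, not decisions made in this issue.

```python
import json
import urllib.request
from typing import Optional


def build_alert_payload(event: dict, branch: str = "master") -> Optional[dict]:
    """Turn a pipeline webhook event into an alert payload.

    Returns None for events that should not raise an alert
    (not a pipeline event, other branch, or pipeline did not fail).
    """
    if event.get("object_kind") != "pipeline":
        return None
    attrs = event.get("object_attributes", {})
    if attrs.get("ref") != branch or attrs.get("status") != "failed":
        return None

    project = event.get("project", {})
    pipeline_url = f"{project.get('web_url', '')}/-/pipelines/{attrs.get('id')}"
    return {
        "title": f"Broken {branch}: pipeline #{attrs.get('id')} failed",
        "description": f"Failed pipeline: {pipeline_url}",
        "service": project.get("path_with_namespace"),
        "monitoring_tool": "GitLab CI",
        "severity": "high",  # assumed default, to be agreed on
        # Assumption: group repeated failures of the same commit into one alert.
        "fingerprint": f"broken-{branch}-{attrs.get('sha')}",
    }


def send_alert(endpoint_url: str, auth_token: str, payload: dict) -> int:
    """POST the payload to the alert endpoint and return the HTTP status."""
    request = urllib.request.Request(
        endpoint_url,  # placeholder: copied from the project's alert settings
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {auth_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Keeping the payload construction separate from the HTTP call means the filtering logic can be tested without network access, whichever serverless platform ends up hosting it.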
Benefits of using GitLab's incident management
- No need to manually create issue
- No list to prioritize/visualize whether the broken master issues are meeting the SLO (more info in gitlab-org/gitlab#241663 (closed))
Drawbacks of using GitLab's incident management
- Changing existing process
- The incident UI is slightly different from the issue UI
- Need to build and host the serverless function (a DRI would need to be assigned; the Health team can build it, but long-term ownership is unclear)
Notable observations
- Incidents use GitLab issues under the hood, so the existing issues API still works for incidents
- The Infrastructure team does not track incidents in the gitlab project (if gitlab.com goes down, so does the incident management tool, so they use ops.gitlab instead). As a result, there are currently no collisions with using incidents in the gitlab project
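Because incidents are issues under the hood, existing scripts that query the issues API should only need an extra filter to find them. The Python sketch below builds such a query URL. It assumes the issues API accepts an `issue_type=incident` filter, which should be confirmed for the GitLab version in use. The helper name and the `master:broken` label are illustrative placeholders.

```python
from urllib.parse import quote, urlencode


def incidents_url(base_url: str, project_path: str, labels=(), state: str = "opened") -> str:
    """Build an issues API URL that lists incidents for a project.

    Assumption: the issues API supports filtering by `issue_type`.
    """
    params = {"issue_type": "incident", "state": state}
    if labels:
        params["labels"] = ",".join(labels)
    # The project path must be URL-encoded when used in place of the numeric ID.
    encoded_project = quote(project_path, safe="")
    return f"{base_url}/api/v4/projects/{encoded_project}/issues?{urlencode(params)}"


# Hypothetical usage: open incidents carrying an illustrative label.
url = incidents_url("https://gitlab.com", "gitlab-org/gitlab", labels=["master:broken"])
```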
Edited by Clement Ho