Pipeline Observability First Iteration
Overview
The Distribution team wants a single view of all automatically created pipeline issues, similar to the existing merge request monitoring view.
Context
Over the course of several months, Distribution has held bi-weekly pipeline failure reviews to triage failures and determine their root causes.
The issue bot project was created to automatically open issues from pipeline failures, which preserves failure data better than Slack alerts do.
A larger spike to investigate pipeline observability is open but unscheduled. In the short term, we identified the following problems, each of which can be addressed with a small interim iteration:
- Awareness of pipeline failures outside of triage is reduced because failures are not announced in Slack
- Finding pipeline issues is more difficult because they are spread across multiple projects
- Pipeline issues grow stale because they are not easily tracked
Related: Spike: Pipeline Plans, Observability, and Goals (#945)
Proposal
- Create a page similar to our merge request monitoring that tracks pipeline failure issues
- Create a Slack alert, sent on a set schedule, that notifies the team about stale pipeline issues
Deliverables
- A page, similar to MR monitoring and requiring login, where Distribution can track pipeline failure issues
- Slack alert for pipeline failure issues that are older than a specified date threshold
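The stale-issue alert could be sketched roughly as follows. This is only an illustration under stated assumptions, not a design decision: the `PipelineIssue` record, the function names, and the choice to measure age from the issue creation date are all hypothetical, and fetching issues and posting the message are left out. The payload uses a single `text` field with Slack's `<url|label>` link syntax, which is the simplest shape a Slack incoming webhook accepts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PipelineIssue:
    """Hypothetical record for one automatically created pipeline failure issue."""
    title: str
    url: str
    project: str
    created_at: datetime


def find_stale_issues(issues, now, max_age):
    """Return issues created more than `max_age` before `now`, oldest first."""
    cutoff = now - max_age
    stale = [issue for issue in issues if issue.created_at < cutoff]
    return sorted(stale, key=lambda issue: issue.created_at)


def build_slack_payload(stale_issues, now):
    """Build a payload with a single `text` field, or None if nothing is stale."""
    if not stale_issues:
        return None
    lines = [f"{len(stale_issues)} stale pipeline failure issue(s):"]
    for issue in stale_issues:
        age_days = (now - issue.created_at).days
        lines.append(
            f"- <{issue.url}|{issue.title}> ({issue.project}, {age_days} days old)"
        )
    return {"text": "\n".join(lines)}


if __name__ == "__main__":
    # Made-up example data; the URLs and project names are placeholders.
    now = datetime.now(timezone.utc)
    issues = [
        PipelineIssue("Job timed out", "https://example.com/1", "project-a",
                      now - timedelta(days=21)),
        PipelineIssue("Flaky spec", "https://example.com/2", "project-b",
                      now - timedelta(days=2)),
    ]
    stale = find_stale_issues(issues, now, max_age=timedelta(days=14))
    print(build_slack_payload(stale, now))
```

Returning `None` when nothing is stale lets a scheduled job skip posting entirely, so the channel only receives a message when there is something to act on. Because issues from several projects are passed in as one list, the same sketch also reflects the goal of a single view across projects.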