Pipeline durations must be an SLI
One of the most common failure modes we see in incidents is pipelines not progressing through to completion.
This is an extremely common cause of S1/S2 incidents:
https://app.incident.io/gitlab/incidents/8169
https://app.incident.io/gitlab/incidents/8087
https://app.incident.io/gitlab/incidents/7794
https://app.incident.io/gitlab/incidents/6915
There are a number of causes of this, but in almost every case these incidents are reported by a customer (external or internal) rather than detected by us, because we do not have observability into the functionality of pipelines.
There's also a lot of history here:
https://gitlab.com/gitlab-org/gitlab/-/work_items/524857
However, even that issue is not adequate, because it doesn't include all of the Sidekiq details.
We need a set of SLIs that measures end-to-end latency for a pipeline and notifies the EOC. There are a LOT of steps in this, and this is absolutely a complicated problem to solve, but it's been years, it's still an issue, and we still get far too many S1s and S2s related to this.
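As a starting point for discussion, here is a minimal sketch of what an end-to-end latency SLI could look like. Everything in it is an assumption for illustration: the `PipelineSample` fields, the `end_to_end_sli` and `should_page` names, and the threshold and objective values are hypothetical and do not correspond to existing GitLab metrics or code. The one idea it tries to show is that a pipeline stuck past the threshold counts against the SLI even though it never emits a completion event, which is the case we currently only hear about from customers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PipelineSample:
    """One observed pipeline. Field names are hypothetical."""
    created_at: float             # seconds since epoch when the pipeline was created
    finished_at: Optional[float]  # None if the pipeline has not completed yet


def end_to_end_sli(samples: Sequence[PipelineSample], now: float,
                   threshold_s: float) -> float:
    """Fraction of pipelines that completed within threshold_s of creation.

    Pipelines still running past the threshold count as failures.
    Pipelines still running but younger than the threshold are skipped,
    because their outcome is not yet known.
    """
    good = 0
    total = 0
    for s in samples:
        if s.finished_at is not None:
            total += 1
            if s.finished_at - s.created_at <= threshold_s:
                good += 1
        elif now - s.created_at > threshold_s:
            # Stuck pipeline: no completion event, but already too slow.
            total += 1
    # With no decided samples there is nothing to alert on.
    return good / total if total else 1.0


def should_page(sli: float, objective: float) -> bool:
    """True when the SLI is below the objective and the EOC should be notified."""
    return sli < objective
```

A real implementation would need to break this down by step (including the Sidekiq stages) so that the EOC can see where pipelines are stalling, and would need to separate time spent in customer-defined jobs from time spent in our own infrastructure.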
_This ticket was created from_ [_INC-8169_](https://app.incident.io/gitlab/incidents/8169) _using_ [_incident.io_](https://app.incident.io) 🔥