Improve Pipeline Infrastructure Stability
Identify why users retry jobs in GitLab pipelines, and resolve the underlying causes to reduce user-visible failures.
AC:
- Provide metrics on how often jobs are retried by non-CKI users rather than by the pipeline-herder
- Alert when users retry jobs
- Improve alerts about failing jobs (recognize jobs that fail for similar reasons and escalate them to Sentry/Alertmanager)
- Update the documentation on how to convert the new alerts into pipeline-herder rules
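
As a starting point for the retry metric, here is a minimal sketch of how user-triggered retries could be counted for one pipeline. It assumes the job list was fetched with retried jobs included, that each job carries `id`, `name` and `user.username` fields as in the GitLab jobs API, and that a higher `id` means a later run. The `count_user_retries` helper and the `cki-bot` username are hypothetical placeholders, not existing CKI code.

```python
from collections import Counter, defaultdict

# Hypothetical bot account names; the real herder/CKI usernames
# are not stated in this epic and must be filled in.
BOT_USERNAMES = frozenset({"cki-bot"})


def count_user_retries(jobs, bot_usernames=BOT_USERNAMES):
    """Count retried jobs per non-bot user for a single pipeline.

    `jobs` is a list of dicts with at least `id` and `name`, plus an
    optional `user` dict containing `username`.
    """
    # Group all runs of the same job name together.
    by_name = defaultdict(list)
    for job in jobs:
        by_name[job["name"]].append(job)

    retries = Counter()
    for runs in by_name.values():
        # Assumption: job ids grow with creation order, so the first
        # run is the original and every later run is a retry.
        runs.sort(key=lambda job: job["id"])
        for retry in runs[1:]:
            username = (retry.get("user") or {}).get("username", "unknown")
            if username not in bot_usernames:
                retries[username] += 1
    return retries


if __name__ == "__main__":
    example_jobs = [
        {"id": 1, "name": "test x86_64", "user": {"username": "cki-bot"}},
        {"id": 2, "name": "test x86_64", "user": {"username": "cki-bot"}},
        {"id": 3, "name": "test x86_64", "user": {"username": "alice"}},
        {"id": 4, "name": "build", "user": {"username": "alice"}},
    ]
    print(count_user_retries(example_jobs))  # Counter({'alice': 1})
```

The per-user counts could then be exported as a counter metric and used as the basis for the retry alert; how they are exported is left open here.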
Jira: [CKI-7126](https://issues.redhat.com/browse/CKI-7126)
epic