Improve Pipeline Infrastructure Stability
Identify what leads users to retry jobs in GitLab pipelines, and resolve the underlying causes to reduce user-visible failures.

AC:

- Provide metrics on how often jobs are retried by non-CKI users rather than by the pipeline-herder
- Alert when users retry jobs
- Improve alerts about failing jobs (recognize jobs that fail for similar reasons and escalate to Sentry/Alertmanager)
- Update the documentation on how to convert the new alerts into pipeline-herder rules

Jira: [CKI-7126](https://issues.redhat.com/browse/CKI-7126)
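As a starting point for the retry metric, here is a minimal sketch of how user retries might be counted from the jobs of a single pipeline. It assumes each job is a plain dict carrying the `id`, `name`, and `user.username` fields found in GitLab job objects, that a job counts as a retry when an earlier job with the same name exists in the same pipeline, and that automation accounts can be recognized by username. The names `count_user_retries` and `BOT_USERNAMES`, and the usernames in that set, are hypothetical placeholders rather than existing CKI code.

```python
from collections import Counter

# Hypothetical automation accounts; replace with the real bot usernames.
BOT_USERNAMES = frozenset({"cki-bot", "pipeline-herder"})


def count_user_retries(jobs, bot_usernames=BOT_USERNAMES):
    """Count retried jobs per non-bot user within one pipeline.

    Assumption: a job is a retry if a job with the same name and a
    lower id already exists in the pipeline. The retry is attributed
    to the user recorded on the retried (newer) job.
    """
    seen_names = set()
    retries = Counter()
    # Process jobs in creation order, assuming ids grow over time.
    for job in sorted(jobs, key=lambda j: j["id"]):
        if job["name"] in seen_names:
            username = job["user"]["username"]
            if username not in bot_usernames:
                retries[username] += 1
        seen_names.add(job["name"])
    return retries


# Example input: "test" is retried by bob and by the herder,
# and "build" is retried by bob.
example_jobs = [
    {"id": 1, "name": "build", "user": {"username": "alice"}},
    {"id": 2, "name": "test", "user": {"username": "alice"}},
    {"id": 3, "name": "test", "user": {"username": "bob"}},
    {"id": 4, "name": "test", "user": {"username": "pipeline-herder"}},
    {"id": 5, "name": "build", "user": {"username": "bob"}},
]
user_retries = count_user_retries(example_jobs)
```

The per-user counts could then be exported as a metric and used as the basis for the retry alert; how the job list is fetched and where the counts are published is left open here.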
epic