Add architecture doc for master-broken incident system

Context

The master-broken incident flow in triage-ops spans multiple files across 3 directories, so understanding it end to end requires reading ~10 source files. As a result, the flow has been re-explored from scratch multiple times.

What's in this MR?

Adds triage/triage/pipeline_failure/ARCHITECTURE.md, placed next to the code it documents. It covers:

  • End-to-end flow from webhook to escalation
  • All 8 config classes with matching rules
  • Auto-triage logic (~7 root cause patterns)
  • Duplicate detection mechanism
  • Slack notification routing
  • Escalation SLA timeline (10 min, 30 min, 3 h 40 min, 4 h)
  • Related projects (Observer, ci-alerts, EP-infra)
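
For reviewers who want a quick mental model of the escalation timeline listed above, here is a minimal sketch. It is not taken from triage-ops: only the four thresholds come from this description, while the stage labels and the `escalation_stage` helper are hypothetical placeholders. It assumes each threshold is measured from the start of the incident.

```python
from datetime import timedelta

# Thresholds come from the SLA timeline in this description.
# The stage labels are hypothetical; the real stage names and actions
# are documented in ARCHITECTURE.md.
ESCALATION_THRESHOLDS = [
    (timedelta(minutes=10), "stage-1"),
    (timedelta(minutes=30), "stage-2"),
    (timedelta(hours=3, minutes=40), "stage-3"),
    (timedelta(hours=4), "stage-4"),
]


def escalation_stage(elapsed):
    """Return the label of the latest threshold reached, or None if none has been."""
    stage = None
    for threshold, label in ESCALATION_THRESHOLDS:
        if elapsed >= threshold:
            stage = label
    return stage


if __name__ == "__main__":
    for minutes in (5, 10, 45, 225, 300):
        print(minutes, escalation_stage(timedelta(minutes=minutes)))
```

Keeping the thresholds in one ordered table would make the timeline easy to compare against the documented SLA, though the actual code may structure this differently.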
