Phase 1 - Incident Routing for Cells On-Call
## Summary

This is the second phase of the [project to define an on-call process for Cells](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1787). The Incident Routing phase will define how alerts are routed to pagers, and how we create incidents from them.

## DRI

@devin

## Objectives

1. Alertmanager integration with incident.io for automated incident creation: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/28061
2. Cells team Tier 2 escalation: https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/work_items/28005

## Deliverables

### Tier 2 Rotation

- Best effort Tier 2 rotation schedule
- Escalation path which can be selected by EOC

### Alertmanager Integration

- incident.io configured as an Alertmanager destination
- Metadata added to alerts so that incident.io can filter and route them

## Key Questions to Answer

- Will we set up Alertmanager routing just for Cells, or standardize it across all of the Dedicated tooling?

## Exit Criteria

- [ ] Tier 2 - Cells Rotation Schedule
- [ ] Cells Alertmanager can send alerts to incident.io

## Timeline

Target: 2-3 weeks of working time to allow adequate implementation time before Protocells launch. The due date takes into account the PTO scheduled around the holidays.
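As a rough illustration of the Alertmanager Integration deliverable, the fragment below sketches one way an Alertmanager instance could forward Cells alerts to incident.io through a generic webhook receiver. It is only a sketch under stated assumptions, not the agreed design: the receiver names, the `team` label and its value, the webhook URL, and the token file path are all hypothetical placeholders, and the real endpoint and authentication details would come from the alert source configured in incident.io.

```yaml
# Hypothetical sketch: route alerts labelled for Cells to an incident.io webhook.
route:
  receiver: default            # existing catch-all receiver (name assumed)
  routes:
    - matchers:
        - team = "cells"       # assumed label; the metadata incident.io would filter on
      receiver: incident-io-cells

receivers:
  - name: default
  - name: incident-io-cells
    webhook_configs:
      - url: https://incident-io-webhook.example.invalid/alertmanager  # placeholder, not the real endpoint
        send_resolved: true    # also forward resolution notifications
        http_config:
          authorization:
            type: Bearer
            credentials_file: /etc/alertmanager/secrets/incident-io-token  # assumed path
```

Whether a route like this lives only in the Cells Alertmanager or is standardized across the Dedicated tooling is the open question listed under Key Questions to Answer.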
## Related Links

- [Parent Epic: Establish On-Call Process for GitLab Cells (#1787)](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1787)
- [Protocells Epic (#1616)](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1616)
- [Cells Architecture Design](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/cells/)

## Issue Admin

```
/labels ~"group::Networking & Incident Management" ~"workflow-infra::Triage"
```

<!-- STATUS NOTE START -->

## Status 2026-02-19

:clock1: **total hours spent this week by all contributors**: 4

:tada: **achievements**:

- Cells team is set up as a best effort Tier 2 on-call rotation, so we can page them when we're ready

:warning: **change in plan**

- The Cells team is now targeting [Q3 for Protocells](https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1787#note_3099229073) for production. This will push creation of the documentation into Q2, and temporarily reduce their focus on providing the things Incident Management needs to proceed.

:issue-blocked: **blockers**:

- Waiting for Cells and Observability teams to [finalize sending Cells alerts to incident.io](https://gitlab.com/gitlab-com/gl-infra/tenant-scale/cells-infrastructure/team/-/issues/616#note_3003480408) so we can take action on them
- Also waiting on the creation of a [LevelUP Training for the EOCs](https://gitlab.com/gitlab-com/gl-infra/tenant-scale/tenant-services/team/-/work_items/357)

_Copied from https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/1789#note_3095952419_

<!-- STATUS NOTE END -->
epic