@@ -10,7 +10,7 @@ Rotation Leaders are expected to:
- [align according to Infrastructure Platform expectations](/handbook/engineering/infrastructure-platforms/incident-management/on-call/#responsibilities-for-rotation-leaders),
- coordinate the DevOps on-call rotation (adding and removing shifts),
- ensure there are enough team members to [provide adequate coverage](/handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations/),
- ensure there are enough team members to [provide adequate coverage](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/#coverage-expectations),
- ensure those team members understand their role,
- serve as a point of escalation on the escalation path, and
- conduct regular reviews on the effectiveness of the rotation
@@ -41,7 +41,7 @@ While [general guidance is provided](/handbook/engineering/infrastructure-platfo
Tier 2 Rotations refer to on-call rotations that respond to pages where a human makes a decision to page a team member for support.
@@ -10,9 +12,25 @@ The Tier-2 SME On-Call program enhances incident response by establishing a seco
This program was introduced at GitLab in 2025 with a target of providing 24x7 coverage for areas where specialised domain knowledge will improve incident response. In practice, many teams are not set up to provide this level of cover. As such, we began with a Pilot Program to understand these gaps and learn how to support these teams to achieve this level of cover.
## Active Tier 2 Rotations
## When to escalate to Tier 2
A summary of currently active Tier 2 rotations is listed below. For more detail on expertise and when to escalate to each team, see the [Tier 2 Escalations](/handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations.md) page.
Escalate to a Tier 2 team when:
- The incident requires deep domain expertise in a specific service
- The EOC has identified the problem area but needs specialized assistance
- Performance issues or outages are isolated to a specific subsystem
## How to escalate
To page a Tier 2 team:
1. Use the `/inc escalate` command in Slack or click to escalate in the right sidebar of the incident UI
2. Select the appropriate team from the "Oncall team" dropdown menu
3. Provide a clear message describing the issue and what assistance is needed
## Active Tier 2 rotations
A summary of currently active Tier 2 rotations is listed below.
### Gitaly
@@ -22,6 +40,41 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Escalation History Link: [escalations](https://app.incident.io/gitlab/on-call/escalations?escalation_path%5Bone_of%5D=01JJWB07RXAG02RXYR4QR47J9E)
- Escalation History Link: [escalation](https://app.incident.io/gitlab/on-call/escalations?escalation_path%5Bone_of%5D=01K22CAST6CK8Y4DVN7ET8YQZX)
**Expertise Areas:**
- AI Gateway and Duo feature availability
- Model serving infrastructure and AI feature performance
- Token usage, rate limiting, and AI provider integrations
**When to Escalate:**
- AI features unavailable or degraded
- High error rates from AI services
- Model serving or AI Gateway connectivity issues
---
### DevOps
- Rotation Leader: [see who is on call](https://app.incident.io/gitlab/on-call/schedules/01K611ZT9YX2PSA8WAMEP6A66G) (falls back to Michelle Gill)
@@ -38,6 +105,30 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Slack Channel for Rotation Swaps: [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0)
- Escalation upon non-response: `@mention` the EM or SEM/Director for the on-call team member who did not respond, using the Slack channel [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0) to ask for additional support. If leadership does not respond, use `@here + msg` in [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0) to request help from another available engineer.
DevOps is the name given to a group of features that are part of the Rails monolith.
They should be contacted when assistance is needed with one of the features below.
- Authentication (SAML, LDAP, OAuth login, Access tokens such as PATs/PrAT/GrATs/CI_JOB_TOKENS)
- Authentication (Enterprise users, Service accounts and Cloud Connector authentication)
- Authorization (Custom roles, Granular permissions on CI_JOB_TOKENS/PATs, ProjectAuthorizationWorker)
- Pipeline Security (OIDC with ID tokens, Secrets manager, External Secrets integrations, Build attestations and Cosign integration)
**When to Escalate:**
- Incidents impacting login or authentication to GitLab.com
- Incidents causing severe disruption due to Sidekiq overload on permission update workers
- SIRT issues S2 and above that require immediate action from the engineering team to remediate the problem
- Recent feature additions for secrets manager, granular permissions or authentication services that are degrading availability of GitLab.com
---
### Dev escalation
- This on-call process is designed for GitLab.com operational issues that are escalated by the Infrastructure team.
- Development team currently does NOT use PagerDuty or incident.io for scheduling and paging.
@@ -78,14 +219,20 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Check out [process description and on-call workflow](/handbook/engineering/development/processes/infra-dev-escalation/process/) when escalating GitLab.com operational issue(s).
- Check out more detail for [general information](/handbook/engineering/development/processes/infra-dev-escalation/) of the escalation process.
### Pilot Program
## Coverage expectations
- **24x5 Coverage**: Monday 00:00 UTC through Friday 23:59 UTC
- **Response SLA**: 15 minutes during coverage hours
- **Weekend/Holiday Coverage**: Critical escalations go to IMOC and Infrastructure Leadership
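
The 24x5 window above can be checked mechanically. The following is a minimal sketch, not part of any GitLab or incident.io tooling: the function name `in_tier2_coverage` is hypothetical, and public holidays are deliberately not modelled because they differ per team member.

```python
from datetime import datetime, timezone


def in_tier2_coverage(moment: datetime) -> bool:
    """Return True if `moment` falls within Monday 00:00 UTC through Friday 23:59 UTC.

    Hypothetical helper for illustration only; holidays are not considered.
    """
    if moment.tzinfo is None:
        # Naive datetimes are ambiguous, so refuse them rather than guess a zone.
        raise ValueError("moment must be timezone-aware")
    utc_moment = moment.astimezone(timezone.utc)
    # weekday(): Monday is 0 and Sunday is 6, so 0-4 covers Monday to Friday.
    return utc_moment.weekday() < 5
```

Converting to UTC first matters for team members outside UTC: a Friday evening page in the Americas may already fall on Saturday in UTC, which is outside the coverage window.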
## Pilot program
The Pilot Program aims to cover ordinary working hours with 24x5 coverage. The Pilot was viewed as an acceptable first iteration towards full coverage because 90% of S1 and S2 incidents take place during ordinary working hours.
For the purpose of this program, ordinary working hours means:
1. _As close as possible to the 8 hours that you would ordinarily work_
2. _Not public holidays or weekends_
1. *As close as possible to the 8 hours that you would ordinarily work*
2. *Not public holidays or weekends*
As described on the main on-call page, rotation leaders can choose an 8-hour cycle that meets their needs. The recommendation is (UTC):
@@ -95,7 +242,7 @@ As described on the main on-call page, rotation leaders can choose an 8-hour cyc
If you have team members that don't naturally align to these times, it is at the rotation leader's discretion how to manage this situation. It is important to provide coverage, and to enable team members to contribute to on-call in a meaningful way. There will always be circumstances where we need to be flexible, and this flexibility goes both ways.
#### Public Holidays
### Public holidays
It is very difficult for the rotation leader to know the public holidays for every team member in their rotation. It is the team member's responsibility to find coverage if they are scheduled for on-call on a public holiday.
@@ -133,23 +280,23 @@ Rotations in the process of being created and onboarded can be viewed in the [On
### Tier 1 EOC or IM requests
#### Escalation Criteria
#### Escalation criteria
The Tier-1 Engineer On-Call (EOC) will perform initial triage and use available documentation before escalating to Tier-2 SMEs. Pages may also be initiated by the Incident Manager (IM) supporting the incident.
##### Before Escalating to Tier-2
##### Before escalating to Tier-2
Tier-1 must:
1. Follow all recommendations in runbooks and playbooks for the affected area
2. Document attempted solutions and outcomes in the incident issue
- **S1/S2 Incidents**: When the Tier-1 team cannot resolve them independently using runbooks, documentation or other sources. Due to their critical nature, Tier-2 SMEs should expect to be paged for these incidents when domain-specific expertise is needed.
Tier 2 on-call rotations provide specialized subject matter expertise during incident response. These teams serve as escalation points when incidents require domain-specific knowledge beyond the scope of the primary Engineer On Call (EOC).
## When to Escalate to Tier 2
Escalate to a Tier 2 team when:
- The incident requires deep domain expertise in a specific service
- The EOC has identified the problem area but needs specialized assistance
- Performance issues or outages are isolated to a specific subsystem
## How to Escalate
To page a Tier 2 team:
1. Use the `/inc escalate` command in Slack or click to escalate in the right sidebar of the incident UI
2. Select the appropriate team from the "Oncall team" dropdown menu
3. Provide a clear message describing the issue and what assistance is needed
## Available Tier 2 Rotations
### Gitaly
**Expertise Areas:**
- Git repository storage, access, and replication issues
- Gitaly service performance and node failures
- Repository corruption or data integrity concerns