Commit dd3a3fed authored by Michelle Gill

Consolidate Tier 2 escalation pages

parent b63014df
+2 −2
@@ -10,7 +10,7 @@ Rotation Leaders are expected to:

- [align according to Infrastructure Platform expectations](/handbook/engineering/infrastructure-platforms/incident-management/on-call/#responsibilities-for-rotation-leaders),
- coordinate the DevOps on-call rotation (adding and removing shifts),
- ensure there are enough team members to [provide adequate coverage](/handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations/),
- ensure there are enough team members to [provide adequate coverage](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/#coverage-expectations),
- ensure those team members understand their role,
- serve as a point of escalation on the escalation path, and  
- conduct regular reviews on the effectiveness of the rotation
@@ -41,7 +41,7 @@ While [general guidance is provided](/handbook/engineering/infrastructure-platfo

### Coverage Hours

[See coverage expectations here.](/handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations/#coverage-expectations)
[See coverage expectations here.](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/#coverage-expectations)

### Public Holidays

+158 −11
---
title: On-Call Processes and Policies - Tier 2
aliases:
  - /handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations/
---

Tier 2 Rotations refer to on-call rotations that respond to pages where a human makes a decision to page a team member for support.
@@ -10,9 +12,25 @@ The Tier-2 SME On-Call program enhances incident response by establishing a seco

This program was introduced at GitLab in 2025 with a target of providing 24x7 coverage for areas where specialised domain knowledge will improve incident response. In practice, many teams are not set up to provide this level of cover. As such, we began with a Pilot Program to understand these gaps and learn how to support these teams to achieve this level of cover.

## Active Tier 2 Rotations
## When to escalate to Tier 2

A summary of currently active Tier 2 rotations is listed below.  For more detail on expertise and when to escalate to each team, see the [Tier 2 Escalations](/handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations.md) page.
Escalate to a Tier 2 team when:

- The incident requires deep domain expertise in a specific service
- The EOC has identified the problem area but needs specialized assistance
- Performance issues or outages are isolated to a specific subsystem

## How to escalate

To page a Tier 2 team:

1. Use the `/inc escalate` command in Slack or click to escalate in the right sidebar of the incident UI
2. Select the appropriate team from the "Oncall team" dropdown menu
3. Provide a clear message describing the issue and what assistance is needed

## Active Tier 2 rotations

A summary of currently active Tier 2 rotations is listed below.

### Gitaly

@@ -22,6 +40,41 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Escalation History Link: [escalations](https://app.incident.io/gitlab/on-call/escalations?escalation_path%5Bone_of%5D=01JJWB07RXAG02RXYR4QR47J9E)
- [More Information](/handbook/engineering/infrastructure-platforms/tenant-scale/gitaly/#on-call-rotation)

**Expertise Areas:**

- Git repository storage, access, and replication issues
- Gitaly service performance and node failures
- Repository corruption or data integrity concerns
- Git operations (clone, fetch, push) failures

**When to Escalate:**

- High error rates on Git operations
- Repository access failures affecting multiple projects
- Gitaly node or cluster issues

---

### Database Operations (DBO)

**Expertise Areas:**

- PostgreSQL performance, replication, and failover
- Query performance issues, deadlocks, and connection pool problems
- Database migrations blocking deployments
- PgBouncer and database capacity issues

**When to Escalate:**

- Database performance degradation or replication lag
- Failed migrations blocking deployments
- Connection pool saturation
- Slow queries impacting application performance

**Coverage:** Best Effort - 24x5 (Monday-Friday)

---

### AI Powered

- Rotation Leader: Martin Wortschack
@@ -29,6 +82,20 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Schedule: [schedule](https://app.incident.io/gitlab/on-call/schedules/01K22BJ3V6C41NW8RJ881B08XZ)
- Escalation History Link: [escalation](https://app.incident.io/gitlab/on-call/escalations?escalation_path%5Bone_of%5D=01K22CAST6CK8Y4DVN7ET8YQZX)

**Expertise Areas:**

- AI Gateway and Duo feature availability
- Model serving infrastructure and AI feature performance
- Token usage, rate limiting, and AI provider integrations

**When to Escalate:**

- AI features unavailable or degraded
- High error rates from AI services
- Model serving or AI Gateway connectivity issues

---

### DevOps

- Rotation Leader: [see who is on call](https://app.incident.io/gitlab/on-call/schedules/01K611ZT9YX2PSA8WAMEP6A66G) (falls back to Michelle Gill)
@@ -38,6 +105,30 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Slack Channel for Rotation Swaps: [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0)
- Escalation upon non-response: `@mention` the EM or SEM/Director for the on-call team member who did not respond, using the slack channel [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0) to ask for additional support. In the event that leadership does not respond, use `@here + msg` in [`#tier-2-devops-rotation-swaps`](https://gitlab.enterprise.slack.com/archives/C09LLF79AK0) requesting help from another available engineer.

DevOps is the name given to a group of features that are part of the Rails monolith.
They should be contacted when assistance is needed with one of the features below.

**Teams represented in DevOps Tier 2 on Call:**

CI Platform, Code Review, Container Registry, Environments, Import, Knowledge, Package Registry, Pipeline Authoring, Pipeline Execution, Product Planning, Project Management, Source Code

**Categories/Services represented in DevOps Tier 2 oncall:**

Fleet Visibility, Design Management, Environments, Deployments, Release Management, Importers, Migration, Direct Transfer, Package Registry, Virtual Registry, Dependency Proxy for Containers, Product Planning, Portfolio Management, Requirements Management, Project Management, Issue Tracking, Work Items, Boards, Workspaces, Source Code Management, Repository Management, Protected Branches, Workspaces Rails code, Container Registry Rails Code

**When to Escalate:**

Please do not escalate for general Rails concerns.

- Application-level errors (500s, 422s) with cause inside of one of these features.
- Sidekiq queue backlogs or processing failures where the worker is the responsibility of this group.
- Memory issues in Rails processes originating from this group.
- Application deployment failures requiring rollback where the failure is linked to a feature in this group.

*Note: APAC coverage utilizes IMOC rotation during APAC hours*

---

### Runners Platform

- Rotation Leader: Kam Kyrala
@@ -46,6 +137,22 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Escalation History Link: [escalations](https://app.incident.io/gitlab/on-call/escalations?escalation_path%5Bone_of%5D=01K7HSQ433CMD61V4RNS70BJ47)
- Primary Slack Channel: #g_runners_platform

**Expertise Areas:**

- Runner platform infrastructure and SaaS runner managers
- Job execution issues related to runners (provisioning, startup, teardown)
- Runner registration, capacity, and scheduling concerns
- Runner manager service performance and connectivity

**When to Escalate:**

- Incidents impacting job execution attributable to runners or runner managers
- Widespread runner provisioning failures, hangs, or unexpected timeouts
- Capacity shortfalls or saturation in runner managers affecting customers
- Repeated job failures suspected to be caused by runner platform issues

---

### Fulfillment

- Rotation Leader: James Lopez
@@ -55,6 +162,24 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Primary Slack Channel: #s_fulfillment_engineering
- [More Information](/handbook/engineering/development/fulfillment/#escalation-process-for-incidents-or-outages)

**Expertise Areas:**

- CustomersDot application and purchasing infrastructure
- Subscription management, billing, and provisioning systems
- Usage billing flows and consumption-based pricing
- License generation and validation
- Zuora integration and order processing
- Customer portal and self-service workflows

**When to Escalate:**

- CustomersDot outages or critical errors affecting purchases
- Subscription provisioning or license generation failures
- Billing system integration issues impacting customers
- High error rates in purchase or subscription workflows

---

### Authn/Authz/Pipeline Security

- Rotation Leader: Adil Farrukh
@@ -64,7 +189,23 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Primary Slack Channel: #s_software-supply-chain-security (or #g_sscs_authentication, #g_sscs_authorization, #g_sscs_pipeline_security)
- [More Information](/handbook/engineering/development/sec/software-supply-chain-security/oncall/)

### Dev Escalation
**Expertise Areas:**

- Authentication (SAML, LDAP, OAuth login, Access tokens such as PATs/PrAT/GrATs/CI_JOB_TOKENS)
- Authentication (Enterprise users, Service accounts and Cloud Connector authentication)
- Authorization (Custom roles, Granular permissions on CI_JOB_TOKENS/PATs, ProjectAuthorizationWorker)
- Pipeline Security (OIDC with ID tokens, Secrets manager, External Secrets integrations, Build attestations and Cosign integration)

**When to Escalate:**

- Incidents impacting login or authentication to GitLab.com
- Incidents causing severe disruption due to sidekiq overload on permission update workers
- SIRT issues S2 and above that require immediate action from the engineering team to remediate the problem.
- Recent feature additions for secrets manager, granular permissions or authentication services that are degrading availability of GitLab.com

---

### Dev escalation

- This on-call process is designed for GitLab.com operational issues that are escalated by the Infrastructure team.
- Development team currently does NOT use PagerDuty or incident.io for scheduling and paging.
@@ -78,14 +219,20 @@ A summary of currently active Tier 2 rotations is listed below. For more detail
- Check out [process description and on-call workflow](/handbook/engineering/development/processes/infra-dev-escalation/process/) when escalating GitLab.com operational issue(s).
- Check out more detail for [general information](/handbook/engineering/development/processes/infra-dev-escalation/) of the escalation process.

### Pilot Program
## Coverage expectations

- **24x5 Coverage**: Monday 00:00 UTC through Friday 23:59 UTC
- **Response SLA**: 15 minutes during coverage hours
- **Weekend/Holiday Coverage**: Critical escalations go to IMOC and Infrastructure Leadership
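
The coverage window and response SLA above can be expressed as a small check. This is a minimal sketch, not part of the handbook or of incident.io: the names `in_coverage`, `response_due`, and `RESPONSE_SLA` are hypothetical, and public holidays are deliberately not modelled because they differ per team member.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: taken from "Response SLA: 15 minutes during coverage hours".
RESPONSE_SLA = timedelta(minutes=15)


def in_coverage(ts: datetime) -> bool:
    """Return True if ts falls within Monday 00:00 UTC through Friday 23:59 UTC."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Convert to UTC first; weekday() returns 0 for Monday and 6 for Sunday.
    return ts.astimezone(timezone.utc).weekday() < 5


def response_due(paged_at: datetime) -> Optional[datetime]:
    """Return the time by which a response is expected, or None outside coverage.

    Outside coverage hours, critical escalations go to IMOC and
    Infrastructure Leadership instead, so no Tier 2 deadline applies.
    """
    if not in_coverage(paged_at):
        return None
    return paged_at + RESPONSE_SLA


if __name__ == "__main__":
    page = datetime(2025, 1, 6, 9, 30, tzinfo=timezone.utc)  # a Monday
    print(in_coverage(page), response_due(page))
```

Converting to UTC before checking the weekday matters for team members in other time zones: a page sent on Friday evening in the Americas can already be Saturday in UTC, which is outside the 24x5 window.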

## Pilot program

The Pilot Program aims to cover ordinary working hours with 24x5 coverage. The Pilot was viewed as an acceptable first iteration towards full coverage because 90% of S1 and S2 incidents take place during ordinary working hours.

For the purpose of this program, ordinary working hours means:

1. _As close as possible to the 8 hours that you would ordinarily work_
2. _Not public holidays or weekends_
1. *As close as possible to the 8 hours that you would ordinarily work*
2. *Not public holidays or weekends*

As described on the main on-call page, rotation leaders can choose an 8-hour cycle that meets their needs. The recommendation is (UTC):

@@ -95,7 +242,7 @@ As described on the main on-call page, rotation leaders can choose an 8-hour cyc

If you have team members who don't naturally align to these times, it is at the rotation leader's discretion how to manage this situation. It is important to provide coverage, and to enable team members to contribute to on-call in a meaningful way. There will always be circumstances where we need to be flexible - and this flexibility goes both ways.

#### Public Holidays
### Public holidays

It is very difficult for the rotation leader to know the public holidays for every team member in their rotation. It is the team member's responsibility to find coverage if they are scheduled for on-call on a public holiday.

@@ -133,23 +280,23 @@ Rotations in the process of being created and onboarded can be viewed in the [On

### Tier 1 EOC or IM requests

#### Escalation Criteria
#### Escalation criteria

The Tier-1 Engineer On-Call (EOC) will perform initial triage and use available documentation before escalating to Tier-2 SMEs. Pages may also be initiated by the Incident Manager (IM) supporting the incident.

##### Before Escalating to Tier-2
##### Before escalating to Tier-2

Tier-1 must:

1. Follow all recommendations in runbooks and playbooks for the affected area
2. Document attempted solutions and outcomes in the incident issue

###### Resource Locations
###### Resource locations

- [Runbooks](https://gitlab.com/gitlab-com/runbooks/-/tree/master/docs)
- [Playbooks](https://internal.gitlab.com/handbook/engineering/tier2-oncall/playbooks/)

#### By Severity Level
#### By severity level

- **S1/S2 Incidents**: When the Tier-1 team cannot resolve them independently using runbooks, documentation or other sources. Due to their critical nature, Tier-2 SMEs should expect to be paged for these incidents when domain-specific expertise is needed.

+3 −178
---
title: Tier 2 Escalations
aliases:
  - /handbook/engineering/infrastructure-platforms/incident-management/tier2-escalations/
---

## Overview

Tier 2 on-call rotations provide specialized subject matter expertise during incident response. These teams serve as escalation points when incidents require domain-specific knowledge beyond the scope of the primary Engineer On Call (EOC).

## When to Escalate to Tier 2

Escalate to a Tier 2 team when:

- The incident requires deep domain expertise in a specific service
- The EOC has identified the problem area but needs specialized assistance
- Performance issues or outages are isolated to a specific subsystem

## How to Escalate

To page a Tier 2 team:

1. Use the `/inc escalate` command in Slack or click to escalate in the right sidebar of the incident UI
2. Select the appropriate team from the "Oncall team" dropdown menu
3. Provide a clear message describing the issue and what assistance is needed

## Available Tier 2 Rotations

### Gitaly

**Expertise Areas:**

- Git repository storage, access, and replication issues
- Gitaly service performance and node failures
- Repository corruption or data integrity concerns
- Git operations (clone, fetch, push) failures

**When to Escalate:**

- High error rates on Git operations
- Repository access failures affecting multiple projects
- Gitaly node or cluster issues

**Coverage:** 24x5 (Monday-Friday, business hours)

---

### Database Operations (DBO)

**Expertise Areas:**

- PostgreSQL performance, replication, and failover
- Query performance issues, deadlocks, and connection pool problems
- Database migrations blocking deployments
- PgBouncer and database capacity issues

**When to Escalate:**

- Database performance degradation or replication lag
- Failed migrations blocking deployments
- Connection pool saturation
- Slow queries impacting application performance

**Coverage:** Best Effort - 24x5 (Monday-Friday)

---

### AI

**Expertise Areas:**

- AI Gateway and Duo feature availability
- Model serving infrastructure and AI feature performance
- Token usage, rate limiting, and AI provider integrations

**When to Escalate:**

- AI features unavailable or degraded
- High error rates from AI services
- Model serving or AI Gateway connectivity issues

**Coverage:** 24x5 (Monday-Friday, business hours)

---

### DevOps

DevOps is the name given to a group of features that are part of the Rails monolith.
They should be contacted when assistance is needed with one of the features below.

**Teams represented in DevOps Tier 2 on Call:**

CI Platform, Code Review, Container Registry, Environments, Import, Knowledge, Package Registry,Pipeline Authoring, Pipeline Execution, Product Planning, Project Management, Source Code

**Categories/Services represented in DevOps Tier 2 oncall:**

Fleet Visibility, Design Management, Environments, Deployments, Release Management, Importers, Migration, Direct Transfer, Package Registry, Virtual Registry, Dependency Proxy for Containers, Product Planning, Portfolio Management, Requirements Management, Project Management, Issue Tracking, Work Items, Boards, Workspaces, Source Code Management, Repository Management, Protected Branches, Workspaces Rails code, Container Registry Rails Code

**When to Escalate:**

Please do not escalate for general Rails concerns.

- Application-level errors (500s, 422s) with cause inside of one of these features.
- Sidekiq queue backlogs or processing failures where the worker is the responsibility of this group.
- Memory issues in Rails processes originating from this group.
- Application deployment failures requiring rollback where the failure is linked to a feature in this group.

**Coverage:** 24x5 (Monday-Friday, business hours)
*Note: APAC coverage utilizes IMOC rotation during APAC hours*

---

### Runners Platform

**Expertise Areas:**

- Runner platform infrastructure and SaaS runner managers
- Job execution issues related to runners (provisioning, startup, teardown)
- Runner registration, capacity, and scheduling concerns
- Runner manager service performance and connectivity

**When to Escalate:**

- Incidents impacting job execution attributable to runners or runner managers
- Widespread runner provisioning failures, hangs, or unexpected timeouts
- Capacity shortfalls or saturation in runner managers affecting customers
- Repeated job failures suspected to be caused by runner platform issues

**Coverage:** Best Effort - 24x5 (Monday-Friday)

---

### Fulfillment

**Expertise Areas:**

- CustomersDot application and purchasing infrastructure
- Subscription management, billing, and provisioning systems
- Usage billing flows and consumption-based pricing
- License generation and validation
- Zuora integration and order processing
- Customer portal and self-service workflows

**When to Escalate:**

- CustomersDot outages or critical errors affecting purchases
- Subscription provisioning or license generation failures
- Billing system integration issues impacting customers
- High error rates in purchase or subscription workflows

**Coverage:** 24x5 (Monday-Friday, business hours)

---

### Authn/Authz/Pipeline Security

**Expertise Areas:**

- Authentication (SAML, LDAP, OAuth login, Access tokens such as PATs/PrAT/GrATs/CI_JOB_TOKENS)
- Authentication (Enterprise users, Service accounts and Cloud Connector authentication)
- Authorization (Custom roles, Granular permissions on CI_JOB_TOKENS/PATs, ProjectAuthorizationWorker)
- Pipeline Security (OIDC with ID tokens, Secrets manager, External Secrets integrations, Build attestations and Cosign integration)

**When to Escalate:**

- Incidents impacting login or authentication to GitLab.com
- Incidents causing severe distruption due to sidekiq overload on permission update workers
- SIRT issues S2 and above that reuqire immediate action from the engineering team to remediate the problem.
- Recent feature additions for secrets manager, granular permissions or authentication services that are degrading availability of GitLab.com

**Coverage:** 24x5 (Monday-Friday, business hours but best effort for APAC)

---

## Coverage Expectations

- **24x5 Coverage**: Monday 00:00 UTC through Friday 23:59 UTC
- **Response SLA**: 15 minutes during coverage hours
- **Weekend/Holiday Coverage**: Critical escalations go to IMOC and Infrastructure Leadership

## Related Pages

- [Incident Management](/handbook/engineering/infrastructure-platforms/incident-management/)
- [Tier 2 On-Call](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2.md)
This page has moved to [Tier 2 On-Call](/handbook/engineering/infrastructure-platforms/incident-management/on-call/tier-2/).