Commit 9daad5ce authored by John Jarvis

Adds platform-leadership review for C1/C2 changes

parent 500f1256
+4 −2
---
title: "Change Management"
aliases:
  - /handbook/engineering/infrastructure-platforms/change-management.md
---

## Purpose
@@ -106,7 +108,7 @@ These are changes with high impact or high risk. If a change is going to cause d
   1. Plannable C1s - including but not limited to proactive maintenance, scheduled upgrades, and planned infrastructure changes - require a minimum 2-week lead time to ensure proper stakeholder notification.
1. Changes which include downtime must be pre-communicated to users. Follow the guidance for [Communicating a change that requires downtime](/handbook/engineering/infrastructure-platforms/change-management/#communicating-a-change-that-requires-downtime-maintenance-window)
1. All database-related changes should be reviewed by a DBRE.
1. Have the change approved by Infrastructure management at the Sr. Manager level or above by obtaining the `manager_approved` label on the Change Request issue. Mention `@gitlab-org/saas-platforms/inframanagers` to request approval and provide visibility to all SaaS Platforms infrastructure managers.
1. Have the change approved by Infrastructure Platform Leadership (EM+ or Staff+ ICs in Infrastructure Platforms) by obtaining the `platform_leadership_approved` label on the Change Request issue. Mention `@gitlab-org/saas-platforms/change-review-leadership` to request approval. The reviewer should be from a different team than the change author. Review the [Platform Leadership Review Guidelines](platform-leadership-review/) before approving.
1. Identify the Engineer On-Call (EOC) scheduled for the time of the change and make them aware of the change plan as soon as it is scheduled.
(The source is [incident.io](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW); if you don't have access, try [getting assistance](/handbook/engineering/infrastructure/team/).)
1. Mention `@release-managers` in Slack or `@gitlab-org/release/managers` in the issue to make sure you can pause deployments or migrations as needed by the change.
@@ -139,7 +141,7 @@ These are changes that are not expected to cause downtime in Production, but whi
1. Ensure the issue has a Due Date and that an event has been added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar.
1. Changes which include downtime must be pre-communicated to users. Follow the guidance for [Communicating a change that requires downtime](/handbook/engineering/infrastructure-platforms/change-management/#communicating-a-change-that-requires-downtime-maintenance-window)
1. All database-related changes should be reviewed by a DBRE.
1. Have the change approved by Infrastructure management at the manager level or above by obtaining the `manager_approved` label on the Change Request issue. Mention `@gitlab-org/saas-platforms/inframanagers` to request approval and provide visibility to all SaaS Platforms infrastructure managers.
1. Have the change approved by Platform Leadership (Staff+ SREs, Principal Engineers, or Senior Staff Engineers) by obtaining the `platform_leadership_approved` label on the Change Request issue. Mention `@gitlab-org/saas-platforms/change-review-leadership` to request approval. The reviewer should be from a different team than the change author. Review the [Platform Leadership Review Guidelines](platform-leadership-review/) before approving.
1. Identify the Engineer On-Call (EOC) scheduled for the time of the change and review the plan with them.
(The source is [incident.io](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW); if you don't have access, try [getting assistance](/handbook/engineering/infrastructure/team/).)
1. Announce the start of the plan execution in the `#production` Slack channel, directly notifying the EOC using the `@sre-oncall` alias, and have the change approved by the EOC by obtaining the `eoc_approved` label on the Change Request issue.
+57 −0
---
title: "Platform Leadership Review Guidelines"
---

## Purpose

This page provides guidelines for Platform Leadership reviewers when approving C1 and C2 change requests. The goal is to ensure consistency in reviews and improve the quality of high-criticality changes.

## Who Can Approve

Platform Leadership approval (the `platform_leadership_approved` label) can be provided by both ICs and EMs in Infrastructure Platforms. Eligible members are:

- Engineering Managers and above (EM+)
- Staff+ ICs (SREs and backend engineers at Staff level or above, including Principal)

The reviewer should be from a different team than the change author. This ensures an independent perspective and helps catch issues that may be overlooked by those close to the work.

Everyone who can approve is a member of the `@gitlab-org/saas-platforms/change-review-leadership` group.
Mention this group on the change request issue to request approval.
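
The two eligibility conditions above (membership in the approver group, and belonging to a different team than the author) can be expressed as a simple check. The following is only an illustrative sketch: the `Person` type, its fields, and `can_approve` are hypothetical names invented for this example and are not part of any GitLab tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    """Hypothetical record for a change author or reviewer."""
    username: str
    team: str
    # Assumption: True when the person belongs to the
    # change-review-leadership group described above.
    in_leadership_group: bool


def can_approve(reviewer: Person, author: Person) -> bool:
    """Return True when the reviewer is in the approver group
    and is on a different team than the change author."""
    return reviewer.in_leadership_group and reviewer.team != author.team
```

For example, a Staff+ reviewer on the same team as the author would fail this check, because the guidelines ask for an independent perspective.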

## Review Guidelines

When reviewing a C1 or C2 change request, verify the following:

### 1. Pre-Production Validation

The motivation for the change should be clear from the description, with a link to an issue for further details.

The same change plan should have been executed in a non-production environment using the same steps.

- Confirm the change has been tested in staging or another non-production environment
- Verify the test environment execution used identical steps to those proposed for production
- Check that any issues discovered in non-production testing have been addressed

### 2. Rollback Plan Quality

The rollback plan should be documented and detailed enough to be executed by any SRE.

- Verify the rollback plan is explicitly documented in the change request
- Ensure rollback steps are specific and actionable (not vague statements like "revert the change")
- Confirm an SRE unfamiliar with the change could execute the rollback without additional context
- Check that rollback time estimates are provided and realistic

### 3. Monitoring and Validation

The change request should include monitoring links and, where applicable, explicit steps for checking monitoring and what to validate.

- Verify relevant monitoring dashboard links are included
- Confirm the change plan includes explicit steps for checking monitoring during and after execution
- Check that success criteria are defined (what does "this change worked" look like?)
- Ensure there are clear indicators for when to trigger the rollback plan

## Applying the Label

Once you have verified the change request meets all the above criteria, apply the `platform_leadership_approved` label to the issue.

If the change request does not meet the criteria, provide specific feedback on what needs to be improved before approval can be granted.
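
As a rough aid for reviewers, the criteria from the three review sections can be modeled as a checklist that either clears the label or lists what is still missing. This is a minimal sketch under stated assumptions: `ReviewChecklist`, its field names, `unmet_criteria`, and `review_outcome` are hypothetical and only mirror the bullets on this page; no such tool is described in the handbook.

```python
from dataclasses import dataclass, fields


@dataclass
class ReviewChecklist:
    """Hypothetical checklist mirroring the review guidelines above."""
    # 1. Pre-Production Validation
    tested_in_non_production: bool = False
    same_steps_as_production: bool = False
    non_production_issues_addressed: bool = False
    # 2. Rollback Plan Quality
    rollback_documented: bool = False
    rollback_steps_actionable: bool = False
    rollback_time_estimated: bool = False
    # 3. Monitoring and Validation
    monitoring_links_included: bool = False
    monitoring_steps_explicit: bool = False
    success_criteria_defined: bool = False
    rollback_triggers_defined: bool = False


def unmet_criteria(checklist: ReviewChecklist) -> list[str]:
    """Return the names of all criteria that are not yet satisfied."""
    return [f.name for f in fields(checklist) if not getattr(checklist, f.name)]


def review_outcome(checklist: ReviewChecklist) -> str:
    """Suggest applying the label only when every criterion is met;
    otherwise list the specific items that need feedback."""
    missing = unmet_criteria(checklist)
    if not missing:
        return "apply label: platform_leadership_approved"
    return "request changes: " + ", ".join(missing)
```

Listing the unmet items by name matches the guidance to give specific feedback rather than a bare rejection.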
+2 −2
@@ -11,7 +11,7 @@ Rate limited requests will return a `429 - Too Many Requests` response.

## Processes

- Changes to rate limits require a [Change Request](../change-management.md/#change-request-workflows).
- Changes to rate limits require a [Change Request](../change-management/#change-request-workflows).
- Request assistance for a user's rate limiting settings with [this issue template](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/new?issuable_template=request-rate-limiting).
- For internal teams seeking a bypass, please refer to the [Rate Limit Bypass Policy](/handbook/engineering/infrastructure-platforms/rate-limiting/bypass-policy/).

@@ -187,7 +187,7 @@ For a full list of conditions where the header will be applied, see [this config

Our [Cloudflare runbook](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/cloudflare/) contains more detail on configuring this layer of our infrastructure.

Changes to Cloudflare rate limits require a [Change Request](../change-management.md/#change-request-workflows), and should
Changes to Cloudflare rate limits require a [Change Request](../change-management/#change-request-workflows), and should
be discussed with the [Production Engineering::Networking & Incident Management](https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/new?issuable_template=request-rate-limiting) SRE team before implementing.

### HAProxy