Enable Workhorse load shedding
<!--
Please review https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/ for the most recent information on our change plans and execution policies.
-->
# Production Change
## Change Summary
We want to enable the Workhorse load shedder added in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/218865.
This is to address https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/28055.
## Change Details
<!--
To automatically add your change to the GitLab Production calendar update the following fields:
- Time tracking
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)
Bot: https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/change-scheduler
-->
1. **Services Impacted** - ~"Service::Web" ~"Service::Websockets" ~"Service::API" ~"Service::Internal-API"
1. **Change Technician** - `@stanhu`
1. **Change Reviewer** - `@sun_lee`, `@msmiley`
1. **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - `2026-03-04 18:00`
1. **Time tracking** - `2 days`
1. **Downtime Component** - none
> [!IMPORTANT]
> If your change involves scheduled maintenance, add a step to set and
> [unset maintenance mode](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/set_maintenance_window.md)
> per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
## Preparation
> [!NOTE]
> The following checklists must be done in advance, before setting the label ~"change::scheduled"
### Change Reviewer checklist
<!--
To be filled out by the reviewer.
-->
~C4 ~C3 ~C2 ~C1:
- [ ] Check if the following applies:
  - The **scheduled day and time** of execution of the change is appropriate.
  - The [change plan](#detailed-steps-for-the-change) is technically accurate.
  - The change plan includes **estimated timing values** based on previous testing.
  - The change plan includes a viable [rollback plan](#rollback).
  - The specified [metrics/monitoring dashboards](#key-metrics-to-observe) provide sufficient visibility for the change.
~C2 ~C1:
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist
- [x] The [Change Criticality](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#change-criticalities) has been set appropriately and requirements have been reviewed.
- [ ] The [change plan](#detailed-steps-for-the-change) is technically accurate.
- [ ] The [rollback plan](#rollback) is technically accurate and detailed enough to be executed by anyone with access.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] The change execution window respects the [Production Change Lock periods](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#production-change-lock-pcl).
- [ ] Once all boxes above are checked, mark the change request as scheduled: `/label ~"change::scheduled"`
- [ ] For ~C1 and ~C2 change issues, the change event is added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com)
  calendar by the [change-scheduler bot](https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/change-scheduler).
  It is scheduled to run every 2 hours.
- [ ] For ~C1 change issues, a Senior Infrastructure Manager has provided approval with the ~manager_approved label on the issue.
- [ ] For ~C2 change issues, an Infrastructure Manager provided approval with the ~manager_approved label on the issue.
- [ ] For ~C1 and ~C2 changes, mention `@gitlab-org/saas-platforms/inframanagers` in this issue to provide visibility to all infrastructure managers.
- [ ] For ~C1, ~C2, or ~"blocks deployments" change issues, confirm with Release managers that the change does not
overlap or hinder any release process (In `#production` channel, mention `@release-managers` and this issue and
await their acknowledgment.)
- [ ] For ~C1 change issues, or ~C2 change issues happening during the weekend, SREs on-call must be informed
  [at least 2 weeks in advance](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#approval).
  Check [the incident.io GitLab.com Production EOC schedule](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW) to find who will be
  on-call at the scheduled day and time.
## Detailed steps for the change
### Pre-execution steps
> [!NOTE]
> The following steps should be done right at the scheduled time of the change request. The [preparation steps](#preparation) are
> listed above.
- [ ] Make sure all tasks in [Change Technician checklist](#change-technician-checklist) are done
- [ ] For ~C1 and ~C2 change issues, the SRE on-call has been informed prior to change being rolled out.
- [ ] The SRE on-call provided approval with the ~eoc_approved label on the issue.
- [ ] For ~C1, ~C2, or ~"blocks deployments" change issues, Release managers have been informed prior to change being rolled out. (In `#production` channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/?sort=created_date&state=opened&label_name%5B%5D=Incident%3A%3AActive&or%5Blabel_name%5D%5B%5D=severity%3A%3A1&or%5Blabel_name%5D%5B%5D=severity%3A%3A2&first_page_size=20) that are ~severity::1 or ~severity::2
- [ ] If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
### Change steps - steps to take to execute the change
We need to add the `workhorse.loadShedding` config:
```
workhorse:
  loadShedding:
    enabled: true
    backlogThreshold: 50
    retryAfterSeconds: 0
    strategy: max
```
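
As a rough illustration of what these settings might control, the Python sketch below models a backlog-based shedding decision. It is not the Workhorse implementation (which is written in Go). The `LoadSheddingConfig` class, the `should_shed` and `shed_response` helpers, the strict `>` comparison, and the reading of `strategy: max` as "use the largest recent backlog sample" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class LoadSheddingConfig:
    # Field names mirror the YAML keys above; the behavior is assumed.
    enabled: bool = True
    backlog_threshold: int = 50
    retry_after_seconds: int = 0


def should_shed(config, backlog_samples):
    """Return True if a new request should be rejected.

    backlog_samples is a list of recent TCP backlog depths. Taking the
    maximum is only a guess at what `strategy: max` means.
    """
    if not config.enabled or not backlog_samples:
        return False
    return max(backlog_samples) > config.backlog_threshold


def shed_response(config):
    """Build the (status, headers) pair sent when a request is shed."""
    headers = {}
    # Assumption: a value of 0 means no Retry-After header is sent.
    if config.retry_after_seconds > 0:
        headers["Retry-After"] = str(config.retry_after_seconds)
    return 503, headers
```

Under these assumptions, a backlog sample of 60 against `backlogThreshold: 50` would shed the request with a bare 503, which is the status the NGINX retry setting in this change is meant to catch.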
We also need to configure NGINX to retry on 503 responses:
```
nginx-ingress:
  controller:
    config:
      proxy-next-upstream: "http_503"
      proxy-next-upstream-tries: "3"
      proxy-next-upstream-timeout: "10"
```
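
To make the intended retry behavior concrete, here is a minimal Python sketch of a proxy that passes a request to the next upstream when it receives a 503, capped by a number of tries and a time budget. It is a simplified model, not NGINX code: `proxy_with_retries`, the `send` callback, and the upstream names are hypothetical, and the defaults simply mirror the three values configured above.

```python
import time


def proxy_with_retries(upstreams, send, retry_statuses=(503,),
                       max_tries=3, timeout_seconds=10.0,
                       clock=time.monotonic):
    """Send a request to upstreams in order until one gives a final answer.

    send(upstream) returns an HTTP status code. A status listed in
    retry_statuses moves the request to the next upstream, unless the
    try limit or the time budget has been used up.
    """
    start = clock()
    status = None
    for tries, upstream in enumerate(upstreams, start=1):
        status = send(upstream)
        if status not in retry_statuses:
            return status  # Final answer; no retry needed.
        if tries >= max_tries or clock() - start >= timeout_seconds:
            break  # Out of tries or time; return the last status.
    return status


# Usage: two pods shed load with 503, the third one answers.
responses = {"pod-a": 503, "pod-b": 503, "pod-c": 200}
result = proxy_with_retries(["pod-a", "pod-b", "pod-c"], responses.get)
```

The sketch ignores the request method. By default NGINX does not pass non-idempotent requests (for example `POST`) to the next server once they have been sent upstream, unless `non_idempotent` is also listed in `proxy_next_upstream`.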
*Estimated Time to Complete (mins)* - {+Estimated Time to Complete in Minutes+}
- [x] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] Enable for all `gstg` deployments
- [ ] Enable for all `gprd-cny` deployments
- [ ] Enable for all `gprd` `us-east1-b` deployments
- [ ] Enable for all `gprd` `us-east1-c` deployments
- [ ] Enable for all `gprd` `us-east1-d` deployments
- [ ] Enable for all `gprd` deployments
- [ ] Set label ~"change::complete" `/label ~change::complete`
#### Gradual rollout to staging
**gstg-cny (Staging Canary)**:
- [x] Merge gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/5188
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards (see Monitoring section)
**gstg (Staging)**:
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards
- [ ] Verify Quality smoke and reliable pipelines passed
#### Gradual rollout to production (Manual Zonal Rollout)
**gprd-cny (Production Canary):**
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light
**Zonal Cluster B:**
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `b`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light
**Zonal Cluster C:**
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `c`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light
**Zonal Cluster D:**
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `d`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light
**Regional Cluster:**
- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light
#### Post-deployment
- [ ] `/chatops run auto_deploy unpause`
- [ ] Set label ~"change::complete" `/label ~change::complete`
## Rollback
### Rollback steps - steps to be taken in the event of a need to rollback this change
*Estimated Time to Complete (mins)* - {+Estimated Time to Complete in Minutes+}
- [ ] Revert MRs
- [ ] Set label ~"change::aborted" `/label ~change::aborted`
## Monitoring
### Key metrics to observe
<!--
* Describe which dashboards and which specific metrics we should be monitoring related to this change using the format below.
-->
- Metric: Error ratios
  - Location: https://dashboards.gitlab.net/goto/dfcy8io5hi58gc?orgId=1
  - What changes to this metric should prompt a rollback: Increased number of 50x errors
- Metric: API Apdex
  - Location: https://dashboards.gitlab.net/goto/dfcy8lbx0qku8b?orgId=1
  - What changes to this metric should prompt a rollback: Significant drops in Apdex correlated with errors during deployment/readiness checks
- Metric: Webservice Apdex
  - Location: https://dashboards.gitlab.net/goto/bfcy8oj859ji8e?orgId=1
  - What changes to this metric should prompt a rollback: Significant drops in Apdex correlated with errors during deployment/readiness checks
- Metric: 504 errors in Workhorse
  - Location (`gprd-cny`): https://log.gprd.gitlab.net/app/r/s/I9KIX (excludes `/-/readiness`)
  - Location (`gprd`): https://log.gprd.gitlab.net/app/r/s/CBZSF
- Metric: Number of times load shedding is active
  - Location: https://dashboards.gitlab.net/goto/cfg86yim90irke?orgId=1
- Metric: TCP backlog
  - Location: https://dashboards.gitlab.net/goto/afg872q5gfs3kb?orgId=1
  - Location: https://dashboards.gitlab.net/goto/ffgbxqa5bqccge?orgId=1
- Metric: Load shedding Workhorse messages
  - Location (`gprd-cny`): https://dashboards.gitlab.net/goto/afg877n5rwnwga?orgId=1
  - Location: https://log.gprd.gitlab.net/app/r/s/6To74