Enable Workhorse load shedding
<!-- Please review https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/ for the most recent information on our change plans and execution policies. -->

# Production Change

## Change Summary

We want to enable the Workhorse load shedder added in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/218865. This is to address https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/28055.

## Change Details

<!--
To automatically add your change to the GitLab Production calendar update the following fields:
- Time tracking
- Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)

Bot: https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/change-scheduler
-->

1. **Services Impacted** - ~"Service::Web" ~"Service::Websockets" ~"Service::API" ~"Service::Internal-API"
1. **Change Technician** - `@stanhu`
1. **Change Reviewer** - `@sun_lee`, `@msmiley`
1. **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - `2026-03-04 18:00`
1. **Time tracking** - `2 days`
1. **Downtime Component** - none

> [!IMPORTANT]
> If your change involves scheduled maintenance, add a step to set and
> [unset maintenance mode](https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/monitoring/set_maintenance_window.md)
> per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

## Preparation

> [!NOTE]
> The following checklists must be done in advance, before setting the label ~"change::scheduled"

### Change Reviewer checklist

<!-- To be filled out by the reviewer. -->

~C4 ~C3 ~C2 ~C1:

- [ ] Check if the following applies:
  - The **scheduled day and time** of execution of the change is appropriate.
  - The [change plan](#detailed-steps-for-the-change) is technically accurate.
  - The change plan includes **estimated timing values** based on previous testing.
  - The change plan includes a viable [rollback plan](#rollback).
  - The specified [metrics/monitoring dashboards](#key-metrics-to-observe) provide sufficient visibility for the change.

~C2 ~C1:

- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.

### Change Technician checklist

- [x] The [Change Criticality](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#change-criticalities) has been set appropriately and requirements have been reviewed.
- [ ] The [change plan](#detailed-steps-for-the-change) is technically accurate.
- [ ] The [rollback plan](#rollback) is technically accurate and detailed enough to be executed by anyone with access.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic
- [ ] Change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] The change execution window respects the [Production Change Lock periods](https://about.gitlab.com/handbook/engineering/infrastructure/change-management/#production-change-lock-pcl).
- [ ] Once all boxes above are checked, mark the change request as scheduled: `/label ~"change::scheduled"`
- [ ] For ~C1 and ~C2 change issues, the change event is added to the [GitLab Production](https://calendar.google.com/calendar/embed?src=gitlab.com_si2ach70eb1j65cnu040m3alq0%40group.calendar.google.com) calendar by the [change-scheduler bot](https://gitlab.com/gitlab-com/gl-infra/ops-team/toolkit/change-scheduler). It is scheduled to run every 2 hours.
- [ ] For ~C1 change issues, a Senior Infrastructure Manager has provided approval with the ~manager_approved label on the issue.
- [ ] For ~C2 change issues, an Infrastructure Manager provided approval with the ~manager_approved label on the issue.
- [ ] For ~C1 and ~C2 changes, mention `@gitlab-org/saas-platforms/inframanagers` in this issue to provide visibility to all infrastructure managers.
- [ ] For ~C1, ~C2, or ~"blocks deployments" change issues, confirm with Release managers that the change does not overlap or hinder any release process (In `#production` channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] For ~C1 change issues or ~C2 change issues happening during weekend, SREs on-call must be informed [at least 2 weeks in advance](https://handbook.gitlab.com/handbook/engineering/infrastructure-platforms/change-management/#approval). Check [the incident.io GitLab.com Production EOC schedule](https://app.incident.io/gitlab/on-call/schedules/01K5YWAGZ7YCQGAG7ATQ9XQWHW) to find who will be on-call at the scheduled day and time.

## Detailed steps for the change

### Pre-execution steps

> [!NOTE]
> The following steps should be done right at the scheduled time of the change request. The [preparation steps](#preparation) are
> listed above.

- [ ] Make sure all tasks in [Change Technician checklist](#change-technician-checklist) are done
- [ ] For ~C1 and ~C2 change issues, the SRE on-call has been informed prior to change being rolled out.
- [ ] The SRE on-call provided approval with the ~eoc_approved label on the issue.
- [ ] For ~C1, ~C2, or ~"blocks deployments" change issues, Release managers have been informed prior to change being rolled out. (In `#production` channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no [active incidents](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/?sort=created_date&state=opened&label_name%5B%5D=Incident%3A%3AActive&or%5Blabel_name%5D%5B%5D=severity%3A%3A1&or%5Blabel_name%5D%5B%5D=severity%3A%3A2&first_page_size=20) that are ~severity::1 or ~severity::2
- [ ] If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.

### Change steps - steps to take to execute the change

We need to add the `workhorse.loadShedding` config:

```
workhorse:
  loadShedding:
    enabled: true
    backlogThreshold: 50
    retryAfterSeconds: 0
    strategy: max
```

We also need to configure NGINX to retry on 503:

```
nginx-ingress:
  controller:
    config:
      proxy-next-upstream: "http_503"
      proxy-next-upstream-tries: "3"
      proxy-next-upstream-timeout: "10"
```

*Estimated Time to Complete (mins)* - {+Estimated Time to Complete in Minutes+}

- [x] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] Enable for all `gstg` deployments
- [ ] Enable for all `gprd-cny` deployments
- [ ] Enable for all `gprd` `us-east1-b` deployments
- [ ] Enable for all `gprd` `us-east1-c` deployments
- [ ] Enable for all `gprd` deployments
- [ ] Set label ~"change::complete" `/label ~change::complete`

#### Gradual rollout to staging

**gstg-cny (Staging Canary)**:

- [x] Merge https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/merge_requests/5188
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards (see Monitoring section)

**gstg (Staging)**:

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check
monitoring dashboards
- [ ] Verify Quality smoke and reliable pipelines passed

#### Gradual rollout to production (Manual Zonal Rollout)

**gprd-cny (Production Canary):**

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light

**Zonal Cluster B:**

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `b`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light

**Zonal Cluster C:**

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `c`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light

**Zonal Cluster D:**

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers` (set zone to `d`)
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light

**Regional Cluster:**

- [ ] Merge X
- [ ] Ping monitoring engineers and `@release-managers`
- [ ] Check monitoring dashboards
- [ ] Bake for 2 hours or until green light

#### Post-deployment

- [ ] `/chatops run auto_deploy unpause`
- [ ] Set label ~"change::complete" `/label ~change::complete`

## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

*Estimated Time to Complete (mins)* - {+Estimated Time to Complete in Minutes+}

- [ ] Revert MRs
- [ ] Set label ~"change::aborted" `/label ~change::aborted`

## Monitoring

### Key metrics to observe

<!--
* Describe which dashboards and which specific metrics we should be monitoring related to this change using the format below.
-->

- Metric: Error ratios
  - Location: https://dashboards.gitlab.net/goto/dfcy8io5hi58gc?orgId=1
  - What changes to this metric should prompt a rollback: Increased number of 50x errors
- Metric: API Apdex
  - Location: https://dashboards.gitlab.net/goto/dfcy8lbx0qku8b?orgId=1
  - What changes to this metric should prompt a rollback: Significant spikes in Apdex correlated with errors during deployment/readiness checks
- Metric: Webservice Apdex
  - Location: https://dashboards.gitlab.net/goto/bfcy8oj859ji8e?orgId=1
  - What changes to this metric should prompt a rollback: Significant spikes in Apdex correlated with errors during deployment/readiness checks
- Metric: 504 errors in Workhorse
  - Location (`gprd-cny`): https://log.gprd.gitlab.net/app/r/s/I9KIX (excludes `/-/readiness`)
  - Location (`gprd`): https://log.gprd.gitlab.net/app/r/s/CBZSF
- Metric: Number of times load shedding active
  - Location: https://dashboards.gitlab.net/goto/cfg86yim90irke?orgId=1
- Metric: TCP backlog
  - Location: https://dashboards.gitlab.net/goto/afg872q5gfs3kb?orgId=1
  - Location: https://dashboards.gitlab.net/goto/ffgbxqa5bqccge?orgId=1
- Metric: Load shedding Workhorse messages
  - Location (`gprd-cny`): https://dashboards.gitlab.net/goto/afg877n5rwnwga?orgId=1
  - Location: https://log.gprd.gitlab.net/app/r/s/6To74
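
For reviewers, here is a rough sketch of how the `workhorse.loadShedding` settings and the NGINX retry settings from the change steps are expected to interact. It is an illustration under assumptions, not Workhorse's or NGINX's actual implementation: the `Pod` type, the `respond` and `route` helpers, and the rule "shed with a 503 once the backlog reaches `backlogThreshold`" are all hypothetical. The `strategy: max` and `retryAfterSeconds` settings and the 10 second `proxy-next-upstream-timeout` are not modelled.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Values copied from the config in this issue.
BACKLOG_THRESHOLD = 50   # workhorse.loadShedding.backlogThreshold
NEXT_UPSTREAM_TRIES = 3  # proxy-next-upstream-tries (counts the first attempt)


@dataclass
class Pod:
    name: str
    backlog: int  # hypothetical: current backlog depth seen by this pod


def respond(pod: Pod) -> int:
    """Assumed shedding rule: reply 503 once the backlog reaches the threshold."""
    return 503 if pod.backlog >= BACKLOG_THRESHOLD else 200


def route(pods: Sequence[Pod]) -> Tuple[int, List[str]]:
    """Try pods in order, moving to the next one on 503, up to the tries limit."""
    tried: List[str] = []
    status = 503  # returned unchanged if there are no pods to try
    for pod in pods[:NEXT_UPSTREAM_TRIES]:
        tried.append(pod.name)
        status = respond(pod)
        if status != 503:
            break
    return status, tried


# One saturated pod: the retry lands on a healthy pod and the client sees 200.
print(route([Pod("a", 80), Pod("b", 10), Pod("c", 0)]))  # (200, ['a', 'b'])
```

Under these assumptions, a client only sees a 503 when every pod tried within the limit is shedding, which is why the error-ratio and TCP backlog dashboards above should be read together during each bake period.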