On-Call Handover 2025-10-21 07:00 UTC
Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)
👥 Shift Transitions
- EOC egress: @ayeung
- EOC ingress: @swainaina @vglafirov
- IMOC egress: @reprazent
- IMOC ingress: @reprazent
- CMOC egress: supportops kkamiya
- CMOC ingress: bfreitas supportops kkamiya tloughlin

Previous handover: On-Call Handover 2025-10-20 23:00 UTC
📊 Shift Statistics
- 🔥 2 ongoing incidents
- 📝 28 incidents in workflow
- 🔵 2 mitigated incidents
- ✅ 2 resolved incidents
- 📟 7 PagerDuty incidents
- 🔧 2 change requests
📖 Summary
What (if any) time-critical work is being handed over?
We have 2 low-severity incidents in a holding pattern at the moment:
- https://app.incident.io/gitlab/incidents/5089 - we attempted to roll out some changes to the webservice readiness checks in the Kube clusters. Unfortunately we appear to have hit an edge case when deployments happen and pods get terminated: lots of requests end up going to pods that are shutting down and get 502 responses. We've reverted the change, but have not yet had another deployment to validate that the problem is actually gone.
  - What you need to do: during the next deployment to canary, check whether you get alerted for 5xx errors at the loadbalancer for the `web`/`web-pages` services (see the first sketch after this list). If you don't get alerted, everything is OK and we were right about the cause of the problem. If not... 😬
- https://app.incident.io/gitlab/incidents/5070 - Zoekt got broken. It's fixed now, but the entire site needs to be reindexed.
  - What you need to do: not much; the advanced search team are largely running this themselves, but to help them out you could run `gitlab-rake gitlab:zoekt:info` from time to time to check on the progress of the reindexing (see the second sketch after this list).
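If you'd rather spot-check the loadbalancer 5xx picture yourself instead of waiting for the pager, here is a minimal shell sketch that queries a Prometheus-compatible endpoint for the 5xx rate on the web and web-pages backends. The `THANOS_URL` endpoint and the `haproxy_backend_http_responses_total` metric/label names are assumptions based on a standard haproxy_exporter setup, not confirmed values for gprd; the existing alert rules remain the source of truth.

```bash
#!/usr/bin/env bash
# Sketch: spot-check loadbalancer 5xx rates for web/web-pages during a canary
# deployment. Assumes a reachable Prometheus/Thanos HTTP API (THANOS_URL is a
# hypothetical placeholder) and haproxy_exporter metric/label names; adjust to
# whatever the gprd dashboards actually use.
set -euo pipefail

THANOS_URL="${THANOS_URL:-https://thanos.example.internal}"  # hypothetical endpoint

for backend in web web-pages; do
  # 5xx responses/sec over the last 5 minutes for this backend.
  query="sum(rate(haproxy_backend_http_responses_total{code=\"5xx\",backend=~\".*${backend}.*\"}[5m]))"
  result=$(curl -sfG "${THANOS_URL}/api/v1/query" \
    --data-urlencode "query=${query}" | jq -r '.data.result[0].value[1] // "0"')
  echo "${backend}: ${result} 5xx/s (5m rate)"
done
```

For the Zoekt reindexing, the handover only asks for an occasional progress check. A simple loop like the one below is enough; run it wherever `gitlab-rake` is available (e.g. a console node), and note that the node choice and 30-minute cadence are assumptions, not prescribed by the team.

```bash
# Sketch: periodically print Zoekt reindexing progress. Run where gitlab-rake
# exists (e.g. a gprd console node); the interval is arbitrary.
while true; do
  date -u
  sudo gitlab-rake gitlab:zoekt:info
  sleep 1800   # re-check every 30 minutes
done
```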
What contextual info may be useful for the next few on-call shifts?
🚨 Critical: Ongoing Incidents
GitLab Incidents
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | **Problem**: Zoekt search nodes went offline due to certificate verification error... |
| Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | **Problem**: Since early August, some CI jobs remain indefinitely stuck in the run... |
🔵 Monitoring: Mitigated Incidents
👀 Under Monitoring (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | **Problem**: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
| Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | **Problem**: A surge of expensive commit-related requests, mainly from unauthenti... |
✅ Recently Resolved
🎉 Resolved During Shift (9 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| rails_request error rate viola... | Closed | severity3 | 2025-10-20 | #20740 | **Problem**: Elevated error rates for the ai-assisted and ai-gateway services due ... |
| generate-facts job in omnibus-... | Closed | severity3 | 2025-10-20 | #20736 | generate-facts job in omnibus-gitlab pipeline is failing due to a missing tag ... |
| [#141157] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141157] firing - Service web (gprd) |
| [#141159] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141159] firing - Service web-pages (gprd) |
| [#141187] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141187] firing - Service web (gprd) |
| [#141189] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141189] firing - Service web-pages (gprd) |
| [#141193] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141193] firing - Service web (gprd) |
| [#141194] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141194] firing - Service web-pages (gprd) |
| [#141216] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141216] firing - Service sidekiq (gprd) |
📝 Incidents in Workflow
📋 Documenting/Reviewing Phase (28 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | **Problem**: An upstream service disruption caused a ripple effect, including a spi... |
| Rate Limit - Scheduled Brownou... | Documenting | severity4 | 2025-10-20 | #20738 | **Problem**: A scheduled brownout has enforced the authenticated API rate limit at... |
| Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | **Problem**: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
| Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | **Problem**: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
| Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | **Problem**: Requests to the API in the cny stage experienced increased queue time... |
| Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
| Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
| Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
| Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | **Problem**: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
| Multiple Versions OF Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | **Problem**: One Gitaly node was running an outdated version due to automation fai... |
| SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
| Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node began to drop since 2025-08-04, remaining at ~98.5%. W... |
| Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
| lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | [https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849](https://gitlab.s... |
| Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
| 2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This failure is not PROD failure, but REPLICA. Happened yesterday as well Pipeli... |
| 2025-06-24 \| Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
| Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
| AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
| Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | **Problem**: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
| GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | **Problem**: GitLab.com experienced complete database saturation across the patron... |
| Gitlab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | **Problem**: GitLab became intermittently unavailable for git operations due to se... |
| The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | **Problem**: Git operations were degraded in both main and canary environments, co... |
| GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
| Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | **Problem**: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
| Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
| Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
| Loadbalancer 5xx error rate fo... | Paused | severity3 | 2025-10-20 | #20745 | **Problem**: A spike in 5xx errors affected both web and web-pages services in the... |
🔧 Change Management
🔄 In Progress
No change requests currently in progress.
✅ Completed During Shift
📋 Closed Changes (2 requests)
| Change Request | Closed | GitLab | Status |
|---|---|---|---|
| 2025-10-21: Migrate PlantUML deployment from Tanka to ArgoCD | 2025-10-21 | GitLab | |
| Migrate fluentd-archiver to vector for gprd | 2025-10-21 | GitLab | |
Generated via woodhouse on 2025-10-21T07:00:00Z
Edited by Adeline Yeung