On-Call Handover 2025-10-21 07:00 UTC

On-Call Handover

Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)

👥 Shift Transitions

📊 Shift Statistics

  • 🔥 2 ongoing incidents
  • 📝 28 incidents in workflow
  • 🔵 2 mitigated incidents
  • 2 resolved incidents
  • 📟 7 PagerDuty incidents
  • 🔧 2 change requests

📖 Summary

What (if any) time-critical work is being handed over?

We have 2 low-severity incidents in a holding pattern at the moment:

  • https://app.incident.io/gitlab/incidents/5089 - We attempted to roll out some changes to the webservice readiness checks in the Kube clusters. Unfortunately, it seems we hit an edge case when pods are terminated during deployments: many requests are routed to pods that are shutting down and get 502 responses. We've reverted the change, but have not yet had another deployment to validate that the problem is actually gone.
    • What you need to do: during the next deployment to canary, watch for 5xx error alerts at the loadbalancer for the web/web-pages services. If you are not alerted, everything is OK and we were right about the cause of the problem. If you are alerted... 😬
  • https://app.incident.io/gitlab/incidents/5070 - Zoekt broke. It's fixed now, but the entire site needs to be reindexed.
    • What you need to do: not much. The advanced search team is largely running this themselves, but to help them out you could run gitlab-rake gitlab:zoekt:info from time to time to check on the progress of the reindexing.
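
If you would rather not remember to re-run the Zoekt check by hand, the periodic check could be sketched roughly as below. This is a minimal sketch, not an established runbook step: it assumes you are on a host where gitlab-rake is available, the helper names (run_command, poll_zoekt_info) and the check count and interval are hypothetical, and it does not try to parse the task's output, it only prints each snapshot with a UTC timestamp so you can compare them by eye.

```python
import subprocess
import time
from datetime import datetime, timezone
from typing import Callable, List, Sequence, Tuple

# Command taken verbatim from the handover note above.
ZOEKT_INFO_CMD = ["gitlab-rake", "gitlab:zoekt:info"]


def run_command(cmd: Sequence[str]) -> str:
    """Run a command and return its combined stdout/stderr as text."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a failing check should be printed, not abort the loop
    )
    return result.stdout


def poll_zoekt_info(
    checks: int,
    interval_seconds: float,
    runner: Callable[[Sequence[str]], str] = run_command,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[str, str]]:
    """Run the info task `checks` times, pausing between runs.

    Returns a list of (UTC timestamp, raw output) pairs. `runner` and
    `sleep` are injectable (hypothetical hooks) so the loop can be
    exercised without gitlab-rake installed.
    """
    snapshots: List[Tuple[str, str]] = []
    for i in range(checks):
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        output = runner(ZOEKT_INFO_CMD)
        print(f"--- {stamp} ---")
        print(output)
        snapshots.append((stamp, output))
        if i < checks - 1:  # no need to wait after the last check
            sleep(interval_seconds)
    return snapshots
```

For example, poll_zoekt_info(checks=4, interval_seconds=1800) would take four snapshots half an hour apart; both numbers are placeholders to adjust to however long the reindexing is expected to run.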

What contextual info may be useful for the next few on-call shifts?

🚨 Critical: Ongoing Incidents

GitLab Incidents

| Incident | Status | Severity | Created | GitLab | Summary |
| --- | --- | --- | --- | --- | --- |
| Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | Problem: Zoekt search nodes went offline due to certificate verification error... |
| Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | Problem: Since early August, some CI jobs remain indefinitely stuck in the run... |

🔵 Monitoring: Mitigated Incidents

👀 Under Monitoring (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
| --- | --- | --- | --- | --- | --- |
| Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | Problem: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
| Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | Problem: A surge of expensive commit-related requests—mainly from unauthenti... |

Recently Resolved

🎉 Resolved During Shift (9 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
| --- | --- | --- | --- | --- | --- |
| rails_request error rate viola... | Closed | severity3 | 2025-10-20 | #20740 | Problem: Elevated error rates for the ai-assisted and ai-gateway services due ... |
| generate-facts job in omnibus-... | Closed | severity3 | 2025-10-20 | #20736 | generate-facts job in omnibus-gitlab pipeline is failing due to a missing tag ... |
| [#141157] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141157] firing - Service web (gprd) |
| [#141159] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141159] firing - Service web-pages (gprd) |
| [#141187] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141187] firing - Service web (gprd) |
| [#141189] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141189] firing - Service web-pages (gprd) |
| [#141193] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141193] firing - Service web (gprd) |
| [#141194] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141194] firing - Service web-pages (gprd) |
| [#141216] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141216] firing - Service sidekiq (gprd) |

📝 Incidents in Workflow

📋 Documenting/Reviewing Phase (28 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
| --- | --- | --- | --- | --- | --- |
| dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | Problem: An upstream service disruption caused a ripple effect, including a spi... |
| Rate Limit - Scheduled Brownou... | Documenting | severity4 | 2025-10-20 | #20738 | Problem: A scheduled brownout has enforced the authenticated API rate limit at... |
| Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | Problem: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
| Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | Problem: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
| Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | Problem: Requests to the API in the cny stage experienced increased queue time... |
| Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
| Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
| Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
| Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | Problem: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
| Multiple Versions OF Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | Problem: One Gitaly node was running an outdated version due to automation fai... |
| SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
| Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node began to drop since 2025-08-04, remaining at ~98.5%. W... |
| Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
| lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | [https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849](https://gitlab.s... |
| Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
| 2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This failure is not PROD failure, but REPLICA. Happened yesterday as well Pipeli... |
| 2025-06-24 \| Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
| Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
| AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
| Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | Problem: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
| GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | Problem: GitLab.com experienced complete database saturation across the patron... |
| Gitlab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | Problem: GitLab became intermittently unavailable for git operations due to se... |
| The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | Problem: Git operations were degraded in both main and canary environments, co... |
| GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
| Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | Problem: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
| Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
| Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
| Loadbalancer 5xx error rate fo... | Paused | severity3 | 2025-10-20 | #20745 | Problem: A spike in 5xx errors affected both web and web-pages services in the... |

🔧 Change Management

🔄 In Progress

No change requests currently in progress.

Completed During Shift

📋 Closed Changes (2 requests)
| Change Request | Closed | GitLab | Status |
| --- | --- | --- | --- |
| 2025-10-21: Migrate PlantUML deployment from Tanka to ArgoCD | 2025-10-21 | GitLab | Closed |
| Migrate fluentd-archiver to vector for gprd | 2025-10-21 | GitLab | Closed |

Generated via woodhouse on 2025-10-21T07:00:00Z

Edited by Adeline Yeung