On-Call Handover 2025-10-21 23:00 UTC

Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)

👥 Shift Transitions

📊 Shift Statistics

  • 🔥 2 ongoing incidents
  • 📝 26 incidents in workflow
  • 🔵 2 mitigated incidents
  • 1 resolved incident
  • 📟 4 PagerDuty incidents
  • 🔧 2 change requests

📖 Summary

What (if any) time-critical work is being handed over?

What contextual info may be useful for the next few on-call shifts?

🚨 Critical: Ongoing Incidents

GitLab Incidents

Incident | Status | Severity | Created | GitLab | Summary
Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | Problem: Zoekt search nodes went offline due to certificate verification error...
Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | Problem: Since early August, some CI jobs remain indefinitely stuck in the run...

🔵 Monitoring: Mitigated Incidents

👀 Under Monitoring (2 incidents)
Incident | Status | Severity | Created | GitLab | Summary
Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | Problem: The error rate for CI runner jobs on the saas-linux-large-amd64 shard...
Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | Problem: A surge of expensive commit-related requests, mainly from unauthenti...

Recently Resolved

🎉 Resolved During Shift (5 incidents)
Incident | Status | Severity | Created | GitLab | Summary
No traffic reported by primary... | Closed | severity3 | 2025-10-21 | #20749 | Problem: The primary_server component of the redis-registry-cache system in th...
[#141299] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141299] firing - Service sidekiq (gprd)
[#141302] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141302] firing - Service sidekiq (gprd)
[#141357] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141357] firing - Service sidekiq (gprd)
[#141362] firing - Service bla... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141362] firing - Service blackbox (gprd)

📝 Incidents in Workflow

📋 Documenting/Reviewing Phase (26 incidents)
Incident | Status | Severity | Created | GitLab | Summary
dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | Problem: An upstream service disruption caused a ripple effect, including a spi...
Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | Problem: The 'eslint:fe-island' job in the monthly release candidate pipeline ...
Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | Problem: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased...
Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | Problem: Requests to the API in the cny stage experienced increased queue time...
Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd...
Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v...
Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde...
Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | Problem: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to...
Multiple Versions OF Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | Problem: One Gitaly node was running an outdated version due to automation fai...
SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score...
Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node began to drop since 2025-08-04, remaining at ~98.5%. W...
Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ...
lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849 ...
Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage.
2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This failure is not a PROD failure, but REPLICA. Happened yesterday as well. Pipeli...
2025-06-24 | Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st...
Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn...
AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ...
Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | Problem: A user-triggered, unoptimized query pattern from a scheduled CI pipel...
GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | Problem: GitLab.com experienced complete database saturation across the patron...
Gitlab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | Problem: GitLab became intermittently unavailable for git operations due to se...
The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | Problem: Git operations were degraded in both main and canary environments, co...
GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload...
Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | Problem: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t...
Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n...
Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA...

🔧 Change Management

🔄 In Progress

No change requests currently in progress.

Completed During Shift

📋 Closed Changes (2 requests)
Change Request | Closed | GitLab | Status
[Feature flag]: Rollout of `allow_immediate_namespaces_delet... | 2025-10-21 | GitLab | Closed
2025-10-20: [gprd] Decommission redis-registry-cache sentine... | 2025-10-21 | GitLab | Closed

Generated via woodhouse on 2025-10-21T23:00:00Z