On-Call Handover 2025-10-21 23:00 UTC
Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)
👥 Shift Transitions
- EOC egress: @danielryan
- EOC ingress: @ayeung
- IMOC egress: @mikeeddington
- IMOC ingress: @nrosandich
- CMOC egress: mdunninger twilliams supportops
- CMOC ingress: mdunninger twilliams clawrence supportops
Previous handover: On-Call Handover 2025-10-21 15:00 UTC
📊 Shift Statistics
- 🔥 2 ongoing incidents
- 📝 26 incidents in workflow
- 🔵 2 mitigated incidents
- ✅ 1 resolved incident
- 📟 4 PagerDuty incidents
- 🔧 2 change requests
📖 Summary
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
🚨 Critical: Ongoing Incidents
GitLab Incidents
Incident | Status | Severity | Created | GitLab | Summary |
---|---|---|---|---|---|
Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | **Problem**: Zoekt search nodes went offline due to certificate verification error... |
Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | **Problem**: Since early August, some CI jobs remain indefinitely stuck in the run... |
🔵 Monitoring: Mitigated Incidents
👀 Under Monitoring (2 incidents)
Incident | Status | Severity | Created | GitLab | Summary |
---|---|---|---|---|---|
Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | **Problem**: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | **Problem**: A surge of expensive commit-related requests, mainly from unauthenti... |
✅ Recently Resolved
🎉 Resolved During Shift (5 incidents)
Incident | Status | Severity | Created | GitLab | Summary |
---|---|---|---|---|---|
No traffic reported by primary... | Closed | severity3 | 2025-10-21 | #20749 | **Problem**: The primary_server component of the redis-registry-cache system in th... |
[#141299] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141299] firing - Service sidekiq (gprd) |
[#141302] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141302] firing - Service sidekiq (gprd) |
[#141357] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141357] firing - Service sidekiq (gprd) |
[#141362] firing - Service bla... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141362] firing - Service blackbox (gprd) |
📝 Incidents in Workflow
📋 Documenting/Reviewing Phase (26 incidents)
Incident | Status | Severity | Created | GitLab | Summary |
---|---|---|---|---|---|
dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | **Problem**: An upstream service disruption caused a ripple effect, including a spi... |
Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | **Problem**: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | **Problem**: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | **Problem**: Requests to the API in the cny stage experienced increased queue time... |
Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | **Problem**: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
Multiple Versions OF Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | **Problem**: One Gitaly node was running an outdated version due to automation fai... |
SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node began to drop since 2025-08-04, remaining at ~98.5%. W... |
Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | [https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849](https://gitlab.s... |
Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This failure is not PROD failure, but REPLICA. Happened yesterday as well Pipeli... |
2025-06-24 | Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | **Problem**: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | **Problem**: GitLab.com experienced complete database saturation across the patron... |
Gitlab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | **Problem**: GitLab became intermittently unavailable for git operations due to se... |
The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | **Problem**: Git operations were degraded in both main and canary environments, co... |
GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | **Problem**: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
🔧 Change Management
🔄 In Progress
No change requests currently in progress.
✅ Completed During Shift
📋 Closed Changes (2 requests)
Change Request | Closed | GitLab | Status |
---|---|---|---|
[Feature flag]: Rollout of `allow_immediate_namespaces_delet... | 2025-10-21 | GitLab | |
2025-10-20: [gprd] Decommission redis-registry-cache sentine... | 2025-10-21 | GitLab | |
Generated via woodhouse on 2025-10-21T23:00:00Z