On-Call Handover 2025-10-21 15:00 UTC
Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)
👥 Shift Transitions
- EOC egress: @swainaina @vglafirov
- EOC ingress: @danielryan
- IMOC egress: @sashi_kumar
- IMOC ingress: @jayswain
- CMOC egress: bfreitas supportops tloughlin
- CMOC ingress: twilliams bfreitas supportops imason

Previous handover: On-Call Handover 2025-10-21 07:00 UTC
📊 Shift Statistics
- 🔥 2 ongoing incidents
- 📝 27 incidents in workflow
- 🔵 2 mitigated incidents
- ✅ 1 resolved incident
- 📟 1 PagerDuty incident
- 🔧 0 change requests
📖 Summary
- What (if any) time-critical work is being handed over?
- What contextual info may be useful for the next few on-call shifts?
🚨 Critical: Ongoing Incidents
GitLab Incidents
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | **Problem**: Zoekt search nodes went offline due to certificate verification error... |
| Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | **Problem**: Since early August, some CI jobs remain indefinitely stuck in the run... |
🔵 Monitoring: Mitigated Incidents
👀 Under Monitoring (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | **Problem**: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
| Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | **Problem**: A surge of expensive commit-related requests, mainly from unauthenti... |
✅ Recently Resolved
🎉 Resolved During Shift (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| No traffic reported by primary... | Closed | severity3 | 2025-10-21 | #20749 | **Problem**: The primary_server component of the redis-registry-cache system in th... |
| [#141237] firing - Service red... | Resolved | pd_urgencyhigh | 2025-10-21 | - | [#141237] firing - Service redis-registry-cache (gprd) |
📝 Incidents in Workflow
📋 Documenting/Reviewing Phase (27 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | **Problem**: An upstream service disruption caused a ripple effect, including a spi... |
| Rate Limit - Scheduled Brownou... | Documenting | severity4 | 2025-10-20 | #20738 | **Problem**: A scheduled brownout has enforced the authenticated API rate limit at... |
| Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | **Problem**: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
| Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | **Problem**: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
| Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | **Problem**: Requests to the API in the cny stage experienced increased queue time... |
| Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
| Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
| Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
| Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | **Problem**: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
| Multiple Versions of Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | **Problem**: One Gitaly node was running an outdated version due to automation fai... |
| SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
| Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node has been dropping since 2025-08-04, remaining at ~98.5%. W... |
| Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
| lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849 |
| Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
| 2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This is not a PROD failure but a REPLICA one; it happened yesterday as well. Pipeli... |
| 2025-06-24 \| Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
| Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
| AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
| Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | **Problem**: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
| GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | **Problem**: GitLab.com experienced complete database saturation across the patron... |
| GitLab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | **Problem**: GitLab became intermittently unavailable for git operations due to se... |
| The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | **Problem**: Git operations were degraded in both main and canary environments, co... |
| GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
| Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | **Problem**: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
| Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
| Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
🔧 Change Management
🔄 In Progress
No change requests are currently in progress, and no changes were completed during this shift.
Generated via woodhouse on 2025-10-21T15:00:00Z