On-Call Handover 2025-10-21 15:00 UTC
Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)
👥 Shift Transitions
- EOC egress: @swainaina @vglafirov
- EOC ingress: @danielryan
- IMOC egress: @sashi_kumar
- IMOC ingress: @jayswain
- CMOC egress: bfreitas supportops tloughlin
- CMOC ingress: twilliams bfreitas supportops imason

Previous handover: On-Call Handover 2025-10-21 07:00 UTC
📊 Shift Statistics
- 🔥 2 ongoing incidents
- 📝 27 incidents in workflow
- 🔵 2 mitigated incidents
- ✅ 1 resolved incident
- 📟 1 PagerDuty incident
- 🔧 0 change requests
📖 Summary
- What (if any) time-critical work is being handed over?
- What contextual info may be useful for the next few on-call shifts?
🚨 Critical: Ongoing Incidents
GitLab Incidents
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | **Problem**: Zoekt search nodes went offline due to certificate verification error... |
| Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | **Problem**: Since early August, some CI jobs remain indefinitely stuck in the run... |
🔵 Monitoring: Mitigated Incidents
👀 Under Monitoring (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | **Problem**: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
| Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | **Problem**: A surge of expensive commit-related requests, mainly from unauthenti... |
✅ Recently Resolved
🎉 Resolved During Shift (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| No traffic reported by primary... | Closed | severity3 | 2025-10-21 | #20749 | **Problem**: The primary_server component of the redis-registry-cache system in th... |
| [#141237] firing - Service red... | Resolved | pd_urgencyhigh | 2025-10-21 | - | [#141237] firing - Service redis-registry-cache (gprd) |
📝 Incidents in Workflow
📋 Documenting/Reviewing Phase (27 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | **Problem**: An upstream service disruption caused a ripple effect, including a spi... |
| Rate Limit - Scheduled Brownou... | Documenting | severity4 | 2025-10-20 | #20738 | **Problem**: A scheduled brownout has enforced the authenticated API rate limit at... |
| Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | **Problem**: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
| Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | **Problem**: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
| Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | **Problem**: Requests to the API in the cny stage experienced increased queue time... |
| Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
| Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
| Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
| Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | **Problem**: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
| Multiple Versions of Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | **Problem**: One Gitaly node was running an outdated version due to automation fai... |
| SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
| Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node has been dropping since 2025-08-04, remaining at ~98.5%. W... |
| Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
| lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849 |
| Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
| 2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This is not a PROD failure but a REPLICA one; it happened yesterday as well. Pipeli... |
| 2025-06-24 \| Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
| Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
| AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
| Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | **Problem**: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
| GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | **Problem**: GitLab.com experienced complete database saturation across the patron... |
| GitLab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | **Problem**: GitLab became intermittently unavailable for git operations due to se... |
| The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | **Problem**: Git operations were degraded in both main and canary environments, co... |
| GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
| Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | **Problem**: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
| Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
| Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
🔧 Change Management
🔄 In Progress
No change requests are currently in progress, and no changes were completed during this shift.
Generated via woodhouse on 2025-10-21T15:00:00Z