On-Call Handover 2025-10-21 07:00 UTC
Brought to you by woodhouse for EOC shift changes only (not IMOC or CMOC)
👥 Shift Transitions
- EOC egress: @ayeung
- EOC ingress: @swainaina @vglafirov
- IMOC egress: @reprazent
- IMOC ingress: @reprazent
- CMOC egress: supportops kkamiya
- CMOC ingress: bfreitas supportops kkamiya tloughlin

Previous handover: On-Call Handover 2025-10-20 23:00 UTC
📊 Shift Statistics
- 🔥 2 ongoing incidents
- 📝 28 incidents in workflow
- 🔵 2 mitigated incidents
- ✅ 2 resolved incidents
- 📟 7 PagerDuty incidents
- 🔧 2 change requests
📖 Summary
What (if any) time-critical work is being handed over?
We have 2 low-severity incidents in a holding pattern at the moment:
- https://app.incident.io/gitlab/incidents/5089 - we attempted to roll out some changes to the webservice readiness checks in the Kube clusters. Unfortunately we appear to have hit an edge case when deployments happen and pods get terminated: lots of requests end up going to pods that are shutting down and get 502 responses. We've reverted the change, but have not yet had another deployment to validate that the problem is actually gone.
  - What you need to do: during the next deployment to canary, check whether you get alerted for 5xx errors at the loadbalancer for the `web`/`web-pages` services (see the first sketch after this list). If you don't get alerted, everything is OK and we were right about the cause of the problem. If not... 😬
- https://app.incident.io/gitlab/incidents/5070 - Zoekt got broken. It's fixed now, but the entire site needs to be reindexed.
  - What you need to do: not much; the advanced search team are largely running this themselves, but to help them out you could run `gitlab-rake gitlab:zoekt:info` from time to time to check on the progress of the reindexing (see the second sketch after this list).
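If you'd rather spot-check the loadbalancer 5xx picture yourself instead of waiting for the pager, here is a minimal shell sketch that queries a Prometheus-compatible endpoint for the 5xx rate on the web and web-pages backends. The `THANOS_URL` endpoint and the `haproxy_backend_http_responses_total` metric/label names are assumptions based on a standard haproxy_exporter setup, not confirmed values for gprd; the existing alert rules remain the source of truth.

```bash
#!/usr/bin/env bash
# Sketch: spot-check loadbalancer 5xx rates for web/web-pages during a canary
# deployment. Assumes a reachable Prometheus/Thanos HTTP API (THANOS_URL is a
# hypothetical placeholder) and haproxy_exporter metric/label names; adjust to
# whatever the gprd dashboards actually use.
set -euo pipefail

THANOS_URL="${THANOS_URL:-https://thanos.example.internal}"  # hypothetical endpoint

for backend in web web-pages; do
  # 5xx responses/sec over the last 5 minutes for this backend.
  query="sum(rate(haproxy_backend_http_responses_total{code=\"5xx\",backend=~\".*${backend}.*\"}[5m]))"
  result=$(curl -sfG "${THANOS_URL}/api/v1/query" \
    --data-urlencode "query=${query}" | jq -r '.data.result[0].value[1] // "0"')
  echo "${backend}: ${result} 5xx/s (5m rate)"
done
```

For the Zoekt reindexing, the handover only asks for an occasional progress check. A simple loop like the one below is enough; run it wherever `gitlab-rake` is available (e.g. a console node), and note that the node choice and 30-minute cadence are assumptions, not prescribed by the team.

```bash
# Sketch: periodically print Zoekt reindexing progress. Run where gitlab-rake
# exists (e.g. a gprd console node); the interval is arbitrary.
while true; do
  date -u
  sudo gitlab-rake gitlab:zoekt:info
  sleep 1800   # re-check every 30 minutes
done
```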
What contextual info may be useful for the next few on-call shifts?
🚨 Critical: Ongoing Incidents
GitLab Incidents
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Zoekt search nodes are not ena... | Fixing | severity3 | 2025-10-20 | #20741 | **Problem**: Zoekt search nodes went offline due to certificate verification error... |
| Jobs stuck in running indefini... | Fixing | severity3 | 2025-09-02 | #20469 | **Problem**: Since early August, some CI jobs remain indefinitely stuck in the run... |
🔵 Monitoring: Mitigated Incidents
👀 Under Monitoring (2 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| Error rate SLO violation for c... | Monitoring | severity4 | 2025-10-14 | #20722 | **Problem**: The error rate for CI runner jobs on the saas-linux-large-amd64 shard... |
| Ongoing ResourceExhausted issu... | Monitoring | severity4 | 2025-09-18 | #20572 | **Problem**: A surge of expensive commit-related requests, mainly from unauthenti... |
✅ Recently Resolved
🎉 Resolved During Shift (9 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| rails_request error rate viola... | Closed | severity3 | 2025-10-20 | #20740 | **Problem**: Elevated error rates for the ai-assisted and ai-gateway services due ... |
| generate-facts job in omnibus-... | Closed | severity3 | 2025-10-20 | #20736 | generate-facts job in omnibus-gitlab pipeline is failing due to a missing tag ... |
| [#141157] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141157] firing - Service web (gprd) |
| [#141159] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-20 | - | [#141159] firing - Service web-pages (gprd) |
| [#141187] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141187] firing - Service web (gprd) |
| [#141189] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141189] firing - Service web-pages (gprd) |
| [#141193] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141193] firing - Service web (gprd) |
| [#141194] firing - Service web... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141194] firing - Service web-pages (gprd) |
| [#141216] firing - Service sid... | resolved | pd_urgencyhigh | 2025-10-21 | - | [#141216] firing - Service sidekiq (gprd) |
📝 Incidents in Workflow
📋 Documenting/Reviewing Phase (28 incidents)
| Incident | Status | Severity | Created | GitLab | Summary |
|---|---|---|---|---|---|
| dotcom impact from aws outage | Documenting | severity3 | 2025-10-20 | #20739 | **Problem**: An upstream service disruption caused a ripple effect, including a spi... |
| Rate Limit - Scheduled Brownou... | Documenting | severity4 | 2025-10-20 | #20738 | **Problem**: A scheduled brownout has enforced the authenticated API rate limit at... |
| Monthly release candidate pipe... | Documenting | severity3 | 2025-10-14 | #20721 | **Problem**: The 'eslint:fe-island' job in the monthly release candidate pipeline ... |
| Sidekiq queueing apdex SLO vio... | Documenting | severity4 | 2025-10-06 | #20672 | **Problem**: Sidekiq jobs on the low-urgency-cpu-bound shard experienced increased... |
| Increased request queuing in a... | Documenting | severity3 | 2025-10-06 | #20671 | **Problem**: Requests to the API in the cny stage experienced increased queue time... |
| Search query apdex for zoekt_s... | Documenting | severity4 | 2025-10-01 | #20642 | The exact code search component for GitLab.com, backed by Zoekt, had a brief apd... |
| Apdex SLO violation for zoekt_... | Documenting | severity4 | 2025-09-30 | #20633 | The zoekt_searching component for exact code search on GitLab.com has an apdex v... |
| Sidekiq jobs on low-urgency-cp... | Documenting | severity4 | 2025-09-30 | #20631 | The Sidekiq queueing performance for the low-urgency-cpu-bound shard has an apde... |
| Sidekiq queueing SLI apdex SLO... | Documenting | severity3 | 2025-09-30 | #20628 | **Problem**: Low-urgency sidekiq jobs were delayed for up to nearly 2 hours due to... |
| Multiple Versions OF Gitaly Ru... | Documenting | severity4 | 2025-09-24 | #20600 | **Problem**: One Gitaly node was running an outdated version due to automation fai... |
| SLO violation in ai-gateway's ... | Documenting | severity4 | 2025-08-07 | #20318 | The AI-gateway service in the europe-west2 region is experiencing an Apdex score... |
| Gitaly goserver SLI apdex viol... | Documenting | severity3 | 2025-08-06 | #20310 | Apdex for the Gitaly node began to drop since 2025-08-04, remaining at ~98.5%. W... |
| Apdex SLO violation in patroni... | Documenting | severity3 | 2025-08-05 | #20308 | The Apdex score for SQL transactions in the Patroni service on the 'main' stage ... |
| lilac-hare | Documenting | severity3 | 2025-07-31 | #20284 | [https://gitlab.slack.com/archives/CETG54GQ0/p1753959312265849](https://gitlab.s... |
| Duo Chat consistent errors | Documenting | severity3 | 2025-07-17 | #20217 | Duo Chat experienced consistent errors due to an Anthropic outage. |
| 2025-05-25: Data-Server Rebuil... | Documenting | severity4 | 2025-06-25 | #20072 | This failure is not PROD failure, but REPLICA. Happened yesterday as well Pipeli... |
| 2025-06-24 \| Data server rebui... | Documenting | severity4 | 2025-06-24 | #20066 | The data server rebuild failed on a replacedisk job due to the instance being st... |
| Long-running transaction on Sb... | Documenting | severity3 | 2025-06-04 | #19945 | The Sbom::BuildDependencyGraphWorker endpoint on the sidekiq application is runn... |
| AI Gateway: error rate SLO vio... | Documenting | severity3 | 2025-03-11 | #19460 | The AI Gateway is experiencing an error rate SLO violation due to tightening of ... |
| Patroni main SLI dropping | Reviewing | severity1 | 2025-10-09 | #20705 | **Problem**: A user-triggered, unoptimized query pattern from a scheduled CI pipel... |
| GitLab.com slow and not loadin... | Reviewing | severity1 | 2025-10-07 | #20688 | **Problem**: GitLab.com experienced complete database saturation across the patron... |
| Gitlab unavailable due to load | Reviewing | severity2 | 2025-10-07 | #20686 | **Problem**: GitLab became intermittently unavailable for git operations due to se... |
| The gitlab_sshd git service un... | Reviewing | severity2 | 2025-10-07 | #20680 | **Problem**: Git operations were degraded in both main and canary environments, co... |
| GitLab.com Shared runners fail... | Reviewing | severity2 | 2025-10-05 | #20664 | Shared runners on GitLab.com failed to upload... |
| Image pull failures on SaaS Li... | Reviewing | severity2 | 2025-09-24 | #20607 | **Problem**: SaaS Linux runners have a 15.06% image pull failure rate, exceeding t... |
| Packagecloud service error rat... | Reviewing | severity2 | 2025-07-22 | #20244 | The Packagecloud service is experiencing an elevated error rate due to a large n... |
| Packagecloud service experienc... | Reviewing | severity2 | 2025-07-16 | #20199 | We've seen the issue re-appear later in the day and we implemented Cloudflare WA... |
| Loadbalancer 5xx error rate fo... | Paused | severity3 | 2025-10-20 | #20745 | **Problem**: A spike in 5xx errors affected both web and web-pages services in the... |
🔧 Change Management
🔄 In Progress
No change requests currently in progress.
✅ Completed During Shift
📋 Closed Changes (2 requests)
| Change Request | Closed | GitLab | Status |
|---|---|---|---|
| 2025-10-21: Migrate PlantUML deployment from Tanka to ArgoCD | 2025-10-21 | GitLab | |
| Migrate fluentd-archiver to vector for gprd | 2025-10-21 | GitLab | |
Generated via woodhouse on 2025-10-21T07:00:00Z
Edited by Adeline Yeung