Weekly Reliability (SRE) Team Newsletter – On-call Period: 2025-10-28 - 2025-11-04
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Sarah Walker |
| SRE 8-hour APAC | Pierre Guinoiseau |
| SRE 8-hour APAC | Furhan Shabir |
| SRE 8-hour APAC | Tarun Khandelwal |
| SRE 8-hour EMEA | Ahmad Sherif |
| SRE 8-hour EMEA | Calliope Gardner |
PagerDuty Incidents
See the 1-week report for acknowledged PagerDuty pages (long-term trend).
Alerts Volume
7 Day Issue Stats
- Oncall issues: 0
- Access Request: 0
- Change Issues: 10
- Incident Issues: 11
- CorrectiveAction Issues: 0
Change Issues
- 2025-10-30T08:15:43Z - [GPRD] Manually upgrade postgresql extension `p... (production#20795)
- 2025-10-29T20:23:54Z - 2025-11-10: Cut over from PagerDuty to incident... (production#20792)
- 2025-10-29T14:22:12Z - 2025-10-23: Test Rails 7.2 Rollout (gstg-cny, g... (production#20789 - closed)
- 2025-10-29T10:03:34Z - 2025-10-30: [GPRD] Update DB duration limit in ... (production#20787 - closed)
- 2025-10-29T06:40:36Z - 2025-10-30: Migrate external-secrets from Helmf... (production#20786 - closed)
- 2025-10-29T02:56:35Z - 2025-10-30: Migrate redis-pubsub gprd from Tank... (production#20785 - closed)
- 2025-10-28T23:37:43Z - Mimir - Add early head compaction based on in m... (production#20783 - closed)
- 2025-10-28T10:37:46Z - [GPRD] Enable SKIP_CREDENTIALS_FETCH for Redis ... (production#20777 - closed)
- 2025-10-28T10:14:23Z - Increase CPU limits for Wiz Sensor Service On G... (production#20776)
- 2025-10-28T09:21:43Z - 2025-10-29: Migrate redis-pubsub gstg from Tank... (production#20775 - closed)
Incident Issues
- 2025-11-01T08:00:10Z - 2025-11-01: Apdex SLO violation for fireworks_a... (production#20799) | severity4 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20799
- 2025-10-30T15:35:16Z - 2025-10-30: Sidekiq queueing SLO violation on m... (production#20797) | severity2 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20797
- 2025-10-30T13:59:14Z - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20796+ | severity4 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20796
- 2025-10-30T08:08:51Z - 2025-10-30: gprd PDM execution failed due to In... (production#20794) | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20794
- 2025-10-30T02:00:34Z - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20793+ | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20793
- 2025-10-29T17:43:11Z - 2025-10-29: Delayed Webhooks (production#20791) | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20791
- 2025-10-29T14:52:36Z - 2025-10-29: Sidekiq queueing SLI apdex SLO viol... (production#20790) | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20790
- 2025-10-29T01:14:56Z - 2025-10-29: Agentic Chat for gitlab.com/gitlab-... (production#20784) | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20784
- 2025-10-28T22:22:59Z - 2025-10-28: Teleport connections failing with "... (production#20781) | severity4 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20781
- 2025-10-28T16:36:16Z - 2025-10-28: Redirection error on https://versio... (production#20780) | severity3 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20780
- 2025-10-28T15:25:32Z - 2025-10-28: Mimir (production#20779) | severity2 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20779
CorrectiveAction Issues
- 2025-10-30T17:06:45Z - Lower maxreplicas for catchall and low-urgency-... (production-engineering#27872)
- 2025-10-30T17:01:59Z - investigate what `SyncProjectPoliciesWorker` d... (production-engineering#27871 - closed)
- 2025-10-30T14:25:25Z - https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27868+
- 2025-10-29T15:44:43Z - Review our queueing SLI for the import-shared-s... (production-engineering#27862)
- 2025-10-29T07:32:53Z - Split AvailableModelsResolver changes into back... (production-engineering#27861)
- 2025-10-28T22:44:32Z - Investigate teleport pods hitting Google Cloud ... (production-engineering#27857)
- 2025-10-28T16:27:28Z - https://gitlab.com/gitlab-com/gl-infra/production-engineering/-/issues/27855+
- 2025-10-28T15:48:53Z - Figure out why mimir broke (production-engineering#27854 - moved)
- 2025-10-28T07:17:35Z - Store periodic host profiles on a disk other th... (production-engineering#27853)
- 2025-10-28T07:15:00Z - Stop previous profiler process before running n... (production-engineering#27852)
Open Issue Stats
- Oncall issues: 1
- Change issues: 16
- Incident issues: 44
- Access Request: 0
- CorrectiveAction: 143
Open Change Issues
Open Incident Issues
| Created | Summary |
|---|---|
| 2025-11-01T08:00:10Z | 2025-11-01: Apdex SLO violation for fireworks_a... (production#20799) |
| 2025-10-30T02:00:34Z | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20793+ |
| 2025-10-24T07:16:54Z | 2025-10-24: Duo Agent Platform remote flow rais... (production#20763) |
| 2025-09-02T18:43:53Z | 2025-09-02: Jobs stuck in running indefinitely (production#20469) |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output ... (production-engineering#14205) |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by oncall-robot-assistant