Weekly Reliability (SRE) Team Newsletter – On-call Period: 2025-08-26 - 2025-09-02
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
Schedule | Username |
---|---|
SRE 8-hour Americas | Cameron McFarland |
SRE 8-hour Americas | Dan Ryan |
SRE 8-hour APAC | Anton Starovoytov |
SRE 8-hour APAC | Tarun Khandelwal |
SRE 8-hour EMEA | Silvester Wainaina |
SRE 8-hour EMEA | Jack Stephenson |
SRE 8-hour EMEA | Florian Forster |
PagerDuty Incidents
See the 1-week report for acknowledged PagerDuty pages (long-term trend)
Alerts Volume
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 10
- Incident Issues : 16
- CorrectiveAction Issues : 0
Change Issues
- 2025-09-01T06:32:31Z - Enable setting "Enforce PIPL compliance" (production#20457 - closed)
- 2025-08-29T21:37:50Z - [gprd] Fix chef converging on VMs (production#20453 - closed)
- 2025-08-28T16:24:03Z - [gprd] Add a new CI replica database to the CI ... (production#20443 - closed)
- 2025-08-27T20:08:28Z - [GPRD] [2025-10-24 to 2025-10-28] - Upgrade Pos... (production#20440)
- 2025-08-27T20:08:22Z - [GPRD] [2025-09-26 to 2025-09-30] - Upgrade Pos... (production#20439)
- 2025-08-27T19:52:17Z - [GSTG] [2025-10-02 to 2025-10-03] - Upgrade Pos... (production#20438)
- 2025-08-27T19:51:53Z - [GSTG] [2025-09-19 to 2025-09-20] - Upgrade Pos... (production#20437)
- 2025-08-27T13:46:14Z - 2025-08-27: macos runners update to 18.3.0~pre.... (production#20435 - closed)
- 2025-08-26T12:12:57Z - 2025-08-27: macos runner host AMI upgrade and r... (production#20430 - closed)
- 2025-08-26T11:00:59Z - [GPRD] Vulnerabilities Scan result API availabl... (production#20428)
Incident Issues
- 2025-09-01T02:14:43Z - 2025-09-01: No traffic detected by runway_lb lo... (production#20456 - closed) | severity4
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20456
- 2025-08-31T16:54:10Z - 2025-08-31: Sidekiq queueing apdex SLO violatio... (production#20455 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20455
- 2025-08-30T05:47:51Z - 2025-08-30: Unable to start containers for logg... (production#20454 - closed) | severity4
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20454
- 2025-08-29T19:15:18Z - 2025-08-29: Several Chef managed VMs cannot con... (production#20452) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20452
- 2025-08-29T15:09:09Z - 2025-08-29: Timeouts for MR Vulnerability Checks (production#20450 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20450
- 2025-08-29T06:29:33Z - 2025-08-29: `gprd` post-deploy migration with I... (production#20449) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20449
- 2025-08-29T02:59:35Z - 2025-08-29: gstg-cny deployment failing due to ... (production#20448 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20448
- 2025-08-28T22:54:21Z - 2025-08-28: Sidekiq queueing SLO violation on c... (production#20447 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20447
- 2025-08-28T21:20:41Z - 2025-08-28: Sidekiq job queueing duration SLO v... (production#20446 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20446
- 2025-08-28T13:19:54Z - 2025-08-28: Rails replica SQL transactions from... (production#20442) | severity2
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20442
- 2025-08-27T10:40:18Z - 2025-08-27: 500 Errors while pushing file renam... (production#20434 - closed) | severity2
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20434
- 2025-08-27T09:28:40Z - 2025-08-27: Sidekiq execution error rate SLO vi... (production#20433 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20433
- 2025-08-27T08:21:23Z - 2025-08-27: Disk space utilization on gitaly no... (production#20432 - closed) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20432
- 2025-08-26T15:40:32Z - 2025-08-26: Apdex SLO violation for workhorse w... (production#20431 - closed) | severity4
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20431
- 2025-08-26T11:31:14Z - 2025-08-26: Workhorse apdex SLO violation on ap... (production#20429 - closed) | severity4
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20429
- 2025-08-26T08:31:03Z - 2025-08-26: Production post deployment migratio... (production#20427) | severity3
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20427
CorrectiveAction Issues
- 2025-08-30T13:58:35Z - Ensure Chef is restarted and working on console... (production-engineering#27527 - closed)
- 2025-08-30T10:17:12Z - Scale the covered-experiences pubsubbeat deploy... (production-engineering#27526)
- 2025-08-30T10:16:42Z - Create topic for covered-experiences pubsubbeat... (production-engineering#27525 - closed)
- 2025-08-30T10:16:00Z - Update alerting aggregations to allow targetted... (production-engineering#27524)
- 2025-08-29T13:38:17Z - Implement alternative migration using disable_d... (production-engineering#27510)
- 2025-08-28T19:21:04Z - Add an alert for the LWLock state saturation state (production-engineering#27492)
- 2025-08-28T17:15:00Z - Update or Create Runbook for identifying LWLock... (production-engineering#27491 - closed)
- 2025-08-28T17:02:35Z - We should add a chart of LWLocks to each databa... (production-engineering#27490)
- 2025-08-26T21:01:08Z - Metrics should allow for custom thresholds per ... (production-engineering#27476)
- 2025-08-26T16:11:14Z - Re-submit the migration with the with_lock_retr... (production-engineering#27473 - closed)
Open Issue Stats
- Oncall issues : 1
- Change issues : 17
- Incident issues : 41
- Access Request : 0
- CorrectiveAction : 149
Open Change Issues
Open Incident Issues
Created | Summary |
---|---|
2025-09-01T02:14:43Z | 2025-09-01: No traffic detected by runway_lb lo... (production#20456 - closed) |
2025-08-29T19:15:18Z | 2025-08-29: Several Chef managed VMs cannot con... (production#20452) |
Open Oncall Issues
Created | Summary |
---|---|
2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output ... (production-engineering#14205) |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by oncall-robot-assistant