Weekly Reliability (SRE) Team Newsletter – On-call Period: 2024-06-18 - 2024-06-25
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
Schedule | Username |
---|---|
SRE 8-hour Americas | Cameron McFarland |
SRE 8-hour Americas | Marcel Chacon |
SRE 8-hour APAC | Gonzalo Servat |
SRE 8-hour APAC | Nick Duff |
SRE 8-hour EMEA | Igor Wiedler |
SRE 8-hour EMEA | Jack Stephenson |
PagerDuty Incidents
See the 1-week report for acknowledged PD pages (long-term trend).
Alerts Volume
7 Day Issue Stats
- Oncall issues: 0
- Access Request: 0
- Change Issues: 10
- Incident Issues: 12
- CorrectiveAction Issues: 0
Change Issues
- 2024-06-24T01:50:25Z - 2024-06-24: Decommission nonprod legacy GKE Pro... (production#18181 - closed)
- 2024-06-23T23:22:11Z - [GSTG] - Remove redundant snapshots from stagin... (production#18180 - closed)
- 2024-06-21T13:27:19Z - 2024-06-21: Update .com Admin Setting for Produ... (production#18177 - closed)
- 2024-06-20T05:18:33Z - 2024-06-20: [non-production] Remove gitlab_unst... (production#18172 - closed)
- 2024-06-19T08:50:07Z - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18169
- 2024-06-18T17:54:50Z - [CR] [grpd] Enable non-tableflip Gitaly restart... (production#18166 - closed)
- 2024-06-18T15:51:21Z - CR [GSTG] Remove dedicated WAL disk test node f... (production#18164 - closed)
- 2024-06-18T15:10:02Z - CR [GPRD] Reduce max_wal_size from 128GB to 64GB (production#18163 - closed)
- 2024-06-18T12:47:15Z - [GPRD] Rollout of `ci_current_partition_value_102` (production#18161 - closed)
- 2024-06-18T08:25:56Z - Add slowlog to Global Search ES indexes for pro... (production#18159)
Incident Issues
- 2024-06-24T05:32:55Z - 2024-06-24: web error rate increase (cny) (production#18182 - closed) | severity3 | ServiceNeeded |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18182
- 2024-06-22T19:30:20Z - 2024-06-22: WebServiceRailsRequestErrorSLOViola... (production#18179 - closed) | severity3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18179
- 2024-06-21T10:49:45Z - 2024-06-21: PG transaction deadlock (production#18176 - closed) | severity3 | ServiceNeeded |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18176
- 2024-06-20T18:59:53Z - 2024-06-20: 502 Bad Gateway on gitlab.com (production#18175 - closed) | severity2 | ServiceCloudflare |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18175
- 2024-06-20T09:42:18Z - 2024-06-20: Uptick in 429 errors, unexpected au... (production#18173 - closed) | severity2 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18173
- 2024-06-19T20:56:46Z - 2024-06-19: GCPScheduledSnapshotsFailed patroni... (production#18171 - closed) | severity4 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18171
- 2024-06-19T09:24:58Z - 2024-06-19: MacOS runners are not picking up jobs (production#18170 - closed) | severity3 | ServiceNeeded |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18170
- 2024-06-19T04:28:32Z - 2024-06-19: Patroni Consul service has more tha... (production#18168 - closed) | severity3 | ServiceNeeded |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18168
- 2024-06-18T18:30:45Z - 2024-06-18: pgbouncer sidekiq saturation (production#18167 - closed) | severity3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18167
- 2024-06-18T16:52:42Z - 2024-06-18: gprd-gitaly job failed due to unrea... (production#18165 - closed) | severity3 | ServiceDeployTooling |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18165
- 2024-06-18T14:07:32Z - 2024-06-18: Short apdex dip on kube apiserver (production#18162 - closed) | severity4 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18162
- 2024-06-18T08:30:41Z - 2024-06-18: n2d CPU quota reached for gitlab-ci... (production#18160 - closed) | severity4 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18160
CorrectiveAction Issues
Open Issue Stats
- Oncall issues: 1
- Change issues: 16
- Incident issues: 1
- Access Request: 0
- CorrectiveAction: 52
Open Change Issues
Open Incident Issues
Created | Summary |
---|---|
2024-06-17T11:40:16Z | 2024-06-17: gitlab-org/gitlab pipelines get `fa... (production#18156 - closed) |
Open Oncall Issues
Created | Summary |
---|---|
2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output ... (production-engineering#14205) |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by ops-gitlab-net