Weekly Reliability (SRE) Team Newsletter – On-call Period: 2022-09-20 - 2022-09-27
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Matt Smiley |
| SRE 8-hour APAC | Devin Sylva |
| SRE 8-hour EMEA | Ahmad Sherif |
| SRE 8-hour EMEA | Alejandro Rodriguez |
PagerDuty Incidents
See the 1 week report for acknowledged PD pages (long-term trend)
Alerts Volume
7 Day Issue Stats
- Oncall issues: 0
- Access Request: 0
- Change Issues: 10
- Incident Issues: 22
- CorrectiveAction Issues: 0
Change Issues
- 2022-09-23T15:35:33Z - Fill in DORA configuration for gitlab-org/gitla... (production#7793 - closed)
- 2022-09-23T02:05:57Z - 2022-09-23: Upgrade gitlab-logs-prod ELK to 7.17 (production#7790 - closed)
- 2022-09-22T03:09:47Z - 2022-09-23: Remove ILM allocation settings and ... (production#7785 - closed)
- 2022-09-21T17:20:28Z - [Production] Stop MigrateSharedVulnerabilitySca... (production#7783 - closed)
- 2022-09-21T15:20:39Z - 2022-09-22: Remove corrupted series from Thanos (production#7782 - closed)
- 2022-09-21T15:18:22Z - [Production] Disable index_ci_builds_metadata_o... (production#7781 - closed)
- 2022-09-21T15:16:37Z - [Staging] Disable index_ci_builds_metadata_on_b... (production#7780 - closed)
- 2022-09-21T08:16:10Z - 2022-09-21: [GPRD] Use turbo mode on restore co... (production#7776 - closed)
- 2022-09-21T08:16:05Z - Update CA certs configurations for registry DB ... (production#7775 - closed)
- 2022-09-20T16:16:49Z - 2022-10-01: GPRD Truncate the rest of CI tables... (production#7770 - closed)
Incident Issues
- 2022-09-24T19:13:00Z - 2022-09-24: PubSub messages queueing up (production#7795 - closed) | reliability~3760141 | ServiceLogging |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7795
- 2022-09-24T13:41:07Z - 2022-09-24: GCP snapshots quota reaching satura... (production#7794 - closed) | reliability~3760142 | ServiceGCP |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7794
- 2022-09-23T12:09:49Z - https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7792+ | reliability~3760141 | ServiceAPI |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7792
- 2022-09-23T03:39:19Z - 2022-09-23: Blackbox probes for https://registr... (production#7791 - closed) | reliability~3760141 | ServiceGitLab Rails |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7791
- 2022-09-22T23:44:30Z - 2022-09-22: The GCP Quota utilization per envir... (production#7789 - closed) | reliability~3760141 | ServiceMonitoring-Other |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7789
- 2022-09-22T22:01:34Z - 2022-09-22: Consistient 500 errors when veiwing... (production#7788 - closed) | reliability~3760141 | ServiceWeb |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7788
- 2022-09-22T13:59:15Z - 2022-09-22: PubSub queuing high (production#7787 - closed) | reliability~3760141 | ServiceLogging |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7787
- 2022-09-22T12:03:19Z - 2022-09-22: prometheus-metamon is unreachable (production#7786 - closed) | reliability~3760141 | ServicePrometheus |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7786
- 2022-09-21T17:47:24Z - 2022-09-21: QA Failure in Master: Visiting Proj... (production#7784 - closed) | reliability~3760140 | ServiceGitaly |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7784
- 2022-09-21T12:13:17Z - 2022-09-21: Gprd Canary migrations failed (production#7779 - closed) | reliability~3760141 | |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7779
- 2022-09-21T10:04:26Z - 2022-09-21: MonitoringServiceThanosQueryFronte... (production#7778 - closed) | reliability~3760141 | ServiceThanos |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7778
- 2022-09-21T06:57:57Z - 2022-09-21: Jobs are not showing for some pipel... (production#7774 - closed) | reliability~3760141 | ServiceGitLab Rails |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7774
- 2022-09-20T20:15:40Z - 2022-09-20: thanos-query not responding, unable... (production#7773 - closed) | reliability~3760139 | ServiceThanos |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7773
- 2022-09-20T17:54:40Z - 2022-09-20: prometheus pods continue to restart... (production#7772 - closed) | reliability~3760141 | ServicePrometheus |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7772
- 2022-09-20T17:54:14Z - 2022-09-20: PubSub queuing high (production#7771 - closed) | reliability~3760141 | ServiceLogging |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7771
- 2022-09-20T15:01:24Z - 2022-09-20: Post-deploy migration failed on pre (production#7769 - closed) | reliability~3760140 | |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7769
- 2022-09-20T13:21:38Z - 2022-09-20: Jobs on runner-01-inf-ops intermitt... (production#7768 - closed) | reliability~3760142 | ServiceCI Runners |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7768
- 2022-09-20T10:48:34Z - 2022-09-20: Job with image failing on Windows s... (production#7767 - closed) | reliability~3760142 | ServiceWindows CI Runner |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7767
- 2022-09-20T05:26:25Z - 2022-09-20: fatal: [patroni-ci-data-analytics-0... (production#7766 - closed) | reliability~3760141 | ServicePostgres |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7766
- 2022-09-19T22:20:13Z - 2022-09-19: Increase 500 error rate in SAML log... (production#7765 - closed) | reliability~3760141 | ServiceGitLab Rails |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7765
- 2022-09-19T11:59:06Z - 2022-09-19: Failing migrations on gstg (production#7762 - closed) | reliability~3760141 | |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7762
- 2022-09-19T11:50:02Z - 2022-09-19: APT/Chef is broken on Ubuntu Xenial... (production#7761 - closed) | reliability~3760141 | ServiceChef |
  https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7761
CorrectiveAction Issues
- 2022-09-20T19:27:57Z - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16431+
- 2022-09-20T19:06:16Z - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16430+
- 2022-09-19T11:55:56Z - https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/16416+
Open Issue Stats
- Oncall issues: 2
- Change issues: 23
- Incident issues: 10
- Access Request: 0
- CorrectiveAction: 95
Open Change Issues
Open Incident Issues
| Created | Summary |
|---|---|
| 2022-09-22T22:01:34Z | 2022-09-22: Consistient 500 errors when veiwing... (production#7788 - closed) |
| 2022-09-20T17:54:40Z | 2022-09-20: prometheus pods continue to restart... (production#7772 - closed) |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2021-09-17T19:35:34Z | https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14205+ |
| 2020-12-18T22:29:14Z | https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12200+ |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by ops-gitlab-net