Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-09-14 - 2021-09-21
<!-- This issue was automatically generated by https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant. -->

<!-- Announcements common to all the Reliability (SRE) Teams should be placed in this section. -->

# Announcements

#### [Engineering Week in Review](https://docs.google.com/document/d/1GQbnOP_lr9KVMVaBQx19WwKITCmh7H3YlgO-XqVwv0M/edit#)

Highlights:

<!-- Announcements for each individual SRE Team should be made in their respective sections below. -->

# Team Updates

<!-- xxYZzXcV -->

---

# On-Call During This Period

| Schedule | Username |
| -------- | -------- |
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Hendrik Meyer |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Craig Miskell |
| SRE 8-hour APAC | Pierre Guinoiseau |
| SRE 8-hour EMEA | Alejandro Rodriguez |

## PagerDuty Incidents

[See the 1 week report for acknowledged PD pages](https://nonprod-log.gitlab.net/goto/6fe6cde3a5e9b06d0f10703a2e2f12d4)

### 7 Day Issue Stats

* Oncall issues : **1**
* Access Request : **0**
* Change Issues : **8**
* Incident Issues : **21**
* CorrectiveAction Issues : **0**

#### Change Issues

* 2021-09-17T15:24:25Z - [GPRD: wal-g update to v1.1](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5556)
* 2021-09-17T09:49:45Z - [2021-09-17: Set redis-01 to have a higher failover priority](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5554)
* 2021-09-17T09:20:12Z - [2021-09-17: Deploy MR to prepare for traffic increase to canary during hard PCL](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5553)
* 2021-09-16T01:15:24Z - [Cleanup problematic sessions in redis](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5548)
* 2021-09-15T21:21:21Z - [Add memory and disk space to redis](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5547)
* 2021-09-15T12:25:00Z - [Enter Nexus credentials on CustomersDot staging & production](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5544)
* 2021-09-15T11:41:29Z - [Delete customer specific JiraConnectInstallation record](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5543)
* 2021-09-14T09:16:21Z - [2021-09-14: Implement rate limits for the Files API on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5534)

#### Incident Issues

* 2021-09-20T10:25:20Z - [2021-09-20: Potentially malicious activity on gitlab pages](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5565) | reliability~3760141 | ~"Service::Pages" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5565`
* 2021-09-20T02:35:43Z - [2021-09-20 The GitLab job "walg-basebackup" has failed.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5562) | reliability~3760142 | ~"Service::Postgres" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5562`
* 2021-09-19T21:50:09Z - [2021-09-19: The grpc_requests SLI of the kas service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5561) | reliability~3760141 | ~"Service::KAS" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5561`
* 2021-09-19T02:12:57Z - [2021-09-19 The goserver SLI of the gitaly service on node `file-58-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5560) | reliability~3760142 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5560`
* 2021-09-18T18:11:35Z - [2021-09-18: Review requested for merge request to add CNAME DNS record: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2971](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5559) | reliability~3760142 | ~"Service::Infrastructure" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5559`
* 2021-09-18T11:06:50Z - [2021-09-18: Gitaly is down on file-43-stor-gprd.c.gitlab-production.internal](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5558) | reliability~3760140 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5558`
* 2021-09-18T06:12:57Z - [2021-09-18 The GitLab job "walg-basebackup" resource "walg-basebackup" has failed.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5557) | reliability~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5557`
* 2021-09-16T15:52:30Z - [2021-09-16 FilesystemFullSoon alerts on runners](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5551) | reliability~3760142 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5551`
* 2021-09-16T05:40:28Z - [2021-09-16 The GitLab job "walg-basebackup" resource "walg-basebackup" has failed.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5550) | reliability~3760141 | ~"Service::Postgres" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5550`
* 2021-09-16T03:46:37Z - [2021-09-16 - Users with private commit emails cannot create issues or MRs assigned to themselves in Canary](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5549) | reliability~3760141 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5549`
* 2021-09-15T16:31:33Z - [2021-09-15: Redis-persistent memory usage approaching saturation](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5546) | reliability~3760140 | ~"Service::Redis" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5546`
* 2021-09-15T16:19:01Z - [2021-09-15: GitLab.com was temporarily unavailable](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5545) | reliability~3760141 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5545`
* 2021-09-15T11:12:45Z - [2021-09-15: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5542) | reliability~3760141 | ~"Service::Sidekiq" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5542`
* 2021-09-15T00:31:22Z - [2021-09-15 The workhorse SLI of the api service violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5541) | reliability~3760141 | ~"Service::API" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5541`
* 2021-09-14T20:18:54Z - [2021-09-14: Elevated error rates for GitLab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5539) | reliability~3760140 | ~"Service::API" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5539`
* 2021-09-14T14:42:04Z - [2021-09-14: Last successful WAL-G basebackup was seen 30.086379999981986 hours ago for env gprd.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5537) | reliability~3760141 | ~"Service::Patroni" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5537`
* 2021-09-14T10:07:13Z - [2021-09-14: The goserver SLI of the gitaly service on node `file-26-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5535) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5535`
* 2021-09-14T04:22:09Z - [2021-09-14 Some prometheii lost track of alertmanager](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5532) | reliability~3760142 | ~"Service::Prometheus" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5532`
* 2021-09-14T03:54:16Z - [2021-09-14 dev.gitlab.org automatic daily upgrade failed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5531) | reliability~3760140 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5531`
* 2021-09-14T01:43:22Z - [2021-09-14 ops.gitlab.net certificate inconsistencies](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5530) | reliability~3760141 | ~"Service::Infrastructure" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5530`
* 2021-09-14T00:23:05Z - [Mehdiamer](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5529) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5529`

#### CorrectiveAction Issues

* 2021-09-20T02:58:59Z - [nginx-ingress-controller scale up events result in 502's for clients](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14210)
* 2021-09-16T06:39:03Z - [Upgrade Prometheus to 2.30.0](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14196)
* 2021-09-15T12:34:38Z - [HAProxy configuration audit with GKE backends](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14190)
* 2021-09-14T12:06:37Z - [Manage all GCP project metadata in Terraform](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14177)
* 2021-09-14T09:59:47Z - [Alert on high 503 error rate](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14172)
* 2021-09-14T05:07:15Z - [Prometheus sometimes doesn't refresh it's alertmanager IPs when they cycle](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14170)
* 2021-09-13T23:46:59Z - [Fix gitlab-redis-cli on redis persistent hosts](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14168)

### Open Issue Stats

* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **4**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **3**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **14**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **0**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **216**

#### Open Change Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-09-17T15:24:25Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5556) | GPRD: wal-g update to v1.1 |
| [2021-09-15T21:21:21Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5547) | Add memory and disk space to redis |
| [2021-09-14T09:16:21Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5534) | 2021-09-14: Implement rate limits for the Files API on GitLab.com |

</details>

#### Open Incident Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-09-18T18:11:35Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5559) | 2021-09-18: Review requested for merge request to add CNAME DNS record: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2971 |

</details>

#### Open Oncall Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-09-17T19:35:34Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/14205) | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| [2020-12-18T22:29:14Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12200) | CI clones fail for repositories with a path ending in a period |
| [2020-09-02T13:47:51Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11244) | disable-chef-client isn't preserved over reboots |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9660) | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |

</details>