Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-06-15 - 2021-06-22
<!-- This issue was automatically generated by https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant. -->
<!-- Announcements common to all the Reliability (SRE) Teams should be placed in this section. -->
# Announcements
1. Iteration on the incident process: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13605
#### [Engineering Week in Review](https://docs.google.com/document/d/1GQbnOP_lr9KVMVaBQx19WwKITCmh7H3YlgO-XqVwv0M/edit#) Highlights:
<!-- Announcements for each individual SRE Team should be made in their respective sections below. -->
# Team Updates
---
# On-Call During This Period
| Schedule | Engineer |
| -------- | -------- |
| SRE 8-hour Americas | Cameron McFarland |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Cindy Pallares |
| SRE 8-hour APAC | Graeme Gillies |
| SRE 8-hour EMEA | Ahmad Sherif |
## PagerDuty Incidents
[See the 1-week report for acknowledged PagerDuty pages](https://nonprod-log.gitlab.net/goto/6fe6cde3a5e9b06d0f10703a2e2f12d4)
### 7 Day Issue Stats
* Oncall issues : **0**
* Access Request : **0**
* Change Issues : **11**
* Incident Issues : **29**
* CorrectiveAction Issues : **4**
#### Change Issues
* 2021-06-21T05:06:04Z - [Change healthcheck on https_git and websockets to be HTTP instead of TCP](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4936)
* 2021-06-18T14:38:49Z - [Migrate large projects off file-51-stor-gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4926)
* 2021-06-18T14:11:15Z - [2021-06-18: Cleanup GPRD TF Plan](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4925)
* 2021-06-18T09:12:01Z - [DRAFT: Upgrade GSTG Prometheus to 2.27.0](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4924)
* 2021-06-18T03:27:40Z - [Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4921)
* 2021-06-17T18:24:48Z - [Upgrade monitoring-with-count module on GPRD to TF v0.13 and correct problems with metadata startup scripts](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4920)
* 2021-06-17T16:15:56Z - [Migrate large projects off file-50-stor-gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4918)
* 2021-06-15T16:11:03Z - [Upgrade redis-cache to 6.0 in gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4908)
* 2021-06-15T14:49:29Z - [[env:pre] Update VMs to Ubuntu 18.04](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4907)
* 2021-06-15T14:12:50Z - [Upgrade generic-sv-with-group module on GPRD to TF v0.13 and correct problems with metadata startup scripts](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4904)
* 2021-06-15T10:51:44Z - [[GPRD] Bump pgbouncer max_client_conn](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4900)
#### Incident Issues
* 2021-06-21T20:53:17Z - [2021-06-21: Multiple alerts indicating gitlab.com is unreachable](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4945) | reliability~3760139 | ~"Service::GCP" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4945`
* 2021-06-21T14:56:42Z - [2021-06-21: Apdex drop in file-58](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4944) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4944`
* 2021-06-21T10:16:21Z - [2021-06-21 Patch release failure - 13.12 stable branch failing integration tests](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942`
* 2021-06-21T08:43:18Z - [2021-06-21: GKE unable to scale due to lack of SSD availability](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940) | reliability~3760140 | ~"Service::GCP" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940`
* 2021-06-21T05:59:15Z - [2021-06-21: GitLab Docs search has stopped working](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4939) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4939`
* 2021-06-21T05:58:30Z - [2021-06-21: Docs site search is down](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4938) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4938`
* 2021-06-21T05:55:55Z - [2021-06-21: All shared runners for dev.gitlab.org reporting not available](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4937) | reliability~3760140 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4937`
* 2021-06-20T15:31:34Z - [2021-06-20: The goserver SLI of the gitaly service on node `file-59-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4934) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4934`
* 2021-06-20T13:51:18Z - [2021-06-20: Flappy apdex on file-07](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4933) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4933`
* 2021-06-20T03:39:01Z - [2021-06-20: The sentry_events SLI of the sentry service (`main` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4931) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4931`
* 2021-06-20T03:34:33Z - [2021-06-19: The sentry_events SLI has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4930) | reliability~3760142 | ~"Service::Sentry" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4930`
* 2021-06-19T06:48:55Z - [2021-06-18: file-08-stor-gprd.c.gitlab-production.internal is down](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4929) | reliability~3760140 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4929`
* 2021-06-19T02:17:30Z - [2021-06-18: Postgres pending WAL files on primary is high](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4928) | reliability~3760141 | ~"Service::Postgres" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4928`
* 2021-06-18T23:21:32Z - [2021-06-18: The ssh Services SLI of the frontend service has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4927) | reliability~3760142 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4927`
* 2021-06-18T08:36:01Z - [2021-06-18: QA test failure on staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4922) | reliability~3760140 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4922`
* 2021-06-17T17:35:09Z - [2021-06-17: Increase in runner saturation and runner_requests](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4919) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4919`
* 2021-06-17T16:00:14Z - [Chef client hasn't run for longer than expected](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4917) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4917`
* 2021-06-17T15:01:31Z - [2021-06-17 CI Jobs fail with exit code 137](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916`
* 2021-06-17T06:21:12Z - [2021-06-17: Large amount of QA failures on PRE environment after upgrading to 14.0](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4915) | reliability~3760140 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4915`
* 2021-06-16T13:34:26Z - [2021-06-16: Redis cache replicas failing to resync after failover to upgraded node](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4913) | reliability~3760140 | ~"Service::Redis" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4913`
* 2021-06-15T21:26:15Z - [The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4911) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4911`
* 2021-06-15T17:36:37Z - [2021-06-15: The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4910) | reliability~3760141 | ~"Service::Monitoring" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4910`
* 2021-06-15T16:19:41Z - [2021-06-15: The goserver SLI of the gitaly service on node `file-praefect-02-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4909) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4909`
* 2021-06-15T14:45:22Z - [2021-06-15: Apdex drop for canary web](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4906) | reliability~3760141 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4906`
* 2021-06-15T14:13:44Z - [2021-06-15: Degraded file-51 apdex](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4905) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4905`
* 2021-06-15T13:54:57Z - [2021-06-15 git push errors for some projects during deployment](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4903) | reliability~3760142 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4903`
* 2021-06-15T13:22:32Z - [2021-06-15: Praefect replicate DB out of date, missing 4 days of data](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902`
* 2021-06-15T12:26:46Z - [2021-06-15: The shared_runner_queues SLI of the ci-runners service (`main` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4901) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4901`
* 2021-06-15T07:42:11Z - [2021-06-15: Chef clients failures have reached critical levels](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4898) | reliability~3760141 | ~"Service::Infrastructure" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4898`
#### CorrectiveAction Issues
* 2021-06-21T09:51:40Z - [Loosen gitaly goserver_op_service error SLO](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13603)
* 2021-06-21T09:01:53Z - [Create list and strategy for naming database clusters in Consul](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13601)
* 2021-06-21T06:12:51Z - [create blackbox probe or pingdom check for docs search](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13599)
* 2021-06-16T15:08:51Z - [The production sidekiq-catchall cluster members can become unresponsive without any alerting](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13577)
### Open Issue Stats
* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **4**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **2**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **41**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **0**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **202**
#### Open Change Issues
<details>
<summary>Show/Hide Table</summary>
| Created | Summary |
| ------- | ------- |
| [2021-06-21T05:06:04Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4936) | Change healthcheck on https_git and websockets to be HTTP instead of TCP |
| [2021-06-18T03:27:40Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4921) | Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production |
</details>
#### Open Incident Issues
<details>
<summary>Show/Hide Table</summary>
| Created | Summary |
| ------- | ------- |
| [2021-06-21T10:16:21Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942) | 2021-06-21 Patch release failure - 13.12 stable branch failing integration tests |
| [2021-06-21T08:43:18Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940) | 2021-06-21: GKE unable to scale due to lack of SSD availability |
| [2021-06-17T15:01:31Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916) | 2021-06-17 CI Jobs fail with exit code 137 |
| [2021-06-15T13:22:32Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902) | 2021-06-15: Praefect replicate DB out of date, missing 4 days of data |
| [2021-06-04T10:12:11Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4809) | 2021-06-04: GitLab Runner GPG key and passphrase leaked, need rotating |
| [2021-05-27T15:16:07Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4756) | 2021-05-27: Prometheus has no targets |
| [2021-05-27T14:54:37Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4755) | 2021-05-27: Prometheus has no targets |
</details>
#### Open Oncall Issues
<details>
<summary>Show/Hide Table</summary>
| Created | Summary |
| ------- | ------- |
| [2020-12-18T22:29:14Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12200) | CI clones fail for repositories with a path ending in a period |
| [2020-09-02T13:47:51Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11244) | disable-chef-client isn't preserved over reboots |
| [2020-08-11T16:39:37Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11098) | Investigate slow child pipeline triggering on pre.gitlab.com |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9660) | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
</details>