Weekly Reliability (SRE) Team Newsletter – On-call Period: 2022-05-17 - 2022-05-24
<!-- This issue was automatically generated by https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant. -->
<!-- Announcements common to all the Reliability (SRE) Teams should be placed in this section. -->
# Announcements
#### [Engineering Week in Review](https://docs.google.com/document/d/1GQbnOP_lr9KVMVaBQx19WwKITCmh7H3YlgO-XqVwv0M/edit#) Highlights:
<!-- Announcements for each individual SRE Team should be made in their respective sections below. -->
# Team Updates
---
# On-Call During This Period
| Schedule | Name |
| -------- | -------- |
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Hendrik Meyer |
| SRE 8-hour APAC | Craig Barrett |
| SRE 8-hour EMEA | Michal Wasilewski |
| SRE 8-hour EMEA | Igor Wiedler |
## PagerDuty Incidents
[See the 1 week report for acknowledged PD pages](https://nonprod-log.gitlab.net/app/dashboards#/view/dacc1d40-1c64-11ec-b8fd-b5d052b1f8cb?_g=(time:(from:'2022-05-17T01:00:00Z',to:'2022-05-24T01:00:00Z'),filters:!((query:(match_phrase:(type.keyword:pagerduty))),(query:(match_phrase:(status.keyword:triggered)))))) ([long-term trend](https://nonprod-log.gitlab.net/goto/a436702d57864666cd0c8867cfaf73e9))
### Alerts Volume
* [Weekly Trend](https://nonprod-log.gitlab.net/goto/4b20cce0-cb91-11ec-b3a6-472d0398dd6e)
* [Monthly Trend](https://nonprod-log.gitlab.net/goto/89e93020-cb91-11ec-b3a6-472d0398dd6e)
* [90 Days Trend by Service](https://nonprod-log.gitlab.net/goto/ed862250-cb91-11ec-b3a6-472d0398dd6e)
### 7 Day Issue Stats
* Oncall issues : **0**
* Access Request : **0**
* Change Issues : **17**
* Incident Issues : **21**
* CorrectiveAction Issues : **5**
#### Change Issues
* 2022-05-23T06:00:59Z - [Draft: GitLab-SSHD PROXY rollout [gprd]](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7108)
* 2022-05-19T21:26:16Z - [2022-05-23: Staging CI-decomposition Dry-run](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7100)
* 2022-05-19T16:07:13Z - [Set the globally allowed IPs in production](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7098)
* 2022-05-19T11:49:15Z - [Enable gitlab-sshd on gprd CNY](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7097)
* 2022-05-19T10:19:33Z - [Partially reindex data to mitigate potential data loss for Advanced Search](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7096)
* 2022-05-19T00:28:45Z - [2022-05-19: Staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7095)
* 2022-05-18T13:37:39Z - [Set application limits value for pipelines creation rate limit](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7091)
* 2022-05-18T11:18:21Z - [Enable CI minutes limit on staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7088)
* 2022-05-18T06:17:33Z - [2022-05-18: Enable inactive projects deletion on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7086)
* 2022-05-18T02:52:22Z - [Convert gprd "secret" env variables to be pulled from Kubernetes secrets](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7085)
* 2022-05-17T14:47:26Z - [Add Embargo Firewall Rule to GPRD](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7079)
* 2022-05-17T08:28:56Z - [Retry `BackfillNamespaceIdForProjectRoute` batched background migration](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7076)
* 2022-05-17T08:01:01Z - [Enable gitlab-sshd on gprd CNY](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7075)
* 2022-05-16T22:27:14Z - [Update kas to v15.0.0 in gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7073)
* 2022-05-16T19:36:17Z - [Draft: Deploy infrastructure changes for a PROXY-enabled gitlab-shell on GitLab.com [gstg]](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7072)
* 2022-05-16T18:24:23Z - [Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `8.2.0`](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7071)
* 2022-05-16T07:49:35Z - [2022-05-16: Lower pages rate limit to 600 req/second per domain](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7066)
#### Incident Issues
* 2022-05-23T03:40:34Z - [2022-05-23: The goserver SLI of the gitaly service (`cny` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7107) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7107`
* 2022-05-22T16:51:37Z - [2022-05-22: The mainHttpServices SLI of the frontend service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106) | reliability~3760142 | ~"Service::API" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106`
* 2022-05-22T11:21:53Z - [2022-05-22: Traffic increase on web service impacting error ratio and apdex](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7105) | reliability~3760140 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7105`
* 2022-05-21T23:42:38Z - [2022-05-21: The goserver SLI of the gitaly service on node `file-47-stor-gprd.c.gitlab-production.internal` has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7104) | reliability~3760142 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7104`
* 2022-05-20T14:11:40Z - [2022-05-20: The loadbalancer SLI of the web service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7103) | reliability~3760140 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7103`
* 2022-05-20T10:45:03Z - [2022-05-20: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7102) | reliability~3760141 | ~"Service::Websockets" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7102`
* 2022-05-19T16:38:58Z - [2022-05-19: QA specs on 15.0 RC are failing](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7099) | reliability~3760140 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7099`
* 2022-05-18T19:43:20Z - [2022-05-18: QA smoke failure on staging-canary](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7094) | reliability~3760140 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7094`
* 2022-05-18T16:50:13Z - [2022-05-18 Possible data loss in the Advanced Search Elasticsearch index due to incident-7087](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093) | reliability~3760142 | ~"Service::Search" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093`
* 2022-05-18T11:47:44Z - [2022-05-18: Conference attendees at KubeCon cannot log into GitLab (stuck on Cloudflare - anti abuse maybe?)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7090) | reliability~3760141 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7090`
* 2022-05-18T11:28:54Z - [2022-05-18: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7089) | reliability~3760141 | ~"Service::Websockets" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7089`
* 2022-05-18T07:01:26Z - [2022-05-18: Premium and Ultimate features on GitLab.com are unavailable](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7087) | reliability~3760139 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7087`
* 2022-05-18T01:41:44Z - [2022-05-18: CI Gateway in us-east1-d experiencing connectivity issues](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7084) | reliability~3760140 | ~"Service::HAProxy" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7084`
* 2022-05-17T23:19:04Z - [2022-05-17: The goserver SLI of the gitaly service on gitaly-praefect nodes have error rates violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7083) | reliability~3760140 | ~"Service::Praefect" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7083`
* 2022-05-17T18:57:20Z - [2022-05-17: Chef 14 silently replaced with Cinc 15](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7082) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7082`
* 2022-05-17T16:25:02Z - [2022-05-17: Runners on ops are not available.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7081) | reliability~3760140 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7081`
* 2022-05-17T12:09:31Z - [2022-05-17: Number of active Gitaly shards is low](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7077) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7077`
* 2022-05-17T04:51:14Z - [2022-05-17: LoggingVisibilityDiminished](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7074) | reliability~3760142 | ~"Service::Logging" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7074`
* 2022-05-16T09:39:33Z - [2022-05-16: The replicator_queue SLI of the praefect service (`main` stage) has not reported any traffic in the past 30m](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7070) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7070`
* 2022-05-16T09:25:23Z - [2022-05-16: Customers are not able to view Project tags page](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7069) | reliability~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7069`
* 2022-05-16T07:53:30Z - [2022-05-16: spike in pages errors for fdroid.gitlab.io](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7067) | reliability~3760141 | ~"Service::Pages" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7067`
#### CorrectiveAction Issues
* 2022-05-19T01:33:47Z - [Corrective action: LoggingVisibilityDiminished](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15757)
* 2022-05-18T10:58:03Z - [Corrective action: CI Gateway in us-east1-d experiencing connectivity issues](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15750)
* 2022-05-17T12:50:22Z - [Provision 2 gitaly shards to keep up with growth](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15738)
* 2022-05-17T10:57:48Z - [Evaluate the use of force-merge in the production logging elasticsearch cluster](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15734)
* 2022-05-16T09:59:08Z - [Adjust traffic cessation alerts](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15726)
### Open Issue Stats
* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **3**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **10**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **10**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **0**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **109**
#### Open Change Issues
<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2022-05-23T06:00:59Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7108) | Draft: GitLab-SSHD PROXY rollout [gprd] |
| [2022-05-19T21:26:16Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7100) | 2022-05-23: Staging CI-decomposition Dry-run |
| [2022-05-19T00:28:45Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7095) | 2022-05-19: Staging |
| [2022-05-18T06:17:33Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7086) | 2022-05-18: Enable inactive projects deletion on GitLab.com |
| [2022-05-18T02:52:22Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7085) | Convert gprd "secret" env variables to be pulled from Kubernetes secrets |
| [2022-05-16T18:24:23Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7071) | Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `8.2.0` |
| [2022-05-10T11:45:46Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7022) | Raise tag count limit for ongoing phase 2 container registry migration |
| [2022-05-09T23:27:02Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7018) | TBC: VACUUM FULL or pg_repack the merge_request_diff_commits table |
| [2022-05-05T11:33:49Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6990) | [gprd] Roll out Git v2.36.1.gl1 |
| [2022-04-26T11:23:16Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6915) | Enable gitlab-sshd on gprd (b, c and d) |
</details>
#### Open Incident Issues
<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2022-05-22T16:51:37Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106) | 2022-05-22: The mainHttpServices SLI of the frontend service (`main` stage) has an error rate violating SLO |
| [2022-05-18T16:50:13Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093) | 2022-05-18 Possible data loss in the Advanced Search Elasticsearch index due to incident-7087 |
| [2022-05-10T14:37:51Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7026) | 2022-05-10: ElasticStack production logging cluster is overloaded and unreachable |
| [2022-05-10T14:25:44Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7025) | 2022-05-10: The rails_primary_sql SLI of the patroni service (`main` stage) has an apdex violating SLO |
| [2022-05-07T13:55:50Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7002) | 2022-05-07: The goserver SLI of the gitaly service on node `file-51-stor-gprd.c.gitlab-production.internal` has an error rate violating SLO |
| [2022-04-26T04:40:32Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6914) | 2022-04-26: Postgres exporter is showing errors for the last hour |
| [2022-04-12T07:13:49Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6813) | 2022-04-12: gitlab.com replica is not accessible |
</details>
#### Open Oncall Issues
<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-09-17T19:35:34Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14205) | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| [2020-12-18T22:29:14Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12200) | CI clones fail for repositories with a path ending in a period |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/9660) | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
</details>
#### Issues for Review during Incident Review Meeting
<details>
If there are any incidents you think would be good to review, please add them to the [Agenda](https://docs.google.com/document/d/1Llm9tXHC2dNt_eercRUUXlUyWmOVw00wmXWQQbWvv2c/edit?usp=sharing) for the next meeting.
</details>