Weekly Reliability (SRE) Team Newsletter – On-call Period: 2022-05-17 - 2022-05-24
<!-- This issue was automatically generated by https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant. -->

<!-- Announcements common to all the Reliability (SRE) Teams should be placed in this section. -->

# Announcements

#### [Engineering Week in Review](https://docs.google.com/document/d/1GQbnOP_lr9KVMVaBQx19WwKITCmh7H3YlgO-XqVwv0M/edit#) Highlights:

<!-- Announcements for each individual SRE Team should be made in their respective sections below. -->

# Team Updates

<!-- xxYZzXcV -->

---

# On-Call During This Period

| Schedule | Username |
| -------- | -------- |
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Hendrik Meyer |
| SRE 8-hour APAC | Craig Barrett |
| SRE 8-hour EMEA | Michal Wasilewski |
| SRE 8-hour EMEA | Igor Wiedler |

## PagerDuty Incidents

[See the 1 week report for acknowledged PD pages](https://nonprod-log.gitlab.net/app/dashboards#/view/dacc1d40-1c64-11ec-b8fd-b5d052b1f8cb?_g=(time:(from:'2022-05-17T01:00:00Z',to:'2022-05-24T01:00:00Z'),filters:!((query:(match_phrase:(type.keyword:pagerduty))),(query:(match_phrase:(status.keyword:triggered)))))) ([long-term trend](https://nonprod-log.gitlab.net/goto/a436702d57864666cd0c8867cfaf73e9))

### Alerts Volume

* [Weekly Trend](https://nonprod-log.gitlab.net/goto/4b20cce0-cb91-11ec-b3a6-472d0398dd6e)
* [Monthly Trend](https://nonprod-log.gitlab.net/goto/89e93020-cb91-11ec-b3a6-472d0398dd6e)
* [90 Days Trend by Service](https://nonprod-log.gitlab.net/goto/ed862250-cb91-11ec-b3a6-472d0398dd6e)

### 7 Day Issue Stats

* Oncall issues : **0**
* Access Request : **0**
* Change Issues : **17**
* Incident Issues : **21**
* CorrectiveAction Issues : **1**

#### Change Issues

* 2022-05-23T06:00:59Z - [Draft: GitLab-SSHD PROXY rolloud [gprd]](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7108)
* 2022-05-19T21:26:16Z - [2022-05-23: Staging CI-decomposition Dry-run](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7100)
* 2022-05-19T16:07:13Z - [Set the globally allowed IPs in production](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7098)
* 2022-05-19T11:49:15Z - [Enable gitlab-sshd on gprd CNY](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7097)
* 2022-05-19T10:19:33Z - [Partially reindex data to mitigate potential data loss for Advanced Search](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7096)
* 2022-05-19T00:28:45Z - [2022-05-19: Staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7095)
* 2022-05-18T13:37:39Z - [Set application limits value for pipelines creation rate limit](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7091)
* 2022-05-18T11:18:21Z - [Enable CI minutes limit on staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7088)
* 2022-05-18T06:17:33Z - [2022-05-18: Enable inactive projects deletion on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7086)
* 2022-05-18T02:52:22Z - [Convert gprd "secret" env variables to be pulled from Kubernetes secrets](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7085)
* 2022-05-17T14:47:26Z - [Add Embargo Firewall Rule to GPRD](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7079)
* 2022-05-17T08:28:56Z - [Retry `BackfillNamespaceIdForProjectRoute` batched background migration](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7076)
* 2022-05-17T08:01:01Z - [Enable gitlab-sshd on gprd CNY](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7075)
* 2022-05-16T22:27:14Z - [Update kas to v15.0.0 in gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7073)
* 2022-05-16T19:36:17Z - [Draft: Deploy infrastructure changes for a PROXY-enabled gitlab-shell on GitLab.com [gstg]](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7072)
* 2022-05-16T18:24:23Z - [Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `8.2.0`](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7071)
* 2022-05-16T07:49:35Z - [2022-05-16: Lower pages rate limit to 600 req/second per domain](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7066)

#### Incident Issues

* 2022-05-23T03:40:34Z - [2022-05-23: The goserver SLI of the gitaly service (`cny` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7107) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7107`
* 2022-05-22T16:51:37Z - [2022-05-22: The mainHttpServices SLI of the frontend service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106) | reliability~3760142 | ~"Service::API" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106`
* 2022-05-22T11:21:53Z - [2022-05-22: Traffic increase on web service impacting error ratio and apdex](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7105) | reliability~3760140 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7105`
* 2022-05-21T23:42:38Z - [2022-05-21: The goserver SLI of the gitaly service on node `file-47-stor-gprd.c.gitlab-production.internal` has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7104) | reliability~3760142 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7104`
* 2022-05-20T14:11:40Z - [2022-05-20: The loadbalancer SLI of the web service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7103) | reliability~3760140 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7103`
* 2022-05-20T10:45:03Z - [2022-05-20: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7102) | reliability~3760141 | ~"Service::Websockets" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7102`
* 2022-05-19T16:38:58Z - [2022-05-19: QA specs on 15.0 RC are failing](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7099) | reliability~3760140 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7099`
* 2022-05-18T19:43:20Z - [2022-05-18: QA smoke failure on staging-canary](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7094) | reliability~3760140 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7094`
* 2022-05-18T16:50:13Z - [2022-05-18 Possible data loss in the Advanced Search Elasticsearch index due to incident-7087](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093) | reliability~3760142 | ~"Service::Search" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093`
* 2022-05-18T11:47:44Z - [2022-05-18: Conference attendees at KubeCon cannot log into GitLab (stuck on Cloudflare - anti abuse maybe?)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7090) | reliability~3760141 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7090`
* 2022-05-18T11:28:54Z - [2022-05-18: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7089) | reliability~3760141 | ~"Service::Websockets" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7089`
* 2022-05-18T07:01:26Z - [2022-05-18: Premium and Ultimate features on GitLab.com are unavailable](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7087) | reliability~3760139 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7087`
* 2022-05-18T01:41:44Z - [2022-05-18: CI Gateway in us-east1-d experiencing connectivity issues](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7084) | reliability~3760140 | ~"Service::HAProxy" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7084`
* 2022-05-17T23:19:04Z - [2022-05-17: The goserver SLI of the gitaly service on gitaly-praefect nodes have error rates violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7083) | reliability~3760140 | ~"Service::Praefect" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7083`
* 2022-05-17T18:57:20Z - [2022-05-17: Chef 14 silently replaced with Cinc 15](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7082) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7082`
* 2022-05-17T16:25:02Z - [2022-05-17: Runners on ops are not available.](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7081) | reliability~3760140 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7081`
* 2022-05-17T12:09:31Z - [2022-05-17: Number of active Gitaly shards is low](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7077) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7077`
* 2022-05-17T04:51:14Z - [2022-05-17: LoggingVisibilityDiminished](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7074) | reliability~3760142 | ~"Service::Logging" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7074`
* 2022-05-16T09:39:33Z - [2022-05-16: The replicator_queue SLI of the praefect service (`main` stage) has not reported any traffic in the past 30m](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7070) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7070`
* 2022-05-16T09:25:23Z - [2022-05-16: Customers are not able to view Project tags page](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7069) | reliability~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7069`
* 2022-05-16T07:53:30Z - [2022-05-16: spike in pages errors for fdroid.gitlab.io](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7067) | reliability~3760141 | ~"Service::Pages" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7067`

#### CorrectiveAction Issues

* 2022-05-19T01:33:47Z - [Corrective action: LoggingVisibilityDiminished](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15757)
* 2022-05-18T10:58:03Z - [Corrective action: CI Gateway in us-east1-d experiencing connectivity issues](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15750)
* 2022-05-17T12:50:22Z - [Provision 2 gitaly shards to keep up with growth](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15738)
* 2022-05-17T10:57:48Z - [Evaluate the use of force-merge in the production logging elasticsearch cluster](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15734)
* 2022-05-16T09:59:08Z - [Adjust traffic cessation alerts](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15726)

### Open Issue Stats

* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **3**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **10**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **10**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **0**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **109**

#### Open Change Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2022-05-23T06:00:59Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7108) | Draft: GitLab-SSHD PROXY rolloud [gprd] |
| [2022-05-19T21:26:16Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7100) | 2022-05-23: Staging CI-decomposition Dry-run |
| [2022-05-19T00:28:45Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7095) | 2022-05-19: Staging |
| [2022-05-18T06:17:33Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7086) | 2022-05-18: Enable inactive projects deletion on GitLab.com |
| [2022-05-18T02:52:22Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7085) | Convert gprd "secret" env variables to be pulled from Kubernetes secrets |
| [2022-05-16T18:24:23Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7071) | Upgrade Global Search Elasticsearch cluster `prod-gitlab-com indexing-20200330` to `8.2.0` |
| [2022-05-10T11:45:46Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7022) | Raise tag count limit for ongoing phase 2 container registry migration |
| [2022-05-09T23:27:02Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7018) | TBC: VACUUM FULL or pg_repack the merge_request_diff_commits table |
| [2022-05-05T11:33:49Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6990) | [gprd] Roll out Git v2.36.1.gl1 |
| [2022-04-26T11:23:16Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6915) | Enable gitlab-sshd on gprd (b, c and d) |

</details>

#### Open Incident Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2022-05-22T16:51:37Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7106) | 2022-05-22: The mainHttpServices SLI of the frontend service (`main` stage) has an error rate violating SLO |
| [2022-05-18T16:50:13Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7093) | 2022-05-18 Possible data loss in the Advanced Search Elasticsearch index due to incident-7087 |
| [2022-05-10T14:37:51Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7026) | 2022-05-10: ElasticStack production logging cluster is overloaded and unreachable |
| [2022-05-10T14:25:44Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7025) | 2022-05-10: The rails_primary_sql SLI of the patroni service (`main` stage) has an apdex violating SLO |
| [2022-05-07T13:55:50Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7002) | 2022-05-07: The goserver SLI of the gitaly service on node `file-51-stor-gprd.c.gitlab-production.internal` has an error rate violating SLO |
| [2022-04-26T04:40:32Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6914) | 2022-04-26: Postgres exporter is showing errors for the last hour |
| [2022-04-12T07:13:49Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6813) | 2022-04-12: gitlab.com replica is not accessible |

</details>

#### Open Oncall Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-09-17T19:35:34Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14205) | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| [2020-12-18T22:29:14Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/12200) | CI clones fail for repositories with a path ending in a period |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/9660) | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |

</details>

#### Issues for Review during Incident Review Meeting

<details>

If there are any incidents you think would be good to review, please add them to the [Agenda](https://docs.google.com/document/d/1Llm9tXHC2dNt_eercRUUXlUyWmOVw00wmXWQQbWvv2c/edit?usp=sharing) for the next meeting.

</details>