Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-06-15 - 2021-06-22
<!-- This issue was automatically generated by https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant. -->
<!-- Announcements common to all the Reliability (SRE) Teams should be placed in this section. -->

# Announcements

1. Iteration on the incident process: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13605

#### [Engineering Week in Review](https://docs.google.com/document/d/1GQbnOP_lr9KVMVaBQx19WwKITCmh7H3YlgO-XqVwv0M/edit#) Highlights:

<!-- Announcements for each individual SRE Team should be made in their respective sections below. -->

# Team Updates

<!-- xxYZzXcV -->

---

# On-Call During This Period

| Schedule | Username |
| -------- | -------- |
| SRE 8-hour Americas | Cameron McFarland |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Cindy Pallares |
| SRE 8-hour APAC | Graeme Gillies |
| SRE 8-hour EMEA | Ahmad Sherif |

## PagerDuty Incidents

[See the 1 week report for acknowledged PD pages](https://nonprod-log.gitlab.net/goto/6fe6cde3a5e9b06d0f10703a2e2f12d4)

### 7 Day Issue Stats

* Oncall issues : **0**
* Access Request : **0**
* Change Issues : **11**
* Incident Issues : **29**
* CorrectiveAction Issues : **0**

#### Change Issues

* 2021-06-21T05:06:04Z - [Change healthcheck on https_git and websockets to be HTTP instead of TCP](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4936)
* 2021-06-18T14:38:49Z - [Migrate large projects off file-51-stor-gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4926)
* 2021-06-18T14:11:15Z - [2021-06-18: Cleanup GPRD TF Plan](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4925)
* 2021-06-18T09:12:01Z - [DRAFT: Upgrade GSTG Prometheus to 2.27.0](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4924)
* 2021-06-18T03:27:40Z - [Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4921)
* 2021-06-17T18:24:48Z - [Upgrade
monitoring-with-count module on GPRD to TF v0.13 and correct problems with metadata startup scripts](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4920)
* 2021-06-17T16:15:56Z - [Migrate large projects off file-50-stor-gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4918)
* 2021-06-15T16:11:03Z - [Upgrade redis-cache to 6.0 in gprd](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4908)
* 2021-06-15T14:49:29Z - [[env:pre] Update VMs to Ubuntu 18.04](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4907)
* 2021-06-15T14:12:50Z - [Upgrade generic-sv-with-group module on GPRD to TF v0.13 and correct problems with metadata startup scripts](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4904)
* 2021-06-15T10:51:44Z - [[GPRD] Bump pgbouncer max_client_conn](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4900)

#### Incident Issues

* 2021-06-21T20:53:17Z - [2021-06-21: Multiple alerts indicating gitlab.com is unreachable](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4945) | reliability~3760139 | ~"Service::GCP" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4945`
* 2021-06-21T14:56:42Z - [2021-06-21: Apdex drop in file-58](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4944) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4944`
* 2021-06-21T10:16:21Z - [2021-06-21 Patch release failure - 13.12 stable branch failing integration tests](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942`
* 2021-06-21T08:43:18Z - [2021-06-21: GKE unable to scale due to lack of SSD availability](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940) | reliability~3760140 | ~"Service::GCP" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940`
* 2021-06-21T05:59:15Z - [2021-06-21: GitLab Docs search
has stopped working](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4939) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4939`
* 2021-06-21T05:58:30Z - [2021-06-21: Docs site search is down](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4938) | reliability~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4938`
* 2021-06-21T05:55:55Z - [2021-06-21: All shared runners for dev.gitlab.org reporting not available](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4937) | reliability~3760140 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4937`
* 2021-06-20T15:31:34Z - [2021-06-20: The goserver SLI of the gitaly service on node `file-59-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4934) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4934`
* 2021-06-20T13:51:18Z - [2021-06-20: Flappy apdex on file-07](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4933) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4933`
* 2021-06-20T03:39:01Z - [2021-06-20: The sentry_events SLI of the sentry service (`main` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4931) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4931`
* 2021-06-20T03:34:33Z - [2021-06-19: The sentry_events SLI has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4930) | reliability~3760142 | ~"Service::Sentry" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4930`
* 2021-06-19T06:48:55Z - [2021-06-18: file-08-stor-gprd.c.gitlab-production.internal is down](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4929) | reliability~3760140 | ~"Service::Git" |
`https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4929`
* 2021-06-19T02:17:30Z - [2021-06-18: Postgres pending WAL files on primary is high](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4928) | reliability~3760141 | ~"Service::Postgres" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4928`
* 2021-06-18T23:21:32Z - [2021-06-18: The ssh Services SLI of the frontend service has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4927) | reliability~3760142 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4927`
* 2021-06-18T08:36:01Z - [2021-06-18: QA test failure on staging](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4922) | reliability~3760140 | ~"Service::GitLab Rails" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4922`
* 2021-06-17T17:35:09Z - [2021-06-17: Increase in runner saturation and runner_requests](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4919) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4919`
* 2021-06-17T16:00:14Z - [Chef client hasn't run for longer than expected](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4917) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4917`
* 2021-06-17T15:01:31Z - [2021-06-17 CI Jobs fail with exit code 137](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916`
* 2021-06-17T06:21:12Z - [2021-06-17: Large amount of QA failures on PRE environment after upgrading to 14.0](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4915) | reliability~3760140 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4915`
* 2021-06-16T13:34:26Z - [2021-06-16: Redis cache replicas failing to resync after failover to upgraded
node](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4913) | reliability~3760140 | ~"Service::Redis" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4913`
* 2021-06-15T21:26:15Z - [The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4911) | | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4911`
* 2021-06-15T17:36:37Z - [2021-06-15: The grafana SLI of the monitoring service (`main` stage) has an error rate violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4910) | reliability~3760141 | ~"Service::Monitoring" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4910`
* 2021-06-15T16:19:41Z - [2021-06-15: The goserver SLI of the gitaly service on node `file-praefect-02-stor-gprd.c.gitlab-production.internal` has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4909) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4909`
* 2021-06-15T14:45:22Z - [2021-06-15: Apdex drop for canary web](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4906) | reliability~3760141 | ~"Service::Web" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4906`
* 2021-06-15T14:13:44Z - [2021-06-15: Degraded file-51 apdex](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4905) | reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4905`
* 2021-06-15T13:54:57Z - [2021-06-15 git push errors for some projects during deployment](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4903) | reliability~3760142 | ~"Service::Git" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4903`
* 2021-06-15T13:22:32Z - [2021-06-15: Praefect replicate DB out of date, missing 4 days of data](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902) |
reliability~3760141 | ~"Service::Gitaly" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902`
* 2021-06-15T12:26:46Z - [2021-06-15: The shared_runner_queues SLI of the ci-runners service (`main` stage) has an apdex violating SLO](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4901) | reliability~3760141 | ~"Service::CI Runners" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4901`
* 2021-06-15T07:42:11Z - [2021-06-15: Chef clients failures have reached critical levels](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4898) | reliability~3760141 | ~"Service::Infrastructure" | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4898`

#### CorrectiveAction Issues

* 2021-06-21T09:51:40Z - [Loosen gitaly goserver_op_service error SLO](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13603)
* 2021-06-21T09:01:53Z - [Create list and strategy for naming database clusters in Consul](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13601)
* 2021-06-21T06:12:51Z - [create blackbox probe or pingdom check for docs search](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13599)
* 2021-06-16T15:08:51Z - [The production sidekiq-catchall cluster members can become unresponsive without any alerting](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13577)

### Open Issue Stats

* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **4**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **2**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **41**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **0**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **202**

#### Open Change Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-06-21T05:06:04Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4936) | Change healthcheck on https_git and websockets to be HTTP instead of TCP |
| [2021-06-18T03:27:40Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4921) | Set ENABLE_RAILS_61_CONNECTION_HANDLING env variable to enable new Rails connection handling for Production |

</details>

#### Open Incident Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2021-06-21T10:16:21Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4942) | 2021-06-21 Patch release failure - 13.12 stable branch failing integration tests |
| [2021-06-21T08:43:18Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4940) | 2021-06-21: GKE unable to scale due to lack of SSD availability |
| [2021-06-17T15:01:31Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4916) | 2021-06-17 CI Jobs fail with exit code 137 |
| [2021-06-15T13:22:32Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4902) | 2021-06-15: Praefect replicate DB out of date, missing 4 days of data |
| [2021-06-04T10:12:11Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4809) | 2021-06-04: GitLab Runner GPG key and passphrase leaked, need rotating |
| [2021-05-27T15:16:07Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4756) | 2021-05-27: Prometheus has no targets |
| [2021-05-27T14:54:37Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4755) | 2021-05-27: Prometheus has no targets |

</details>

#### Open Oncall Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Summary |
| ------- | ------- |
| [2020-12-18T22:29:14Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12200) | CI clones fail for repositories with a path ending in a period |
| [2020-09-02T13:47:51Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11244) | disable-chef-client isn't preserved over reboots |
| [2020-08-11T16:39:37Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11098) | Investigate slow child pipeline triggering on pre.gitlab.com |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9660) | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |

</details>