Reliability Team Staff Report for period: 2020-06-16 - 2020-06-23
## Director's Notes

* The simulations/demos which the Datastores team is running are incredibly important to our capabilities as an SRE org. Please attend if you can, and watch the recording if you cannot. They will be detailing ~25 new runbooks related to support of our Postgres infrastructure. In your next sprint planning sessions, please discuss reserving time to review these new runbooks.
* Please make sure you're taking enough time off for yourself. I've fallen into the trap of "well can't really travel, so why take time off?", had to course-correct, and have now planned some time off and away from work. Even if the current situation limits options on how to use time off, please consider taking time off to recharge. You all do hard and challenging work, and it is fair to take a breather away from that work.

## Team Updates

### Core Infrastructure

* We would like to collaborate with Delivery on https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9560 in the coming July work. We have discovered through console work and Vault work that we'll need to interact with Vault in a safe way, much like the existing need for kubectl and helm.

### Datastores

* DB runbook work continues to progress; see the ones in flight [here](https://gitlab.com/groups/gitlab-com/gl-infra/-/boards/1688503?scope=all&utf8=%E2%9C%93&state=opened&milestone_title=Datastores%20Team%20-%20June%202020&label_name[]=team%3A%3ADatastores&search=%5Brunbook%5D).
  * We will start testing them with the rest of the Reliability team tomorrow, Wednesday 24th June, at 4pm UTC. These calls will be recorded.
  * More tests will happen on a daily basis (we have ~24 runbooks to test!)
* Quick update on the DB Sharding working group: we are not sharding the DB anymore, for now. The group has transitioned to a “Scaling working group”, where we’ll talk about some of these points:
  * Running several gitlab.com instances in various world geographies. Europe/Germany/China to start with? The main limitation will be the cost of all this infra.
  * Improving HA by enabling failover between these locations.
  * Keeping a very close eye on the DB capacity wall (hitting us maybe 18-24 months from now), and having the right scalability strategy (sharding will be part of it, most likely) ready to start implementing.

### Observability

This week we're continuing to make strides migrating o11y services from VMs to k8s:

- pubsubbeat: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8962
- Alertmanager: https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/234
- Grafana (prep): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/7788

We've recently become the custodians of the [pubsubbeat](https://gitlab.com/gitlab-org/pubsubbeat) open-source project. After we forked it to GitLab, Google archived their GitHub project, so we're now the maintainers!

Additionally, we're continuing to focus on https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10095 to determine ways to diversify our stack and how best to utilize SaaS and self-managed components in our pipeline.

## On-Call Schedule Adjustments

Please see https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10660 for a discussion on proposed on-call adjustments.
## On-Call During this Period

Devin Sylva, Craig Barrett, Henri Philipps, & Nels Nelson

## PagerDuty Incidents

<details>

* Number of incidents: **58**

<summary>Show/Hide Table</summary>

| Created | Summary |
| ------ | ------- |
| [2020-06-16T14:53:45Z](https://gitlab.pagerduty.com/incidents/P10DSOS) | [21941] Firing 1 - monitor.gitlab.net is down |
| [2020-06-16T20:13:07Z](https://gitlab.pagerduty.com/incidents/PLV7283) | [21943] Firing 1 - Increased Server Response Errors |
| [2020-06-16T21:57:54Z](https://gitlab.pagerduty.com/incidents/P8SL6TS) | [21944] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-16T23:44:54Z](https://gitlab.pagerduty.com/incidents/P3GICRV) | [21946] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
| [2020-06-17T05:58:09Z](https://gitlab.pagerduty.com/incidents/PQIDD18) | [21947] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-17T08:50:08Z](https://gitlab.pagerduty.com/incidents/PE2QJTR) | [21949] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-17T08:50:08Z](https://gitlab.pagerduty.com/incidents/P1D8EVI) | [21950] Firing 1 - Last successful walg basebackup was seen 11m 17s ago. |
| [2020-06-17T15:21:40Z](https://gitlab.pagerduty.com/incidents/PFF16BE) | [21951] Firing 1 - Gitaly is down on file-cny-01-stor-gprd.c.gitlab-production.internal |
| [2020-06-17T16:51:59Z](https://gitlab.pagerduty.com/incidents/P16EKWG) | [21953] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-17T16:51:59Z](https://gitlab.pagerduty.com/incidents/PJRVEEL) | [21954] Firing 1 - Last successful walg basebackup was seen 11m 25s ago. |
| [2020-06-17T18:38:04Z](https://gitlab.pagerduty.com/incidents/PG59HMO) | [21955] Firing 1 - Gitaly error rate is too high: 7.11 |
| [2020-06-17T19:26:07Z](https://gitlab.pagerduty.com/incidents/POTEJ4U) | [21956] Firing 1 - Gitaly error rate is too high: 7.58 |
| [2020-06-17T23:37:43Z](https://gitlab.pagerduty.com/incidents/PDDG02A) | [21958] Firing 1 - Chef client failures have reached critical levels |
| [2020-06-18T11:12:50Z](https://gitlab.pagerduty.com/incidents/PC1GHZO) | [21963] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T13:31:52Z](https://gitlab.pagerduty.com/incidents/PDX7V5U) | [21964] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T13:42:21Z](https://gitlab.pagerduty.com/incidents/P71FWS8) | [21965] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T13:42:35Z](https://gitlab.pagerduty.com/incidents/PP7DSHM) | [21966] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T13:47:14Z](https://gitlab.pagerduty.com/incidents/P6UGVZM) | [21967] Firing 1 - Last WALE backup was seen 20m 8s ago. |
| [2020-06-18T13:51:50Z](https://gitlab.pagerduty.com/incidents/P78HGWF) | [21968] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T14:13:45Z](https://gitlab.pagerduty.com/incidents/P2083FE) | [21969] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
| [2020-06-18T14:17:50Z](https://gitlab.pagerduty.com/incidents/PM32EGK) | [21970] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
| [2020-06-18T14:33:45Z](https://gitlab.pagerduty.com/incidents/P4PR9GZ) | [21971] Firing 1 - Failed to collect Redis metrics Check the status of redis on `redis-sidekiq-03-db-gprd.c.gitlab-production.internal:9121` with `gitlab-ctl status`. |
| [2020-06-18T17:19:32Z](https://gitlab.pagerduty.com/incidents/PXXF4KY) | [21972] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
| [2020-06-18T22:00:33Z](https://gitlab.pagerduty.com/incidents/PPD4PHT) | [21974] Multiple full backups are created daily by WAL-E (should be once per day) |
| [2020-06-19T00:56:59Z](https://gitlab.pagerduty.com/incidents/PRQO7C6) | [21975] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-19T08:58:22Z](https://gitlab.pagerduty.com/incidents/PUON0OO) | [21977] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-19T09:20:02Z](https://gitlab.pagerduty.com/incidents/P6Z2GB5) | [21978] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
| [2020-06-19T16:58:41Z](https://gitlab.pagerduty.com/incidents/PW9MY3N) | [21981] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-19T17:20:17Z](https://gitlab.pagerduty.com/incidents/P3C977Z) | [21982] Firing 1 - The Disk Utilization per Device per Node resource of the gitaly service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
| [2020-06-19T19:40:28Z](https://gitlab.pagerduty.com/incidents/P0YIHPS) | [21983] Firing 1 - Chef client failures have reached critical levels |
| [2020-06-19T20:09:28Z](https://gitlab.pagerduty.com/incidents/P397C9Q) | [21984] Firing 1 - Chef client failures have reached critical levels |
| [2020-06-20T01:00:09Z](https://gitlab.pagerduty.com/incidents/PAU7EQQ) | [21986] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-20T01:00:09Z](https://gitlab.pagerduty.com/incidents/PWHJDB0) | [21987] Firing 1 - Last successful walg basebackup was seen 12m 21s ago. |
| [2020-06-20T09:02:01Z](https://gitlab.pagerduty.com/incidents/PE7F8WY) | [21989] Firing 1 - Last successful walg basebackup was seen 12m 30s ago. |
| [2020-06-20T09:02:01Z](https://gitlab.pagerduty.com/incidents/PAYU7H8) | [21990] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-20T16:07:40Z](https://gitlab.pagerduty.com/incidents/P0L7YOS) | [21995] Firing 1 - Postgres transactions showing high rate of statement timeouts |
| [2020-06-20T16:16:54Z](https://gitlab.pagerduty.com/incidents/P4J7SPX) | [21996] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-20T17:03:22Z](https://gitlab.pagerduty.com/incidents/P8ZU9E8) | [21998] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-20T17:03:22Z](https://gitlab.pagerduty.com/incidents/PDD0QN1) | [21997] Firing 1 - Last successful walg basebackup was seen 12m 38s ago. |
| [2020-06-21T01:03:41Z](https://gitlab.pagerduty.com/incidents/PGFVO5M) | [22001] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-21T01:03:41Z](https://gitlab.pagerduty.com/incidents/PSMFRRY) | [22000] Firing 1 - Last successful walg basebackup was seen 12m 46s ago. |
| [2020-06-21T09:05:09Z](https://gitlab.pagerduty.com/incidents/P2FMUGC) | [22002] Firing 1 - Last successful walg basebackup was seen 12m 54s ago. |
| [2020-06-21T09:05:09Z](https://gitlab.pagerduty.com/incidents/PLC8V3J) | [22003] Firing 1 - Redis cluster gitlab is missing instances |
| [2020-06-21T17:07:00Z](https://gitlab.pagerduty.com/incidents/PELU5VT) | [22005] Firing 1 - Last successful walg basebackup was seen 13m 2s ago. |
| [2020-06-21T20:43:50Z](https://gitlab.pagerduty.com/incidents/PQ20E87) | [22006] Firing 2 - IncreasedServerResponseErrors |
| [2020-06-21T20:59:05Z](https://gitlab.pagerduty.com/incidents/PYUJ25O) | [22007] Firing 2 - IncreasedServerResponseErrors |
| [2020-06-21T21:03:20Z](https://gitlab.pagerduty.com/incidents/PA1DG4H) | [22008] Firing 2 - IncreasedServerResponseErrors |
| [2020-06-21T21:09:05Z](https://gitlab.pagerduty.com/incidents/PT74HAV) | [22009] Firing 1 - Increased Server Response Errors |
| [2020-06-21T22:57:24Z](https://gitlab.pagerduty.com/incidents/P7RKJH0) | [22010] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-22T01:17:50Z](https://gitlab.pagerduty.com/incidents/P6TH0Y5) | [22011] Firing 3 - IncreasedServerResponseErrors |
| [2020-06-22T02:49:22Z](https://gitlab.pagerduty.com/incidents/PJBU7SB) | [22012] Firing 1 - Increased Server Response Errors |
| [2020-06-22T03:14:50Z](https://gitlab.pagerduty.com/incidents/PIYRZID) | [22013] Firing 1 - Increased Server Response Errors |
| [2020-06-22T03:19:50Z](https://gitlab.pagerduty.com/incidents/PO466GY) | [22014] Firing 1 - Increased Server Response Errors |
| [2020-06-22T03:24:50Z](https://gitlab.pagerduty.com/incidents/PDUJUQO) | [22015] Firing 3 - IncreasedServerResponseErrors |
| [2020-06-22T06:57:39Z](https://gitlab.pagerduty.com/incidents/PCSVNEY) | [22016] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-22T14:57:54Z](https://gitlab.pagerduty.com/incidents/PLCPL2W) | [22018] Firing 1 - Large amount of Sidekiq Queued jobs |
| [2020-06-22T15:37:50Z](https://gitlab.pagerduty.com/incidents/PDVA7IL) | [22020] Firing 1 - Increased Server Response Errors |
| [2020-06-23T01:11:59Z](https://gitlab.pagerduty.com/incidents/P0DKBS2) | [22021] Firing 1 - Last successful walg basebackup was seen 13m 34s ago. |

</details>

### 7 Day Issue Stats

* Oncall issues : **2**
* Access Request : **0**
* Change Issues : **3**
* Incident Issues : **19**
* CorrectiveAction Issues : **0**

#### Change Issues

* 2020-06-20T10:20:51Z - [Repository migration on gitlab.com (nfs-file42)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2303) - glopezfernandez
* 2020-06-20T10:15:26Z - [Repository migration on gitlab.com (nfs-file45)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2302) - glopezfernandez
* 2020-06-20T10:12:29Z - [Repository migration on gitlab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2301) - unassigned

#### Incident Issues

<details>
<summary>Show/Hide Table</summary>

* 2020-06-22T19:23:59Z - [Elasticsearch cluster not responding on 2020-06-22 around 17:00 UTC](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2316) - unassigned | production~3760141 | production~13295528 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2316`
* 2020-06-22T16:01:07Z - [2020-06-22 - Increased Server Response Errors](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2313) - nnelson | production~3760142 | production~13051215 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2313`
* 2020-06-22T10:58:44Z - [2020-06-22: Quality Engineering experiencing SSL Cert issues with qa-tunnel.gitlab.info SSL cert](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2310) - hphilipps | production~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2310`
* 2020-06-22T09:06:33Z - [Increased Sidekiq mailers error rate](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2309) - craigf | production~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2309`
* 2020-06-22T03:27:15Z - [Intermittent error spikes for pages backend](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2308) - craig | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2308`
* 2020-06-21T20:48:32Z - [2020-06-21 - Increased server response errors](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2306) - nnelson | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2306`
* 2020-06-20T16:20:33Z - [2020-06-20 - Postgres transactions showing high rate of statement timeouts](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2305) - nnelson | production~3760141 | production~12899300 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2305`
* 2020-06-20T13:35:40Z - [Elevated error rate for web service](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2304) - hphilipps | production~3760142 | production~13297951 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2304`
* 2020-06-19T19:46:33Z - [2020-06-19 - Chef client failures have reached critical levels](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2300) - nnelson | production~3760142 | production~13297953 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2300`
* 2020-06-18T21:55:14Z - [2020-06-18 - Multiple full backups are created daily by WAL-E (should be once per day)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2298) - nnelson | production~3760141 | production~13297239 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2298`
* 2020-06-18T14:40:59Z - [2020-06-18: Redis down on redis-sidekiq-03](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2296) - hphilipps | production~3760141 | production~12971801 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2296`
* 2020-06-17T18:41:48Z - [2020-06-17 - Gitaly error rate is too high: 7.11 (for file-praefect-02-stor-gprd.c.gitlab-production.internal)](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2294) - nnelson | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2294`
* 2020-06-17T16:59:01Z - [2020-06-17 - [Note: `ops` environment only] Redis cluster gitlab is missing instances](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2293) - nnelson | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2293`
* 2020-06-17T11:11:22Z - [2020-06-17: errors on /api/v4/internal/allowed](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2291) - unassigned | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2291`
* 2020-06-17T10:08:04Z - [2020-06-17: Cannot create new sessions on GitLab.com](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2290) - unassigned | production~3760141 | production~13297951 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2290`
* 2020-06-17T00:00:00Z - [2020-06-17: Elevated error rates for praefect](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2289) - craig | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2289`
* 2020-06-16T22:01:36Z - [2020-06-16 - Large amount of Sidekiq Queued jobs](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2288) - nnelson | production~3760142 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2288`
* 2020-06-16T15:42:20Z - [2020-06-16: Python dependency not installed in our CNG images prevents some use of RST file rendering](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2285) - stanhu | production~3760141 | production~12899300 | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2285`
* 2020-06-16T15:25:34Z - [2020-06-16: Unable to deploy to production due missing configuration](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2284) - marin | production~3760141 | | `https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2284`

</details>

#### CorrectiveAction Issues

* 2020-06-22T11:23:28Z - [Investigate number of concurrent connections made by some of our services to the same external addresses](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10627) - unassigned
* 2020-06-17T21:27:47Z - [Remove storage shards from deploy nodes' Praefect configuration](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10590) - unassigned
* 2020-06-17T13:38:24Z - [Review cache TTLs (and global default TTL)](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10580) - unassigned
* 2020-06-16T16:40:37Z - ["ci_jwt_signing_key" needs to be configured in all GitLab environments](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10573) - skarbek

### Open Issue Stats

* [Oncall issues](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=oncall) : **5**
* [Change issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=change) : **9**
* [Incident issues](https://gitlab.com/gitlab-com/production/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=incident) : **16**
* [Access Request](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=access%20request) : **4**
* [CorrectiveAction](https://gitlab.com/gitlab-com/infrastructure/issues?scope=all&utf8=%E2%9C%93&state=opened&label_name[]=corrective%20action) : **97**

#### Open Change Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Assignee | Summary |
| ------ | -------- | ------- |
| [2020-06-20T10:20:51Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2303) | glopezfernandez | Repository migration on gitlab.com (nfs-file42) |
| [2020-06-20T10:15:26Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2302) | glopezfernandez | Repository migration on gitlab.com (nfs-file45) |
| [2020-06-20T10:12:29Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2301) | unassigned | Repository migration on gitlab.com |
| [2020-06-12T09:25:28Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2272) | unassigned | Repository migration on gitlab.com (nfs-file09) |
| [2020-06-12T09:25:17Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2271) | unassigned | Repository migration on gitlab.com (nfs-file08) |
| [2020-06-12T09:25:06Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2270) | unassigned | Repository migration on gitlab.com (nfs-file07) |
| [2020-06-11T20:51:56Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2266) | unassigned | Repository migration on gitlab.com (nfs-file03) |
| [2020-06-08T22:05:46Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2246) | nnelson | Migrate large projects off file-42-stor-gprd to file-02-stor-gprd |
| [2020-03-26T19:16:25Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1847) | alejandro | Rotate credentials for user `gitlab-superuser` |

</details>

#### Open Incident Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Assignee | Summary |
| ------ | -------- | ------- |
| [2020-06-22T19:23:59Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2316) | unassigned | Elasticsearch cluster not responding on 2020-06-22 around 17:00 UTC |
| [2020-06-22T09:06:33Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2309) | craigf | Increased Sidekiq mailers error rate |
| [2020-06-22T03:27:15Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2308) | craig | Intermittent error spikes for pages backend |
| [2020-06-18T21:55:14Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2298) | nnelson | 2020-06-18 - Multiple full backups are created daily by WAL-E (should be once per day) |
| [2020-06-18T14:40:59Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2296) | hphilipps | 2020-06-18: Redis down on redis-sidekiq-03 |
| [2020-06-17T10:08:04Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2290) | unassigned | 2020-06-17: Cannot create new sessions on GitLab.com |
| [2020-06-16T15:42:20Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2285) | stanhu | 2020-06-16: Python dependency not installed in our CNG images prevents some use of RST file rendering |
| [2020-06-16T15:25:34Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2284) | marin | 2020-06-16: Unable to deploy to production due missing configuration |
| [2020-06-10T13:53:13Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2259) | ahanselka | 2020-06-10: Elevated web latency |
| [2020-06-09T11:27:23Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2251) | nolith | 2020-06-09: post-deployment migration failure |
| [2020-06-08T04:08:25Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2242) | unassigned | 2020-06-08: High rate of canary errors: DDoS |
| [2020-06-05T13:39:49Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2239) | unassigned | 2020-06-05: increased error rates on the web service |
| [2020-06-05T07:58:05Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2236) | unassigned | 2020-06-05: surge in authorized_project_update jobs is saturating catchall workers |
| [2020-06-04T03:17:59Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2223) | cmiskell | 2020-06-04 Large load spike on API fleet causing response degradation |
| [2020-05-29T09:07:54Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2203) | nolith | 2020-05-29: HTTP 401s on various components of the GitLab UI |
| [2020-05-29T05:21:12Z](https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2201) | ggillies | 2020-05-29: gitlab.com is down |

</details>

#### Open Oncall Issues

<details>
<summary>Show/Hide Table</summary>

| Created | Assignee | Summary |
| ------ | -------- | ------- |
| [2020-06-19T13:23:10Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10614) | hphilipps | Write runbook for Project Export |
| [2020-06-10T19:02:48Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10519) | unassigned | Import request (for alex-solutions/core): alex-app |
| [2020-05-25T05:05:45Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10319) | albertoramos | Archived repository missing |
| [2020-03-30T13:38:11Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9660) | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
| [2019-10-23T13:05:14Z](https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8249) | cmcfarland | cleanup registered nodes in chef |

</details>

_This issue was automatically generated using [oncall-robot-assistant](https://gitlab.com/gitlab-com/gl-infra/oncall-robot-assistant)_