Weekly Reliability (SRE) Team Newsletter – Period: 2020-06-29 to 2020-07-06
Announcements
- We're asking SREs to volunteer to execute Postgres runbooks for upcoming scheduled demos. If you receive an invitation and are interested, please reach out to @albertoramos to coordinate. Feedback from SREs on how easy the runbooks are to understand and execute, and how complete they are, is very valuable to us; your perspective is likely to be very different from that of a DBA/DBRE. See &252.
Team Updates
Core Infrastructure
- GKE cluster upgrades are underway; gprd will follow soon: delivery#889 (comment 372945586)
- The first iteration of the Vault cluster is up: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10200. We'll start iterating on some initial secrets on that server soon (see the sketch after this list). For questions, check with @devin and @ggillies.
- Many Incident Reviews and associated ~reliability issues are on the radar; we'll be triaging those.
- CI HAProxy nodes (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10069) have been taken out of service and are about to be decommissioned.
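To make the Vault item above concrete, here is a minimal sketch of writing and reading an initial secret with the hvac Python client, assuming a KV v2 engine at the default mount. The address, token source, secret path, and keys are all hypothetical placeholders; the issue doesn't specify which secrets will land on the server first.

```python
# Minimal sketch: writing and reading an initial secret on the new Vault
# cluster with the hvac client. The address, token source, mount point,
# and secret path are hypothetical placeholders.
import os

import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",  # hypothetical address
    token=os.environ["VAULT_TOKEN"],            # e.g. obtained from a login flow
)

# Write a secret to the KV v2 engine (mounted at "secret/" by default).
client.secrets.kv.v2.create_or_update_secret(
    path="gitlab-com/example-service/db",       # hypothetical path
    secret={"username": "app", "password": "s3cr3t"},
)

# Read it back; KV v2 nests the payload under data.data.
read = client.secrets.kv.v2.read_secret_version(path="gitlab-com/example-service/db")
print(read["data"]["data"])
```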
Datastores
- We continue with the runbook simulations this week, now with only SREs/DBREs taking them on (no OnGres). Ahmad, Henri, and Alejandro will run them this week.
- The pg_repack DB change (aiming to reduce the bloat of DB indexes/tables) hit some challenges in staging (a Ruby error during testing); @nels is working to move it forward, production#1785 (closed).
- Switching from WAL-E to WAL-G on Patroni for DB backups and WAL shipping is currently being tested in staging with promising results: backup size is reduced by a factor of 3 using Brotli compression (see the sketch below). It will be rolled out to production this week or early next week, and will enable us to run backups from a replica, reducing the load on the primary node.
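For the WAL-G item above, the factor-of-3 reduction comes from Brotli's compression ratio on WAL and base-backup data. A minimal sketch to sanity-check that ratio on a sample WAL segment follows; the file path and quality setting are assumptions, and WAL-G's real pipeline streams chunks rather than compressing in one shot.

```python
# Minimal sketch: estimate the Brotli compression ratio on a sample
# PostgreSQL WAL segment. The path is a hypothetical placeholder, and
# the quality setting may differ from what WAL-G actually uses.
import brotli

# A 16 MiB WAL segment file (the default PostgreSQL segment size).
with open("/var/lib/postgresql/wal/000000010000000000000001", "rb") as f:
    raw = f.read()

# Lower quality trades compression ratio for speed.
compressed = brotli.compress(raw, quality=3)
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```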
Observability
- We now have a GKE cluster running Prometheus and Alertmanager
- pubsubbeat is running on GKE and sending metrics to the GKE Prometheus (see the query sketch after this list)! — https://prometheus-gke.ops.gitlab.net/graph?g0.range_input=1h&g0.expr=pubsubbeat_cpu_ticks_total&g0.tab=0
- After a few hiccups, Elasticsearch and Kibana were upgraded to 7.8 last week. We'll be downsizing the cluster once we have a steady performance baseline — &267 (closed)
- Focus continues on building our long-term logging strategy — https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10095
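As a concrete illustration of the pubsubbeat item above, here is a minimal sketch that pulls the same metric through the standard Prometheus HTTP API instead of the graph UI. The endpoint path and response shape are standard Prometheus; any authentication in front of the ops host is assumed to be handled elsewhere and is omitted here.

```python
# Minimal sketch: query the GKE Prometheus HTTP API for the pubsubbeat
# metric referenced above. Assumes network access to the ops host; any
# auth in front of it is omitted.
import requests

PROM = "https://prometheus-gke.ops.gitlab.net"

resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": "rate(pubsubbeat_cpu_ticks_total[5m])"},
    timeout=10,
)
resp.raise_for_status()

# An instant query returns a vector: one (labels, [timestamp, value]) per series.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    timestamp, value = series["value"]
    print(labels.get("pod", "?"), value)
```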
On-Call During This Period
Schedule | Username |
---|---|
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
- Number of incidents: **37**
Created | Summary |
---|---|
2020-06-30T00:29:50Z | [22117] Firing 2 - IncreasedErrorRateOtherBackends |
2020-06-30T00:37:01Z | [22118] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-06-30T01:30:05Z | [22120] Firing 1 - Increased Error Rate Across Fleet |
2020-06-30T01:47:42Z | [22121] Firing 1 - Last WALE backup was seen 20m 4s ago. |
2020-06-30T17:51:21Z | [22127] gitlab.net zone has elevated HTTP 5xx error rate |
2020-07-01T03:30:35Z | [22130] Firing 1 - Increased Server Response Errors |
2020-07-01T08:19:50Z | [22131] Firing 1 - Alertmanager is failing sending notifications |
2020-07-01T08:19:51Z | [22132] Firing 1 - Alertmanager is failing sending notifications |
2020-07-01T09:02:01Z | [22134] Shared-runners jobs piling up |
2020-07-01T09:02:01Z | [22133] Shared-runners jobs piling up |
2020-07-01T14:59:20Z | [22136] Firing 2 - AlertmanagerNotificationsFailing |
2020-07-01T14:59:20Z | [22137] Firing 2 - AlertmanagerNotificationsFailing |
2020-07-01T15:48:27Z | [22139] Firing 1 - Last WAL was archived 20m 14s ago. |
2020-07-01T21:00:32Z | [22142] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-01T23:50:33Z | [22146] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-02T00:36:32Z | [22148] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-02T15:38:32Z | [22153] 500 errors during deployment |
2020-07-03T02:11:12Z | [22157] Firing 1 - customers.gitlab.com is down for 2 minutes |
2020-07-03T02:11:13Z | [22158] Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes |
2020-07-03T09:02:47Z | [22159] production#2367 (closed) |
2020-07-03T09:24:32Z | [22160] Firing 1 - The Disk Utilization per Device per Node resource of the console-node service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-03T09:28:23Z | [22161] Firing 1 - 5% disk space left |
2020-07-03T12:39:47Z | [22162] Firing 1 - The Disk Utilization per Device per Node resource of the console-node service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-04T00:04:14Z | [22167] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-07-04T14:33:58Z | [22168] Firing 1 - Last WAL was archived 20m 13s ago. |
2020-07-04T20:00:50Z | [22169] Firing 1 - Increased Error Rate Across Fleet |
2020-07-04T20:00:51Z | [22170] Firing 1 - High Error Rate on Front End Web |
2020-07-04T20:01:05Z | [22171] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-07-04T20:01:26Z | [22172] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-07-04T20:03:30Z | [22174] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
2020-07-06T00:30:54Z | [22179] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T08:31:05Z | [22180] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T13:01:37Z | [22181] Firing 1 - prometheus is unreachable |
2020-07-06T15:27:39Z | [22184] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T17:50:20Z | [22187] Firing 2 - IncreasedBackendConnectionErrors |
2020-07-06T17:50:20Z | [22188] Firing 2 - IncreasedServerConnectionErrors |
2020-07-06T23:16:10Z | [22191] Pingdom check check:https://version.gitlab.com/ is down |
7 Day Issue Stats
- Oncall issues: 1
- Access Request: 0
- Change Issues: 0
- Incident Issues: 10
- CorrectiveAction Issues: 1
Change Issues
Incident Issues
- 2020-07-04T20:08:17Z - 2020-07-04 Spike in 500 errors | ~S2 | ~"Service::Web" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2370
- 2020-07-03T08:03:54Z - 2020-07-03: Chatops runner is not responding | ~S4 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2367
- 2020-07-03T02:33:27Z - 2020-07-03: Triggered #22158: Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes | ~S3 | ~"Service::Customers" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2366
- 2020-07-02T16:29:51Z - 2020-07-02: Extreme Increase in Egress Cost to China - July 1st | ~S3 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2364
- 2020-07-02T15:38:31Z - 500 errors during deployment | ~S4 | ~"Service::API" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2363
- 2020-07-01T09:02:00Z - 2020-07-01: Connectivity issues to Docker Hub causing stalled CI jobs on shared runners | ~S2 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2357
- 2020-06-30T17:51:20Z - gitlab.net zone has elevated HTTP 5xx error rate | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2354
- 2020-06-30T17:14:45Z - 2020-06-30 Kibana upgrade from 7.5 to 7.8 failed and we lost the Kibana index (which contains index patterns, dashboards, saved searches, etc.) | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2355
- 2020-06-30T08:42:56Z - 2020-06-30: Connectivity issues to Docker Hub causing stalled CI jobs on shared runners | ~S2 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2351
- 2020-06-30T00:56:59Z - 2020-06-30: #22117: Firing 2 - IncreasedErrorRateOtherBackends | ~S2 | ~"Service::API" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2349
CorrectiveAction Issues
- 2020-07-06T11:27:53Z - Create a stackdriver exporter for the gitlab-ci project
- 2020-07-02T07:14:03Z - customers.GitLab.com should have Prometheus monitoring
- 2020-07-01T14:12:45Z - Use Google Container Registry to alleviate pressure on Docker Hub
Open Issue Stats
- Oncall issues: 4
- Change issues: 3
- Incident issues: 22
- Access Request: 4
- CorrectiveAction: 97
Open Change Issues
Created | Summary |
---|---|
2020-06-29T04:26:00Z | Migrate large projects off file-47-stor-gprd to file-07-stor-gprd |
2020-06-25T15:54:03Z | Migrate large projects off file-45-stor-gprd to other less-used shards (for example: file-05-stor-gprd) |
2020-06-08T22:05:46Z | Migrate large projects off file-42-stor-gprd to file-02-stor-gprd |
Open Incident Issues
Created | Summary |
---|---|
Open Oncall Issues
Created | Summary |
---|---|
2020-05-25T05:05:45Z | Archived repository missing |
2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cleanup registered nodes in chef |
2019-05-15T19:10:07Z | customers.gitlab.com - out of disk space |
This issue was automatically generated using oncall-robot-assistant