Weekly Reliability (SRE) Team Newsletter – Period: 2020-06-29 to 2020-07-06
Announcements
- We're asking SREs to volunteer to execute Postgres runbooks for upcoming scheduled demos. If you receive an invitation and are interested, please reach out to @albertoramos to coordinate. Feedback from SREs on how easy the runbooks are to understand and execute, and how complete they are, is very valuable to us; your perspective is likely to be very different from that of a DBA/DBRE. See &252.
Team Updates
Core Infrastructure
- GKE cluster upgrades are underway; gprd will follow soon: delivery#889 (comment 372945586)
- The first iteration of the Vault cluster is up: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10200. We'll start iterating on some initial secrets on that server soon (see the sketch after this list). For questions, check with @devin and @ggillies.
- Many Incident Reviews and associated ~reliability issues are on the radar; we'll be triaging those.
- CI HAProxy nodes (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10069) have been taken out of service and are about to be decommissioned.
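To make the Vault item above concrete, here is a minimal sketch of writing and reading an initial secret with the hvac Python client, assuming a KV v2 engine at the default mount. The address, token source, secret path, and keys are all hypothetical placeholders; the issue doesn't specify which secrets will land on the server first.

```python
# Minimal sketch: writing and reading an initial secret on the new Vault
# cluster with the hvac client. The address, token source, mount point,
# and secret path are hypothetical placeholders.
import os

import hvac

client = hvac.Client(
    url="https://vault.example.internal:8200",  # hypothetical address
    token=os.environ["VAULT_TOKEN"],            # e.g. obtained from a login flow
)

# Write a secret to the KV v2 engine (mounted at "secret/" by default).
client.secrets.kv.v2.create_or_update_secret(
    path="gitlab-com/example-service/db",       # hypothetical path
    secret={"username": "app", "password": "s3cr3t"},
)

# Read it back; KV v2 nests the payload under data.data.
read = client.secrets.kv.v2.read_secret_version(path="gitlab-com/example-service/db")
print(read["data"]["data"])
```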
Datastores
- We continue with the runbook simulations this week, now with only SREs/DBREs taking them on (no OnGres). Ahmad, Henri, and Alejandro will run them this week.
- The pg_repack DB change (aiming to reduce the bloat of DB indexes/tables) hit some challenges in staging (a Ruby error during testing); @nels is working to move it forward, production#1785 (closed).
- Switching from WAL-E to WAL-G on Patroni for DB backups and WAL shipping is currently being tested in staging with promising results: backup size is reduced by a factor of 3 using Brotli compression (see the sketch below). It will be rolled out to production this week or early next week, and will enable us to run backups from a replica, reducing the load on the primary node.
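For the WAL-G item above, the factor-of-3 reduction comes from Brotli's compression ratio on WAL and base-backup data. A minimal sketch to sanity-check that ratio on a sample WAL segment follows; the file path and quality setting are assumptions, and WAL-G's real pipeline streams chunks rather than compressing in one shot.

```python
# Minimal sketch: estimate the Brotli compression ratio on a sample
# PostgreSQL WAL segment. The path is a hypothetical placeholder, and
# the quality setting may differ from what WAL-G actually uses.
import brotli

# A 16 MiB WAL segment file (the default PostgreSQL segment size).
with open("/var/lib/postgresql/wal/000000010000000000000001", "rb") as f:
    raw = f.read()

# Lower quality trades compression ratio for speed.
compressed = brotli.compress(raw, quality=3)
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```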
Observability
- We now have a GKE cluster running Prometheus and Alertmanager
- pubsubbeat is running on GKE and sending metrics to the GKE Prometheus (see the query sketch after this list)! — https://prometheus-gke.ops.gitlab.net/graph?g0.range_input=1h&g0.expr=pubsubbeat_cpu_ticks_total&g0.tab=0
- After a few hiccups, Elasticsearch and Kibana were upgraded to 7.8 last week. We'll be downsizing the cluster once we have a steady performance baseline — &267 (closed)
- Focus continues on building our long-term logging strategy — https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10095
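As a concrete illustration of the pubsubbeat item above, here is a minimal sketch that pulls the same metric through the standard Prometheus HTTP API instead of the graph UI. The endpoint path and response shape are standard Prometheus; any authentication in front of the ops host is assumed to be handled elsewhere and is omitted here.

```python
# Minimal sketch: query the GKE Prometheus HTTP API for the pubsubbeat
# metric referenced above. Assumes network access to the ops host; any
# auth in front of it is omitted.
import requests

PROM = "https://prometheus-gke.ops.gitlab.net"

resp = requests.get(
    f"{PROM}/api/v1/query",
    params={"query": "rate(pubsubbeat_cpu_ticks_total[5m])"},
    timeout=10,
)
resp.raise_for_status()

# An instant query returns a vector: one (labels, [timestamp, value]) per series.
for series in resp.json()["data"]["result"]:
    labels = series["metric"]
    timestamp, value = series["value"]
    print(labels.get("pod", "?"), value)
```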
On-Call During This Period
Schedule | Username |
---|---|
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
- Number of incidents: **37**
Created | Summary |
---|---|
2020-06-30T00:29:50Z | [22117] Firing 2 - IncreasedErrorRateOtherBackends |
2020-06-30T00:37:01Z | [22118] Firing 1 - The Puma Worker Saturation per Node resource of the api service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-06-30T01:30:05Z | [22120] Firing 1 - Increased Error Rate Across Fleet |
2020-06-30T01:47:42Z | [22121] Firing 1 - Last WALE backup was seen 20m 4s ago. |
2020-06-30T17:51:21Z | [22127] gitlab.net zone has elevated HTTP 5xx error rate |
2020-07-01T03:30:35Z | [22130] Firing 1 - Increased Server Response Errors |
2020-07-01T08:19:50Z | [22131] Firing 1 - Alertmanager is failing sending notifications |
2020-07-01T08:19:51Z | [22132] Firing 1 - Alertmanager is failing sending notifications |
2020-07-01T09:02:01Z | [22134] Shared-runners jobs piling up |
2020-07-01T09:02:01Z | [22133] Shared-runners jobs piling up |
2020-07-01T14:59:20Z | [22136] Firing 2 - AlertmanagerNotificationsFailing |
2020-07-01T14:59:20Z | [22137] Firing 2 - AlertmanagerNotificationsFailing |
2020-07-01T15:48:27Z | [22139] Firing 1 - Last WAL was archived 20m 14s ago. |
2020-07-01T21:00:32Z | [22142] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-01T23:50:33Z | [22146] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-02T00:36:32Z | [22148] Firing 1 - The Disk Utilization per Device per Node resource of the ops-runner service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-02T15:38:32Z | [22153] 500 errors during deployment |
2020-07-03T02:11:12Z | [22157] Firing 1 - customers.gitlab.com is down for 2 minutes |
2020-07-03T02:11:13Z | [22158] Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes |
2020-07-03T09:02:47Z | [22159] production#2367 (closed) |
2020-07-03T09:24:32Z | [22160] Firing 1 - The Disk Utilization per Device per Node resource of the console-node service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-03T09:28:23Z | [22161] Firing 1 - 5% disk space left |
2020-07-03T12:39:47Z | [22162] Firing 1 - The Disk Utilization per Device per Node resource of the console-node service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. |
2020-07-04T00:04:14Z | [22167] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-07-04T14:33:58Z | [22168] Firing 1 - Last WAL was archived 20m 13s ago. |
2020-07-04T20:00:50Z | [22169] Firing 1 - Increased Error Rate Across Fleet |
2020-07-04T20:00:51Z | [22170] Firing 1 - High Error Rate on Front End Web |
2020-07-04T20:01:05Z | [22171] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-07-04T20:01:26Z | [22172] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-07-04T20:03:30Z | [22174] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
2020-07-06T00:30:54Z | [22179] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T08:31:05Z | [22180] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T13:01:37Z | [22181] Firing 1 - prometheus is unreachable |
2020-07-06T15:27:39Z | [22184] Firing 1 - Large amount of Sidekiq Queued jobs |
2020-07-06T17:50:20Z | [22187] Firing 2 - IncreasedBackendConnectionErrors |
2020-07-06T17:50:20Z | [22188] Firing 2 - IncreasedServerConnectionErrors |
2020-07-06T23:16:10Z | [22191] Pingdom check check:https://version.gitlab.com/ is down |
7 Day Issue Stats
- Oncall issues: 1
- Access Request: 0
- Change Issues: 0
- Incident Issues: 10
- CorrectiveAction Issues: 1
Change Issues
Incident Issues
- 2020-07-04T20:08:17Z - 2020-07-04 Spike in 500 errors | ~S2 | ~"Service::Web" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2370
- 2020-07-03T08:03:54Z - 2020-07-03: Chatops runner is not responding | ~S4 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2367
- 2020-07-03T02:33:27Z - 2020-07-03: Triggered #22158: Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes | ~S3 | ~"Service::Customers" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2366
- 2020-07-02T16:29:51Z - 2020-07-02: Extreme Increase in Egress Cost to China - July 1st | ~S3 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2364
- 2020-07-02T15:38:31Z - 500 errors during deployment | ~S4 | ~"Service::API" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2363
- 2020-07-01T09:02:00Z - 2020-07-01: Connectivity issues to Docker Hub causing stalled CI jobs on shared runners | ~S2 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2357
- 2020-06-30T17:51:20Z - gitlab.net zone has elevated HTTP 5xx error rate | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2354
- 2020-06-30T17:14:45Z - 2020-06-30 Kibana upgrade from 7.5 to 7.8 failed and we lost the Kibana index (which contains index patterns, dashboards, saved searches, etc.) | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2355
- 2020-06-30T08:42:56Z - 2020-06-30: Connectivity issues to Docker Hub causing stalled CI jobs on shared runners | ~S2 | ~"Service::CI Runners" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2351
- 2020-06-30T00:56:59Z - 2020-06-30: #22117: Firing 2 - IncreasedErrorRateOtherBackends | ~S2 | ~"Service::API" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2349
CorrectiveAction Issues
- 2020-07-06T11:27:53Z - Create a stackdriver exporter for the gitlab-ci project
- 2020-07-02T07:14:03Z - customers.GitLab.com should have Prometheus monitoring
- 2020-07-01T14:12:45Z - Use Google Container Registry to alleviate pressure on Docker Hub
Open Issue Stats
- Oncall issues: 4
- Change issues: 3
- Incident issues: 22
- Access Request: 4
- CorrectiveAction: 97
Open Change Issues
Created | Summary |
---|---|
2020-06-29T04:26:00Z | Migrate large projects off file-47-stor-gprd to file-07-stor-gprd |
2020-06-25T15:54:03Z | Migrate large projects off file-45-stor-gprd to other less-used shards (for example: file-05-stor-gprd) |
2020-06-08T22:05:46Z | Migrate large projects off file-42-stor-gprd to file-02-stor-gprd |
Open Incident Issues
Created | Summary |
---|---|
Open Oncall Issues
Created | Summary |
---|---|
2020-05-25T05:05:45Z | Archived repository missing |
2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cleanup registered nodes in chef |
2019-05-15T19:10:07Z | customers.gitlab.com - out of disk space |
This issue was automatically generated using oncall-robot-assistant