Weekly Reliability (SRE) Team Newsletter – On-call Period: 2022-03-15 - 2022-03-22
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Cameron McFarland |
| SRE 8-hour Americas | Marcel Chacon |
| SRE 8-hour APAC | Craig Barrett |
| SRE 8-hour EMEA | Alejandro Rodriguez |
| SRE 8-hour EMEA | Igor Wiedler |
PagerDuty Incidents
See the 1 week report for acknowledged PD pages (long-term trend)
Alerts Volume
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 19
- Incident Issues : 43
- CorrectiveAction Issues : 0
Change Issues
- 2022-03-21T00:02:14Z - 2022-03-21: Grow gitlab-logs-prod warm tier
- 2022-03-20T23:54:52Z - 2022-03-21: Enable autoscaling for pubsubbeat
- 2022-03-18T19:25:49Z - Removal of foreign key fk_e4ef9c2f27 on PRD
- 2022-03-18T19:19:21Z - Manually mark migration as complete to fix deploy
- 2022-03-18T11:55:06Z - Set up ops staging environment
- 2022-03-17T17:28:34Z - Import projects into project_build_artifacts_size_refreshes
- 2022-03-17T13:31:25Z - 2022-03-18: Upgrade Prometheus servers in gprd GKE Clusters
- 2022-03-17T11:50:33Z - Adjust batch_size, pause_ms and sub_batch_size of NullifyOrphanRunnerIdOnCiBuilds migration
- 2022-03-17T11:05:45Z - Grow Elasticsearch cluster gitlab-logs-prod from 9 hot nodes to 11
- 2022-03-16T22:51:46Z - 2022-03-16: Add PVCs to Alertmanager
- 2022-03-16T20:52:42Z - [GPRD] - Further increase the number of concurrently archived WAL files to mitigate pileup (15 => 20)
- 2022-03-16T17:47:32Z - [gstg] Drain and reboot each frontend service member instance one at a time
- 2022-03-16T16:30:48Z - [gprd] Drain and reboot each frontend service member instance one at a time
- 2022-03-15T15:30:43Z - [gprd] Replace redis-cache-sentinel instances after changing machine_type from n1-standard-1 to n2d-standard-4
- 2022-03-15T14:32:59Z - [GPRD] - Increase the number of concurrently archived WAL files to mitigate pileup (10 => 15)
- 2022-03-15T14:24:49Z - [GSTG] Reprovision HAProxy with a single NIC
- 2022-03-15T13:54:10Z - 2022-03-15: Delete marketo hook
- 2022-03-15T13:17:03Z - 2022-03-15: Disable api-gke-us-east1-d on gstg
- 2022-03-15T05:31:17Z - 2022-03-15: Increase zip cache expiration for pages
Incident Issues
- 2022-03-21T03:02:31Z - 2022-03-21: Multiple pages SLI, pingdom, blackbox alerts | reliability~3760141 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6657
- 2022-03-20T15:28:36Z - 2022-03-20: Blackbox probes for https://customers.gitlab.com are failing. | reliability~3760141 | ServiceCustomersDot | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6654
- 2022-03-20T07:58:28Z - 2022-03-20 Salesforce authentication failing in CustomersDot production | reliability~3760140 | ServiceCustomersDot | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6653
- 2022-03-20T06:45:04Z - 2022-03-20: The sshServices SLI of the frontend service (main stage) has an apdex violating SLO | reliability~3760140 | ServiceFrontend | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6652
- 2022-03-19T10:29:37Z - 2022-03-19: The goserver SLI of the gitaly service on node file-hdd-01-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760141 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6651
- 2022-03-19T10:21:19Z - 2022-03-19: The goserver_op_service SLI of the gitaly service on node file-cny-01-stor-gprd.c.gitlab-production.internal has not received any traffic in the past 30m | reliability~3760142 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6650
- 2022-03-18T23:43:49Z - 2022-03-18: Number of Gitaly shards (for new repositories) is low | reliability~3760142 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6649
- 2022-03-18T18:47:40Z - 2022-03-18: QA failures on gstg-cny | reliability~3760140 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6643
- 2022-03-18T18:36:10Z - 2022-03-18: Post Deploy migrations Failure on Auto-Deploy | reliability~3760140 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6642
- 2022-03-18T17:39:32Z - 2022-03-18: The goserver_op_service SLI of the gitaly service on node file-22-stor-gprd.c.gitlab-production.internal has an error rate violating SLO | reliability~3760141 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6641
- 2022-03-18T15:21:15Z - 2022-03-18: The loadbalancer SLI of the web-pages service in region us-east has an error rate violating SLO | reliability~3760141 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6640
- 2022-03-18T13:44:46Z - 2022-03-18: CloudSqlServiceCloudsqlTransactionsErrorSLOViolation | reliability~3760142 | ServiceGrafana | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6639
- 2022-03-18T04:42:15Z - 2022-03-18: CustomersDot: 500 error when attempting to change linked namespace | reliability~3760142 | ServiceCustomersDot | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6635
- 2022-03-18T01:57:55Z - 2022-03-18: The sentry_events SLI of the sentry service (main stage) has an apdex violating SLO | reliability~3760141 | ServiceSentry | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6634
- 2022-03-18T01:28:18Z - 2022-03-18: Some notification emails are delayed | reliability~3760141 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6633
- 2022-03-17T20:07:19Z - 2022-03-17: Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) | reliability~3760142 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6632
- 2022-03-17T16:42:24Z - 2022-03-17: Multiple versions of Gitaly have been running alongside one another | reliability~3760140 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6630
- 2022-03-17T16:22:07Z - 2022-03-17: QA gprd-cny smoke failure | reliability~3760140 | ServiceUnknown | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6629
- 2022-03-17T11:12:03Z - 2022-03-17: Site wide performance degradation | reliability~3760139 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6625
- 2022-03-17T05:58:53Z - 2022-03-17: Commit via the API fails with error 500 during a QA test | reliability~3760141 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6623
- 2022-03-17T05:32:26Z - 2022-03-17: Unable to load branches on Gitlab project | reliability~3760140 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6622
- 2022-03-16T21:57:23Z - 2022-03-16: The imagescaler SLI of the web service in region us-east1-d has an apdex violating SLO | reliability~3760141 | ~"Service::Workhorse" | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6619
- 2022-03-16T14:33:53Z - 2022-03-16: The Horizontal Pod Autoscaler Desired Replicas resource of the sidekiq service (main stage) has a saturation exceeding SLO and is close to its capacity limit. | reliability~3760141 | ServiceSidekiq | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6615
- 2022-03-16T13:47:32Z - 2022-03-16: Postgres primary log disk filled up | reliability~3760141 | ServicePatroni | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6614
- 2022-03-16T13:37:03Z - 2022-03-16: Reports of SSL certificate problem: unable to get local issuer certificate for some CI jobs | reliability~3760142 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6613
- 2022-03-16T12:24:07Z - 2022-03-16: gitaly server not available for gitlab-org/gitlab repository | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6612
- 2022-03-16T09:46:16Z - 2022-03-16: Prometheus on GKE stgsub-customers-gke has gone missing | reliability~3760142 | ServicePrometheus | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6611
- 2022-03-16T09:31:58Z - 2022-03-16: specs_without_cluster failing and preventing deployments | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6610
- 2022-03-15T21:19:50Z - 2022-03-15: PostgreSQL queries dominating total query time | reliability~3760140 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6609
- 2022-03-15T18:45:14Z - 2022-03-15 OpenSSL vulnerability for CVE-2022-0778 | reliability~3760140 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6605
- 2022-03-15T15:09:46Z - 2022-03-15: PubSub queuing high | reliability~3760141 | ServiceLogging | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6601
- 2022-03-15T14:55:24Z - 2022-03-15: The Cloud NAT Gateway Port Allocation resource of the nat service (main stage) has a saturation exceeding SLO and is close to its capacity limit | reliability~3760141 | ServiceNAT | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6600
- 2022-03-15T13:41:20Z - 2022-03-15: Missing objects in gitlab-org/gitlab | reliability~3760140 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6596
- 2022-03-15T12:40:33Z - 2022-03-15: Brief spike in artifact upload failures due to runner configuration change | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6594
- 2022-03-15T02:46:01Z - 2022-03-15 The GitLab job clone resource zlonk.datalytics.dailyx has failed. | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6591
- 2022-03-14T20:31:52Z - 2022-03-14: The loadbalancer SLI of the web-pages service in region us-east has an error rate violating SLO | reliability~3760141 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6589
- 2022-03-14T17:23:51Z - 2022-03-14: Gitlab.com issues with async jobs | reliability~3760139 | ServiceFrontend | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6586
- 2022-03-14T16:46:47Z - 2022-03-14 - Users Blocked from Gitlab.com by Cloudflare DDoS Page | reliability~3760141 | ServiceCloudflare | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6585
- 2022-03-14T15:58:32Z - 2022-03-14: PubSub queuing high | reliability~3760141 | ServiceLogging | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6583
- 2022-03-14T14:24:30Z - 2022-03-13: Postgres pending WAL files on primary is high | reliability~3760141 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6581
- 2022-03-14T12:08:12Z - 2022-03-14: The loadbalancer SLI of the pages service (main stage) has an error rate violating SLO | reliability~3760141 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6580
- 2022-03-14T11:21:44Z - 2022-03-14: PubSub queuing high | reliability~3760141 | ServiceLogging | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6579
- 2022-03-14T07:13:23Z - 2022-03-14: Containers for the monitoring service, main are unable to unable to start. | reliability~3760141 | ServiceMonitoring-Other | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6577
CorrectiveAction Issues
- 2022-03-18T23:38:42Z - Add new CI/CD Limits per 2022-03-14 Incident
- 2022-03-18T23:30:16Z - Add note in gitaly-weights-assigner that it has assigned 0% to too many nodes
- 2022-03-18T14:22:01Z - Deduplicate grafana SQL alerts
- 2022-03-18T14:17:57Z - Push grafana logs to logging cluster
- 2022-03-16T18:48:48Z - Corrective action: The Horizontal Pod Autoscaler Desired Replicas resource of the sidekiq service (main stage) has a saturation exceeding SLO and is close to its capacity limit.
- 2022-03-16T06:32:40Z - Configure Horizontal Pod Autoscaling for pubsubbeat deployments based on PubSub metrics
- 2022-03-15T10:40:29Z - Corrective action: foo
- 2022-03-14T20:42:30Z - Can we rate-limit self-made API calls to gitlab.com
- 2022-03-14T20:05:52Z - Webhook Destroy work should not be in catchall
- 2022-03-14T19:56:28Z - Update Ingress Allow Lists for gitlab.com
- 2022-03-14T14:20:24Z - Enforce rate limits on TLS connections for GitLab Pages
Open Issue Stats
- Oncall issues : 3
- Change issues : 6
- Incident issues : 19
- Access Request : 0
- CorrectiveAction : 99
Open Change Issues
| Created | Summary |
|---|---|
| 2022-03-18T19:25:49Z | Removal of foreign key fk_e4ef9c2f27 on PRD |
| 2022-03-17T17:28:34Z | Import projects into project_build_artifacts_size_refreshes |
| 2022-03-17T11:50:33Z | Adjust batch_size, pause_ms and sub_batch_size of NullifyOrphanRunnerIdOnCiBuilds migration |
| 2022-03-17T11:05:45Z | Grow Elasticsearch cluster gitlab-logs-prod from 9 hot nodes to 11 |
| 2022-03-15T15:30:43Z | [gprd] Replace redis-cache-sentinel instances after changing machine_type from n1-standard-1 to n2d-standard-4 |
| 2022-03-15T13:54:10Z | 2022-03-15: Delete marketo hook |
Open Incident Issues
| Created | Summary |
|---|---|
| 2022-03-18T17:39:32Z | 2022-03-18: The goserver_op_service SLI of the gitaly service on node file-22-stor-gprd.c.gitlab-production.internal has an error rate violating SLO |
| 2022-03-18T01:28:18Z | 2022-03-18: Some notification emails are delayed |
| 2022-03-17T05:58:53Z | 2022-03-17: Commit via the API fails with error 500 during a QA test |
| 2022-03-14T14:24:30Z | 2022-03-13: Postgres pending WAL files on primary is high |
| 2022-02-12T19:20:49Z | 2022-02-12: Increased latency from us-east1-d for GCS buckets |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| 2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
| 2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by Kennedy Wanyangu