Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-09-14 - 2021-09-21
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Alex Hanselka |
| SRE 8-hour Americas | Hendrik Meyer |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Craig Miskell |
| SRE 8-hour APAC | Pierre Guinoiseau |
| SRE 8-hour EMEA | Alejandro Rodriguez |
PagerDuty Incidents
See the 1-week report for acknowledged PagerDuty pages
7 Day Issue Stats
- Oncall issues : 1
- Access Request : 0
- Change Issues : 8
- Incident Issues : 21
- CorrectiveAction Issues : 0
Change Issues
- 2021-09-17T15:24:25Z - GPRD: wal-g update to v1.1
- 2021-09-17T09:49:45Z - 2021-09-17: Set redis-01 to have a higher failover priority
- 2021-09-17T09:20:12Z - 2021-09-17: Deploy MR to prepare for traffic increase to canary during hard PCL
- 2021-09-16T01:15:24Z - Cleanup problematic sessions in redis
- 2021-09-15T21:21:21Z - Add memory and disk space to redis
- 2021-09-15T12:25:00Z - Enter Nexus credentials on CustomersDot staging & production
- 2021-09-15T11:41:29Z - Delete customer specific JiraConnectInstallation record
- 2021-09-14T09:16:21Z - 2021-09-14: Implement rate limits for the Files API on GitLab.com
Incident Issues
- 2021-09-20T10:25:20Z - 2021-09-20: Potentially malicious activity on gitlab pages | reliability~3760141 | ServicePages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5565
- 2021-09-20T02:35:43Z - 2021-09-20 The GitLab job "walg-basebackup" has failed. | reliability~3760142 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5562
- 2021-09-19T21:50:09Z - 2021-09-19: The grpc_requests SLI of the kas service (main stage) has an error rate violating SLO | reliability~3760141 | ServiceKAS | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5561
- 2021-09-19T02:12:57Z - 2021-09-19 The goserver SLI of the gitaly service on node file-58-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760142 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5560
- 2021-09-18T18:11:35Z - 2021-09-18: Review requested for merge request to add CNAME DNS record: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2971 | reliability~3760142 | ServiceInfrastructure | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5559
- 2021-09-18T11:06:50Z - 2021-09-18: Gitaly is down on file-43-stor-gprd.c.gitlab-production.internal | reliability~3760140 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5558
- 2021-09-18T06:12:57Z - 2021-09-18 The GitLab job "walg-basebackup" resource "walg-basebackup" has failed. | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5557
- 2021-09-16T15:52:30Z - 2021-09-16 FilesystemFullSoon alerts on runners | reliability~3760142 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5551
- 2021-09-16T05:40:28Z - 2021-09-16 The GitLab job "walg-basebackup" resource "walg-basebackup" has failed. | reliability~3760141 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5550
- 2021-09-16T03:46:37Z - 2021-09-16 - Users with private commit emails cannot create issues or MRs assigned to themselves in Canary | reliability~3760141 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5549
- 2021-09-15T16:31:33Z - 2021-09-15: Redis-persistent memory usage approaching saturation | reliability~3760140 | ServiceRedis | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5546
- 2021-09-15T16:19:01Z - 2021-09-15: GitLab.com was temporarily unavailable | reliability~3760141 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5545
- 2021-09-15T11:12:45Z - 2021-09-15: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit | reliability~3760141 | ServiceSidekiq | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5542
- 2021-09-15T00:31:22Z - 2021-09-15 The workhorse SLI of the api service violating SLO | reliability~3760141 | ServiceAPI | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5541
- 2021-09-14T20:18:54Z - 2021-09-14: Elevated error rates for GitLab.com | reliability~3760140 | ServiceAPI | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5539
- 2021-09-14T14:42:04Z - 2021-09-14: Last successful WAL-G basebackup was seen 30.086379999981986 hours ago for env gprd. | reliability~3760141 | ServicePatroni | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5537
- 2021-09-14T10:07:13Z - 2021-09-14: The goserver SLI of the gitaly service on node file-26-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760141 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5535
- 2021-09-14T04:22:09Z - 2021-09-14 Some prometheii lost track of alertmanager | reliability~3760142 | ServicePrometheus | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5532
- 2021-09-14T03:54:16Z - 2021-09-14 dev.gitlab.org automatic daily upgrade failed | reliability~3760140 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5531
- 2021-09-14T01:43:22Z - 2021-09-14 ops.gitlab.net certificate inconsistencies | reliability~3760141 | ServiceInfrastructure | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5530
- 2021-09-14T00:23:05Z - Mehdiamer | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5529
CorrectiveAction Issues
- 2021-09-20T02:58:59Z - nginx-ingress-controller scale up events result in 502's for clients
- 2021-09-16T06:39:03Z - Upgrade Prometheus to 2.30.0
- 2021-09-15T12:34:38Z - HAProxy configuration audit with GKE backends
- 2021-09-14T12:06:37Z - Manage all GCP project metadata in Terraform
- 2021-09-14T09:59:47Z - Alert on high 503 error rate
- 2021-09-14T05:07:15Z - Prometheus sometimes doesn't refresh its alertmanager IPs when they cycle
- 2021-09-13T23:46:59Z - Fix gitlab-redis-cli on redis persistent hosts
Open Issue Stats
- Oncall issues : 4
- Change issues : 3
- Incident issues : 14
- Access Request : 0
- CorrectiveAction : 216
Open Change Issues
| Created | Summary |
|---|---|
| 2021-09-17T15:24:25Z | GPRD: wal-g update to v1.1 |
| 2021-09-15T21:21:21Z | Add memory and disk space to redis |
| 2021-09-14T09:16:21Z | 2021-09-14: Implement rate limits for the Files API on GitLab.com |
Open Incident Issues
| Created | Summary |
|---|---|
| 2021-09-18T18:11:35Z | 2021-09-18: Review requested for merge request to add CNAME DNS record: https://ops.gitlab.net/gitlab-com/gitlab-com-infrastructure/-/merge_requests/2971 |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| 2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
| 2020-09-02T13:47:51Z | disable-chef-client isn't preserved over reboots |
| 2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Edited by ops-gitlab-net