Weekly Reliability (SRE) Team Newsletter – On-call Period: 2020-12-29 - 2021-01-05
Announcements
Welcome back to everyone who had a long holiday vacation! And a huge thank you to everyone who covered on-call shifts over the holidays.
Engineering Week in Review Highlights:
- The Chief People Officer (CPO) notes that our engagement survey feedback is similar to prior years: confidence that action will be taken as a result of the survey is low. See the EVPE notes (https://docs.google.com/document/d/1qRqXBfw8HZJ-xJgvCMW44gTVqo52p6mrPU940DXFxZU/edit#) and please raise concerns in the comments of this issue or privately with your manager.
Team Updates
Core Infrastructure
Datastores
- We finished building phase I of the DB benchmarking environment, which will host a production-like DB for running different tests (load/stress, functionality, upgrades, and more). We are now waiting for the security/compliance green light to restore a Prod DB backup there and start using it.
- We worked hard together with the devs (thanks @Finotto) to mitigate some of the DB performance challenges we had during December. Two main fixes have alleviated DB load so far. We will keep working on this over this week and next to find more optimizations.
Observability
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Cindy Pallares |
| SRE 8-hour Americas | Nels Nelson |
| SRE 8-hour APAC | Graeme Gillies |
| SRE 8-hour EMEA | Michal Wasilewski |
PagerDuty Incidents
* Number of incidents: **20**
| Created | Summary |
|---|---|
| 2020-12-29T06:27:50Z | [34042] Firing 1 - Increased Error Rate Across Fleet |
| 2020-12-29T11:26:02Z | [34051] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
| 2020-12-29T11:26:04Z | [34052] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
| 2020-12-29T11:26:04Z | [34053] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
| 2020-12-29T11:26:20Z | [34054] Firing 2 - IncreasedErrorRateOtherBackends |
| 2020-12-29T11:26:35Z | [34055] Firing 1 - High Error Rate on Front End Web |
| 2020-12-29T11:27:16Z | [34056] Firing 8 - BlackboxProbeFailures |
| 2020-12-30T00:05:49Z | [34091] Firing 1 - Last successful WAL-G basebackup was seen 42.87s ago for env gprd. |
| 2020-12-30T03:07:47Z | [34095] Firing 1 - GitLab Job has failed |
| 2020-12-30T06:10:47Z | [34097] Firing 1 - Last successful WAL-G basebackup was seen 48.95s ago for env gprd. |
| 2020-12-30T09:20:08Z | [34101] Firing 1 - prometheus is unreachable |
| 2021-01-01T01:41:50Z | [34131] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T03:18:20Z | [34133] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T05:23:35Z | [34139] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T07:11:05Z | [34142] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-01T07:39:05Z | [34144] Firing 2 - IncreasedErrorRateOtherBackends |
| 2021-01-02T00:05:48Z | [34155] Firing 1 - Last successful WAL-G basebackup was seen 42.77s ago for env gprd. |
| 2021-01-02T01:26:32Z | [34157] Firing 1 - GitLab Job has failed |
| 2021-01-02T06:10:48Z | [34159] Firing 1 - Last successful WAL-G basebackup was seen 48.86s ago for env gprd. |
| 2021-01-04T14:56:15Z | [34212] Firing 1 - Alertmanager is failing sending notifications |
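
Many of the 20 pages above are repeats of the same alert. As a rough illustration of how the table could be grouped by alert name, here is a minimal Python sketch. The `alert_name` and `tally` helpers and their grouping rules are hypothetical and not part of any team tooling; the summaries are copied from the table above.

```python
import re
from collections import Counter

# Summaries copied verbatim from the PagerDuty table above (timestamps omitted).
SUMMARIES = [
    "[34042] Firing 1 - Increased Error Rate Across Fleet",
    "[34051] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down",
    "[34052] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down",
    "[34053] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down",
    "[34054] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34055] Firing 1 - High Error Rate on Front End Web",
    "[34056] Firing 8 - BlackboxProbeFailures",
    "[34091] Firing 1 - Last successful WAL-G basebackup was seen 42.87s ago for env gprd.",
    "[34095] Firing 1 - GitLab Job has failed",
    "[34097] Firing 1 - Last successful WAL-G basebackup was seen 48.95s ago for env gprd.",
    "[34101] Firing 1 - prometheus is unreachable",
    "[34131] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34133] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34139] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34142] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34144] Firing 2 - IncreasedErrorRateOtherBackends",
    "[34155] Firing 1 - Last successful WAL-G basebackup was seen 42.77s ago for env gprd.",
    "[34157] Firing 1 - GitLab Job has failed",
    "[34159] Firing 1 - Last successful WAL-G basebackup was seen 48.86s ago for env gprd.",
    "[34212] Firing 1 - Alertmanager is failing sending notifications",
]


def alert_name(summary: str) -> str:
    """Reduce a summary to a coarse alert name (hypothetical grouping rules)."""
    text = re.sub(r"^\[\d+\]\s*", "", summary)    # drop the leading incident number
    text = re.sub(r"^Firing \d+ - ", "", text)    # drop the "Firing N - " prefix
    if text.startswith("Pingdom check"):
        return "Pingdom check is down"            # collapse per-URL Pingdom checks
    # Mask the varying duration so repeated backup alerts group together.
    return re.sub(r"\d+(\.\d+)?s ago", "<N>s ago", text)


def tally(summaries):
    """Count incidents per coarse alert name."""
    return Counter(alert_name(s) for s in summaries)


if __name__ == "__main__":
    for name, count in tally(SUMMARIES).most_common():
        print(f"{count:2d}  {name}")
```

Grouping this way makes it easier to see which alerts dominated the shift, for example the repeated `IncreasedErrorRateOtherBackends` pages on 2021-01-01.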
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 0
- Incident Issues : 12
- CorrectiveAction Issues : 0
Change Issues
Incident Issues
| Created | Summary | Label | Service | Issue |
|---|---|---|---|---|
| 2021-01-04T15:42:23Z | 2021-01-04 alertmanager failing to send notifications | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3273 |
| 2021-01-04T13:00:15Z | 2021-01-04: Job logs slow to load on first view | reliability~3760142 | ServiceGitLab Rails | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3271 |
| 2021-01-03T05:50:31Z | 2021-01-03: The workhorse SLI of the web service (cny stage) has an apdex violating SLO | reliability~3760141 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3270 |
| 2021-01-02T14:55:33Z | 2021-01-02 increased error rates and latency on cny | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3269 |
| 2021-01-01T01:45:50Z | 2021-01-01: IncreasedErrorRateOtherBackends | reliability~3760140 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3268 |
| 2020-12-31T08:41:13Z | 2020-12-30: Service desk replies not sent to authors | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3267 |
| 2020-12-31T05:04:51Z | 2020-12-31: Intermittent errors on clone on a few projects | reliability~3760141 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3266 |
| 2020-12-30T03:12:58Z | 2020-12-30: WAL-G Backup failed | reliability~3760141 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3263 |
| 2020-12-29T22:38:30Z | 2020-12-29: The gitlab_zone SLI of the waf service (main stage) has an error rate violating SLO | reliability~3760142 | ServiceCloudflare | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3262 |
| 2020-12-29T17:21:44Z | 2020-12-29: The imagescaler SLI of the web service (main stage) has an error rate violating SLO | reliability~3760142 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3260 |
| 2020-12-29T11:29:44Z | 2020-12-29 errors across the fleet | reliability~3760140 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3259 |
| 2020-12-29T06:38:14Z | 2020-12-29: Increased error rate in web cny | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3257 |
CorrectiveAction Issues
Open Issue Stats
- Oncall issues : 8
- Change issues : 0
- Incident issues : 26
- Access Request : 3
- CorrectiveAction : 136
Open Change Issues
| Created | Summary |
|---|---|
Open Incident Issues
| Created | Summary |
|---|---|
| 2021-01-04T13:00:15Z | 2021-01-04: Job logs slow to load on first view |
| 2020-12-31T05:04:51Z | 2020-12-31: Intermittent errors on clone on a few projects |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
| 2020-10-27T14:20:44Z | One-Time Export for micro_x |
| 2020-09-14T18:52:09Z | PS Congregate VM for GitHost to GitLab.com Migration - Afilias |
| 2020-09-02T13:47:51Z | disable-chef-client isn't preserved over reboots |
| 2020-08-11T16:39:37Z | Investigate slow child pipeline triggering on pre.gitlab.com |
| 2020-07-28T18:19:35Z | PS Congregate VM for BitBucket Server to GitLab.com Migration |
| 2020-07-28T17:43:40Z | Project Import Request - ciorg/bridge/am-child-pool/api |
| 2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Edited by Alberto Ramos