Weekly Reliability (SRE) Team Newsletter – On-call Period: 2021-12-21 - 2021-12-28
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
Schedule | Username |
---|---|
SRE 8-hour Americas | Hendrik Meyer |
SRE 8-hour Americas | Marcel Chacon |
SRE 8-hour APAC | Pierre Guinoiseau |
SRE 8-hour EMEA | Ahmad Sherif |
SRE 8-hour EMEA | Michal Wasilewski |
PagerDuty Incidents
See the 1 week report for acknowledged PD pages (long-term trend)
Alerts Volume
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 4
- Incident Issues : 13
- CorrectiveAction Issues : 0
Change Issues
- 2021-12-23T14:34:40Z - 2021-12-24: [gprd] Remove GITLAB_USE_REDIS_SESSIONS_STORE for gprd
- 2021-12-23T13:07:02Z - 2021-12-23: Only send traffic to the main stage when canary=false
- 2021-12-23T04:56:07Z - 2021-12-24: Add deployment labels to fluentd-archiver
- 2021-12-20T21:33:54Z - Update Consul pgbouncer-healthcheck query
Incident Issues
- 2021-12-25T06:23:20Z - 2021-12-25 PostgreSQL errors: could not open relation with OID 1375605314 | reliability~3760142 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6120
- 2021-12-25T03:23:51Z - 2021-12-25 The server SLI of the web-pages service in region us-east1 has an error rate violating SLO | reliability~3760141 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6119
- 2021-12-24T23:29:48Z - 2021-12-25 - The goserver SLI of the gitaly service on node file-37-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760141 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6118
- 2021-12-24T11:53:06Z - 2021-12-24 The goserver SLI of the gitaly service on node file-40-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760141 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6117
- 2021-12-23T08:53:35Z - 2021-12-23 The goserver SLI of the gitaly service on node file-40-stor-gprd.c.gitlab-production.internal has an apdex violating SLO | reliability~3760141 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6112
- 2021-12-23T08:05:01Z - 2021-12-23 The proxy SLI of the praefect service (main stage) has an apdex violating SLO | reliability~3760141 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6111
- 2021-12-22T21:29:40Z - 2021-12-22 Error pushing Windows container registry images | reliability~3760141 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6109
- 2021-12-22T18:26:30Z - 2021-12-22 QA in Staging seeing high rate of HTTP429s | reliability~3760140 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6108
- 2021-12-22T13:38:07Z - 2021-12-22 Slack api issues | reliability~3760141 | ServiceAlertManager |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6106
- 2021-12-22T09:46:00Z - 2021-12-22: auto_deploy broken because of missing gitlab-elasticsearch-indexer tag | reliability~3760140 | ServiceSearch |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6104
- 2021-12-21T14:21:06Z - 2021-12-21 The PGBouncer Client Connections per Process (Replicas) resource of the patroni service (main stage), component has a saturation exceeding SLO and is close to its capacity limit. | reliability~3760141 | ServicePgbouncer |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6102
- 2021-12-20T14:43:44Z - 2021-12-20: The loadbalancer_https SLI of the web-pages service (main stage) has an error rate violating SLO | reliability~3760141 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6099
- 2021-12-20T11:36:37Z - 2021-12-20: 5 Elasticsearch shards unavailable | reliability~3760141 | ServiceSearch |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6097
CorrectiveAction Issues
- 2021-12-22T16:36:06Z - Migrate dev.gitlab.org from Azure to GCP
- 2021-12-22T14:24:39Z - Incident issue creation failed due to slack api errors
Open Issue Stats
- Oncall issues : 3
- Change issues : 1
- Incident issues : 32
- Access Request : 0
- CorrectiveAction : 118
Open Change Issues
Created | Summary |
---|---|
2021-12-16T10:45:51Z | Enable gitlab-sshd on gstg |
Open Incident Issues
Created | Summary |
---|---|
2021-12-25T06:23:20Z | 2021-12-25 PostgreSQL errors: could not open relation with OID 1375605314 |
2021-12-24T11:53:06Z | 2021-12-24 The goserver SLI of the gitaly service on node file-40-stor-gprd.c.gitlab-production.internal has an apdex violating SLO |
2021-12-23T08:05:01Z | 2021-12-23 The proxy SLI of the praefect service (main stage) has an apdex violating SLO |
2021-12-22T21:29:40Z | 2021-12-22 Error pushing Windows container registry images |
2021-12-16T18:26:31Z | 2021-12-16: Prometheus notification endpoint timeouts |
2021-10-06T00:51:01Z | 2021-10-06 Some interrupted Sidekiq jobs going missing |
Open Oncall Issues
Created | Summary |
---|---|
2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
2020-03-30T13:38:11Z | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by Kennedy Wanyangu