Weekly Reliability (SRE) Team Newsletter – On-call Period: 2022-06-21 - 2022-06-28
Announcements
Engineering Week in Review Highlights:
Team Updates
On-Call During This Period
| Schedule | Username |
|---|---|
| SRE 8-hour Americas | Cameron McFarland |
| SRE 8-hour Americas | Matt Smiley |
| SRE 8-hour Americas | Marcel Chacon |
| SRE 8-hour APAC | Pierre Guinoiseau |
| SRE 8-hour EMEA | Rehab Hassanein |
| SRE 8-hour EMEA | Igor Wiedler |
PagerDuty Incidents
See the 1-week report for acknowledged PagerDuty pages (long-term trend)
Alerts Volume
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 11
- Incident Issues : 30
- CorrectiveAction Issues : 0
Change Issues
- 2022-06-27T04:38:28Z - 2022-06-27: Update remaining k8s node pools to use SSD
- 2022-06-27T02:56:00Z - 2022-06-27: No-op update min number of nodes on all node pools
- 2022-06-24T01:09:12Z - Draft: add stage label to fluentd-archiver pods
- 2022-06-23T21:07:49Z - Draft: Enable application limit for incident management alerts
- 2022-06-23T03:31:48Z - Delayed project authorization updates during bulk group share deletions
- 2022-06-22T09:38:31Z - [gstg] Clean up corrupt commit-graphs in gitlab-org/gitlab
- 2022-06-21T15:05:30Z - [gprd] Remove Omnibus gitconfig
- 2022-06-20T14:38:47Z - Rollout the "Runner separation by plan" feature flag on GitLab.com
- 2022-06-20T12:05:24Z - 2022-06-20: Lower pages rate limit to 600 req/second per domain
- 2022-06-20T11:09:28Z - 2022-06-20: Pages reduce rate limit source ip burst - prod
- 2022-06-20T11:00:52Z - 2022-06-20: Pages reduce rate limit source ip burst - non-prod
Incident Issues
- 2022-06-26T08:31:22Z - 2022-06-26: CiRunnersServiceSharedRunnerQueuesApdexSLOViolation | reliability~3760141 | Service::CI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7324
- 2022-06-24T23:08:52Z - 2022-06-24: Number of active Gitaly shards is low | reliability~3760141 | Service::Gitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7323
- 2022-06-24T21:25:17Z - 2022-06-24: patroni-ci1-09 node initializing. | reliability~3760142 | Service::Patroni | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7322
- 2022-06-23T15:38:59Z - 2022-06-23: Pages request flood | reliability~3760141 | Service::Pages | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7318
- 2022-06-23T12:55:39Z - 2022-06-23: WALGBaseBackupFailed | reliability~3760141 | Service::Patroni | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7317
- 2022-06-23T10:37:33Z - 2022-06-23: CiRunnersServiceQueuingQueriesDurationApdexSLOViolation | reliability~3760140 | Service::CI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7316
- 2022-06-23T08:44:55Z - 2022-06-23: RegistryServiceGarbagecollectorErrorSLOViolation | reliability~3760141 | Service::Container Registry | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7315
- 2022-06-23T07:49:44Z - 2022-06-23: QA jobs for canary are failing | reliability~3760140 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7314
- 2022-06-23T07:43:58Z - 2022-06-23: SidekiqServiceShardUrgentCpuBoundApdexSLOViolation | reliability~3760141 | Service::Sidekiq | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7313
- 2022-06-23T07:36:05Z - 2022-06-23: Pipeline for packager is missing | reliability~3760140 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7312
- 2022-06-23T06:54:43Z - 2022-06-23: sentry.gitlab.net is unavailable | reliability~3760141 | Service::Sentry | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7311
- 2022-06-23T06:25:11Z - 2022-06-22: Increase in error rates across all services | reliability~3760140 | Service::Patroni | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7310
- 2022-06-22T20:50:41Z - 2022-06-22: Degraded inter-service network connectivity | reliability~3760141 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7308
- 2022-06-22T20:06:32Z - 2022-06-22: Maven package uploads result in 500 errors | reliability~3760141 | Service::API | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7307
- 2022-06-22T15:55:51Z - 2022-06-22: Workhorse apdex dips in canary | reliability~3760141 | Service::Workhorse | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7305
- 2022-06-22T14:44:19Z - 2022-06-22: WebServiceWorkhorseApdexSLOViolation in canary | reliability~3760141 | Service::Web | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7304
- 2022-06-22T13:33:05Z - 2022-06-22: Elastic monitoring-cluster high disk allocation | reliability~3760142 | Service::Logging | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7302
- 2022-06-22T11:52:17Z - 2022-06-22: WebsocketsServiceLoadbalancerErrorSLOViolation | reliability~3760141 | Service::Websockets | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7301
- 2022-06-22T09:53:10Z - 2022-06-22: alertmanager webhook integration failing with timeouts | reliability~3760142 | Service::AlertManager | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7300
- 2022-06-22T06:54:38Z - 2022-06-22: Prometheus PVC approaching saturation in gprd-us-east1-c | reliability~3760141 | Service::Prometheus | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7298
- 2022-06-21T17:27:57Z - 2022-06-21: Long-lived transaction in primary db | reliability~3760142 | Service::Postgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7297
- 2022-06-21T07:17:16Z - 2022-06-21: Site outage due to CDN connectivity issues | reliability~3760139 | Service::Cloudflare | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7288
- 2022-06-21T02:29:11Z - 2022-06-21: Elevated errors from Gitaly | reliability~3760140 | Service::Gitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7287
- 2022-06-20T20:42:25Z - 2022-06-20: Bad Gitaly/Omnibus change re-deployed | reliability~3760140 | Service::Gitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7286
- 2022-06-20T17:27:29Z - 2022-06-20: Canary web error rate is elevated during deploy | reliability~3760140 | Service::Web | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7284
- 2022-06-20T15:51:53Z - 2022-06-20: CI prometheus nodes running low on OS disk space | reliability~3760142 | Service::Monitoring-Other | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7283
- 2022-06-20T15:17:22Z - 2022-06-20: The grafana_google_lb SLI of the monitoring service (main stage) has an error rate violating SLO | reliability~3760141 | Service::Monitoring-Other | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7282
- 2022-06-20T13:51:28Z - 2022-06-20: MonitoringServiceGrafanaDatasourcesErrorSLOViolation | reliability~3760142 | Service::Thanos | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7280
- 2022-06-20T13:12:45Z - 2022-06-20: LoggingServiceFluentdLogOutputErrorSLOViolation | reliability~3760141 | Service::Logging | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7279
- 2022-06-20T07:12:51Z - 2022-06-20: Disk space approaching limits on prometheus-01-gitlab-runners | reliability~3760142 | Service::CI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7274
CorrectiveAction Issues
- 2022-06-22T19:21:04Z - Resize instance types for websockets
- 2022-06-22T18:07:20Z - Add Kibana fields for Postgres autovacuum auto-analyze log messages
- 2022-06-22T13:23:04Z - Corrective action: The Postgres Autovacuum Activity (non-sampled) resource of the patroni service (main stage) has a saturation exceeding SLO and is close to its capacity limit.
- 2022-06-22T03:18:21Z - Look into isolation AuthorizedProjectWorker Sidekiq Shard
- 2022-06-21T14:18:55Z - Corrective action: WAL-G Turbo mode needs to be used on restore command to not fall further behind
- 2022-06-21T14:13:15Z - Corrective action: The Patroni snapshot script should not over-run itself
- 2022-06-20T16:24:04Z - Corrective action: The grafana_google_lb SLI of the monitoring service (main stage) has an error rate violating SLO
- 2022-06-20T08:33:15Z - Risk Analysis for Teleport/Console access directly into Gitlab's Main/CI patroni clusters
Open Issue Stats
- Oncall issues : 2
- Change issues : 9
- Incident issues : 14
- Access Request : 0
- CorrectiveAction : 98
Open Change Issues
| Created | Summary |
|---|---|
| 2022-06-27T04:38:28Z | 2022-06-27: Update remaining k8s node pools to use SSD |
| 2022-06-27T02:56:00Z | 2022-06-27: No-op update min number of nodes on all node pools |
| 2022-06-24T01:09:12Z | Draft: add stage label to fluentd-archiver pods |
| 2022-06-23T21:07:49Z | Draft: Enable application limit for incident management alerts |
| 2022-06-23T03:31:48Z | Delayed project authorization updates during bulk group share deletions |
| 2022-06-17T18:50:05Z | Setup GitLab for Jira app OAuth |
| 2022-05-23T10:43:01Z | 2022-07-02 05:00 UTC: Decompose GitLab.com's PostgreSQL Database into Main and CI |
| 2022-05-18T06:17:33Z | 2022-05-18: Enable inactive projects deletion on GitLab.com |
| 2022-04-26T22:35:59Z | Enable OSQuery on console hosts |
Open Incident Issues
| Created | Summary |
|---|---|
| 2022-06-24T23:08:52Z | 2022-06-24: Number of active Gitaly shards is low |
| 2022-06-23T08:44:55Z | 2022-06-23: RegistryServiceGarbagecollectorErrorSLOViolation |
| 2022-06-22T15:55:51Z | 2022-06-22: Workhorse apdex dips in canary |
| 2022-06-22T13:33:05Z | 2022-06-22: Elastic monitoring-cluster high disk allocation |
| 2022-06-22T09:53:10Z | 2022-06-22: alertmanager webhook integration failing with timeouts |
| 2022-06-02T11:53:21Z | 2022-06-02: Attempting to save changes to a large issue on ops.gitlab.net sometimes fails (403) |
Open Oncall Issues
| Created | Summary |
|---|---|
| 2021-09-17T19:35:34Z | Proposal: When an Incident is declared, output the latest changed feature flags into the incident issue |
| 2020-12-18T22:29:14Z | CI clones fail for repositories with a path ending in a period |
Issues for Review during Incident Review Meeting
If there are any incidents you think would be good to review, please add them to the Agenda for the next meeting.
Edited by ops-gitlab-net