OnCall report for period: 2020-03-03 - 2020-03-10
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Ahmad Sherif |
SRE 8 Hour | Cindy Pallares |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Craig Miskell |
PagerDuty Incidents
* Number of incidents: **46**
Created | Summary |
---|---|
2020-03-03T12:24:13Z | [18110] Firing 1 - Last WALE backup was seen 17m 42s ago. |
2020-03-03T13:26:56Z | [18111] Firing 4 - PostgreSQL_ReplicaStaleXmin |
2020-03-03T18:08:15Z | [18121] Firing 1 - CPU use percent is extremely high on pubsub-sidekiq-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-03-03T22:38:31Z | [18134] Firing 1 - CPU use percent is extremely high on pubsub-workhorse-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-03-04T10:15:21Z | [18150] Firing 1 - Alertmanager is failing sending notifications |
2020-03-04T10:15:22Z | [18151] Firing 1 - Alertmanager is failing sending notifications |
2020-03-04T14:55:50Z | [18155] Firing 1 - Alertmanager is failing sending notifications |
2020-03-04T14:55:51Z | [18156] Firing 1 - Alertmanager is failing sending notifications |
2020-03-04T15:33:06Z | [18157] Firing 1 - Alertmanager is failing sending notifications |
2020-03-04T17:06:12Z | [18160] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2020-03-04T17:16:25Z | [18161] HAProxy required for accidental P1/S1 security fix commit |
2020-03-05T13:48:50Z | [18183] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-05T14:44:59Z | [18184] Firing 4 - PostgreSQL_ReplicaStaleXmin |
2020-03-05T19:45:10Z | [18195] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-05T20:18:50Z | [18198] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-05T21:01:51Z | [18200] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-05T22:01:07Z | [18204] Firing 1 - Gitaly error rate is too high: 14.40 |
2020-03-05T22:45:05Z | [18205] Firing 1 - Gitaly error rate is too high: 23.38 |
2020-03-05T22:58:05Z | [18206] Firing 1 - Gitaly error rate is too high: 13.16 |
2020-03-05T23:13:04Z | [18208] Firing 1 - Gitaly error rate is too high: 16.29 |
2020-03-07T08:19:12Z | [18279] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-03-07T08:19:25Z | [18280] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-07T09:32:13Z | [18284] Firing 1 - WAL-E replication has stopped |
2020-03-07T10:26:56Z | [18288] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-08T06:10:51Z | [18318] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-08T19:20:06Z | [18346] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T04:20:06Z | [18359] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T04:20:23Z | [18360] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T05:02:22Z | [18364] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T05:02:22Z | [18363] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T05:17:22Z | [18365] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T05:17:23Z | [18366] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T09:28:40Z | [18369] Firing 1 - Gitaly is down on file-27-stor-gprd.c.gitlab-production.internal |
2020-03-09T14:10:04Z | [18382] Firing 1 - Gitaly error rate is too high: 35.53 |
2020-03-09T16:52:51Z | [18385] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-09T17:01:21Z | [18389] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-09T17:01:22Z | [18390] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-09T17:38:58Z | [18393] Firing 1 - Chef client failures have reached critical levels |
2020-03-09T17:41:14Z | [18395] Firing 1 - Chef client failures have reached critical levels |
2020-03-09T23:25:35Z | [18410] Firing 1 - Alertmanager is failing sending notifications |
2020-03-09T23:30:35Z | [18411] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-09T23:56:20Z | [18417] Firing 1 - Alertmanager is failing sending notifications |
2020-03-10T00:02:21Z | [18419] Firing 1 - Alertmanager is failing sending notifications |
2020-03-10T00:02:35Z | [18420] Firing 1 - Alertmanager is failing sending notifications |
2020-03-10T01:41:28Z | [18425] Firing 1 - Chef client failures have reached critical levels |
2020-03-10T02:14:13Z | [18427] Firing 3 - ChefClientErrorCritical |
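
Many rows in the table above repeat the same alert with a different host or measured value. The following is a minimal sketch, not part of oncall-robot-assistant, of how the rows could be grouped into coarse alert families for review. The function names `alert_family` and `tally`, and the normalization rules (replace `*.internal` host names and standalone numbers with placeholders), are assumptions made for illustration. Alerts that were renamed during the period, such as "Alertmanager is failing sending notifications" and "AlertmanagerNotificationsFailing", would still be counted separately.

```python
import re
from collections import Counter

# One table row: "<timestamp> | [<id>] <optional 'Firing N - '><summary> |"
ROW = re.compile(r"^\S+ \| \[(\d+)\] (?:Firing \d+ - )?(.*?) \|$")


def alert_family(summary: str) -> str:
    """Collapse a summary into a coarse family (hypothetical grouping rules)."""
    # Replace internal host names so the same alert on different hosts groups together.
    summary = re.sub(r"\S+\.internal", "<host>", summary)
    # Replace standalone numbers (e.g. "14.40"); tokens like "P1" or "5m" are left alone.
    summary = re.sub(r"\b\d+(\.\d+)?\b", "<n>", summary)
    return summary


def tally(rows):
    """Count rows per alert family, skipping lines that are not incident rows."""
    counts = Counter()
    for row in rows:
        match = ROW.match(row.strip())
        if match:
            counts[alert_family(match.group(2))] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2020-03-04T10:15:21Z | [18150] Firing 1 - Alertmanager is failing sending notifications |",
        "2020-03-05T22:01:07Z | [18204] Firing 1 - Gitaly error rate is too high: 14.40 |",
        "2020-03-05T22:45:05Z | [18205] Firing 1 - Gitaly error rate is too high: 23.38 |",
    ]
    for family, count in tally(sample).most_common():
        print(count, family)
```

Feeding the full incident table to `tally` would give a per-family count that can be compared week over week.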
7 Day Issue Stats
- Oncall issues : 1
- Access Request : 0
- Change Issues : 1
- Incident Issues : 13
- CorrectiveAction Issues : 0
Change Issues
- 2020-03-05T21:23:37Z - Remove deprecated Digital Ocean instance of design.gitlab.com - devin
Incident Issues
- 2020-03-09T23:56:52Z - 2020-03-09 Alertmanager is failing sending notifications - devin | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1745
- 2020-03-09T17:09:20Z - 2020-03-09: Degraded performance for web, frontend, api - unassigned | ~S2 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1743
- 2020-03-09T09:51:08Z - 2020-03-09: File-27 was briefly inaccessible - ahmadsherif | ~S1 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1740
- 2020-03-09T04:38:09Z - 2020-03-09: Sidekiq had apdex score (latency) below SLO - craig | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1739
- 2020-03-08T19:17:41Z - 2020-03-08: Sidekiq had apdex score (latency) below SLO - unassigned | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1738
- 2020-03-08T04:48:31Z - 2020-03-08: ci-runners is not meeting its latency SLOs - craig | ~S4 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1737
- 2020-03-06T13:42:40Z - Post deployment migration failed in production due to statement timeout - rspeicher | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1735
- 2020-03-06T10:12:58Z - 2020-03-06: False alert about low Sidekiq SLO due to metrics changes - ahmadsherif | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1733
- 2020-03-05T22:28:19Z - 2020-03-05: Gitaly error rate is too high on file-45 - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1731
- 2020-03-05T20:34:08Z - 2020-03-05: Gitaly latency on file-praefect-02-stor-gprd - unassigned | ~S4 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1729
- 2020-03-05T20:34:05Z - 2020-03-05: postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1728
- 2020-03-04T21:44:02Z - 2020-03-04: Redis-cache has an apdex score below SLO - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1726
- 2020-03-03T15:07:39Z - 2020-03-03: Spikes in redis-cache latency apdex - ahmadsherif | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1722
CorrectiveAction Issues
- 2020-03-07T13:17:52Z - Sentinel was not running on redis-02 for an extended period of time - andrewn
Open Issue Stats
- Oncall issues : 3
- Change issues : 2
- Incident issues : 4
- Access Request : 5
- CorrectiveAction : 72
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-03-05T21:23:37Z | devin | Remove deprecated Digital Ocean instance of design.gitlab.com |
2019-10-15T15:13:30Z | nnelson | Migrate large projects off file-34-stor-gprd to file-44-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-09T17:09:20Z | unassigned | 2020-03-09: Degraded performance for web, frontend, api |
2020-03-09T04:38:09Z | craig | 2020-03-09: Sidekiq had apdex score (latency) below SLO |
2020-02-29T16:47:36Z | unassigned | 2020-02-29: S1/P1 security incident |
2020-02-25T20:17:07Z | unassigned | 2020-02-25: Postgres Replication lag is over 3 hours on archive recovery replica |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-01-16T06:07:03Z | aamarsanaa | Incremental rollout for the Pages new API based config source |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith