OnCall report for period: 2020-03-17 - 2020-03-24
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Cameron McFarland |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Craig Furman |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **61**
Created | Summary |
---|---|
2020-03-17T06:18:51Z | [18825] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-17T09:40:07Z | [18828] Firing 1 - Alertmanager is failing sending notifications |
2020-03-17T09:40:08Z | [18829] Firing 1 - Alertmanager is failing sending notifications |
2020-03-17T15:15:07Z | [18834] Firing 1 - HPA unable to scale up |
2020-03-17T20:29:57Z | [18844] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-18T04:30:12Z | [18853] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-18T18:04:56Z | [18861] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-18T18:46:13Z | [18863] Firing 1 - Last WALE backup was seen 20m 3s ago. |
2020-03-18T21:08:33Z | [18867] This is an incident exercise. |
2020-03-18T21:08:33Z | [18868] This is an incident exercise. |
2020-03-18T21:22:08Z | [18869] Firing 1 - thanos is restarting frequently |
2020-03-19T16:13:46Z | [18886] Firing 1 - Large amount of new_note sidekiq queued jobs: 955 |
2020-03-19T18:25:21Z | [18890] Firing 2 - IncreasedErrorRateOtherBackends |
2020-03-19T18:27:29Z | [18891] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T18:27:29Z | [18892] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T18:28:36Z | [18893] Firing 48 - IncreasedBackendConnectionErrors |
2020-03-19T18:28:56Z | [18894] GitLab Appears to be throwing many 500s |
2020-03-19T18:28:57Z | [18895] GitLab Appears to be throwing many 500s |
2020-03-19T20:09:16Z | [18898] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-03-19T20:09:20Z | [18899] Firing 2 - IncreasedErrorRateOtherBackends |
2020-03-19T20:10:04Z | [18900] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-03-19T20:10:38Z | [18902] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-03-19T20:10:44Z | [18903] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-03-19T20:10:46Z | [18904] Pingdom check check:gitlab-org/gitlab-foss#1 is down |
2020-03-19T20:11:03Z | [18905] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-03-19T20:11:29Z | [18906] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T20:11:30Z | [18907] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T20:12:35Z | [18908] Firing 48 - IncreasedBackendConnectionErrors |
2020-03-19T20:12:50Z | [18909] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-19T20:39:07Z | [18913] Firing 2 - IncreasedErrorRateOtherBackends |
2020-03-19T20:39:11Z | [18915] Pingdom check check:https://gitlab.com/ is down |
2020-03-19T20:39:11Z | [18914] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-03-19T20:39:48Z | [18917] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2020-03-19T20:40:03Z | [18918] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-03-19T20:40:16Z | [18919] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-03-19T20:40:37Z | [18920] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-03-19T20:40:47Z | [18921] Pingdom check check:gitlab-org/gitlab-foss#1 is down |
2020-03-19T20:41:02Z | [18922] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-03-19T20:41:14Z | [18923] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T20:41:16Z | [18924] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-19T20:42:06Z | [18925] Firing 48 - IncreasedBackendConnectionErrors |
2020-03-19T20:43:51Z | [18926] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-19T20:45:35Z | [18930] Firing 1 - Alertmanager is failing sending notifications |
2020-03-20T00:12:51Z | [18934] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-20T01:37:14Z | [18936] Firing 1 - Chef client failures have reached critical levels |
2020-03-20T04:03:06Z | [18937] Firing 1 - Alertmanager is failing sending notifications |
2020-03-20T04:03:07Z | [18938] Firing 1 - Alertmanager is failing sending notifications |
2020-03-20T05:25:51Z | [18939] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-20T19:08:41Z | [18953] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-21T00:09:23Z | [18962] Firing 1 - 5% disk space left |
2020-03-21T03:08:56Z | [18971] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-21T03:26:09Z | [18972] Firing 1 - 5% disk space left |
2020-03-21T06:33:50Z | [18975] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-21T11:05:10Z | [18981] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-22T01:15:11Z | [18998] Firing 1 - 5% disk space left |
2020-03-22T04:38:51Z | [19002] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-22T05:33:08Z | [19005] Firing 1 - 5% disk space left |
2020-03-22T10:58:06Z | [19012] Firing 1 - Gitaly error rate is too high: 16.35 |
2020-03-22T15:47:37Z | [19025] Firing 1 - Large number of overdue pull mirror jobs |
2020-03-23T00:35:53Z | [19042] Firing 1 - 5% disk space left |
2020-03-23T04:41:23Z | [19049] Firing 1 - 5% disk space left |
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 1
- Incident Issues : 17
- CorrectiveAction Issues : 0
Change Issues
- 2020-03-19T19:39:37Z - Drain traffic away from canary - skarbek
Incident Issues
- 2020-03-24T02:41:46Z - 2020-03-24 Sidekiq not meeting latency SLOs - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1816
- 2020-03-23T17:16:37Z - 2020-03-22: UploadsRewriter security issue - cmcfarland | ~S1 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1813
- 2020-03-22T18:13:30Z - 2020-03-22: Repo mirrors are slowly being processed - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1806
- 2020-03-22T11:07:44Z - 2020-03-22: high error rate on one gitaly shard - craigf | ~S3 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1805
- 2020-03-22T05:52:53Z - Disk filling up on web-33-sv-gprd.c.gitlab-production.internal - ggillies | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1804
- 2020-03-20T18:36:22Z - 2020-03-20: LowDiskSpace on git nodes - cmcfarland | ~S4 | ServiceGit |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1803
- 2020-03-20T09:24:08Z - 2020-03-20: Registry pods ooming, not ready - craigf | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1801
- 2020-03-19T18:28:45Z - 2020-03-19: GitLab Appears to be throwing many 500s - cmcfarland | ~S1 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1798
- 2020-03-19T16:18:43Z - 2020-03-19: Sidekiq Apdex Score - cmcfarland | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1793
- 2020-03-19T10:09:51Z - 2020-03-19: redis-sidekiq single-threaded CPU saturation - unassigned | ~S4 | ServiceRedis |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1791
- 2020-03-18T21:24:17Z - 2020-03-18: thanos is restarting frequently - cmcfarland | ~S4 | ServiceThanos |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1787
- 2020-03-18T21:08:26Z - This is an incident exercise. - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1786
- 2020-03-18T10:23:07Z - 2020-03-18: pgbouncer async pool saturation high - craigf | ~S4 | ServicePgbouncer |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1784
- 2020-03-17T20:38:39Z - Postgres Replication lag is over 3 hours on archive recovery replica - cmcfarland | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1783
- 2020-03-17T14:50:29Z - 2020-03-17: A single API node has saturated puma_workers - cmcfarland | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1781
- 2020-03-17T14:00:31Z - Inconsistencies between responses returned in Grafana, Prometheus and Thanos - bjk-gitlab | ~S4 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1780
- 2020-03-17T09:07:07Z - 2020-03-17: dev.gitlab.org and ops.gitlab.net low disk(s) space - craigf | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1777
CorrectiveAction Issues
- 2020-03-18T22:21:00Z - Protect Gitaly from CPU saturation during many concurrent git-fetches - unassigned
- 2020-03-18T19:17:27Z - Consider a dedicated connection pool for deployments - unassigned
- 2020-03-18T14:50:59Z - Alert for PGBouncer pool saturation - unassigned
- 2020-03-18T09:24:33Z - Investigate why single application server nodes occasionally become saturated - craigf
Open Issue Stats
- Oncall issues : 2
- Change issues : 0
- Incident issues : 1
- Access Request : 4
- CorrectiveAction : 84
Open Change Issues
Created | Assignee | Summary |
---|---|---|
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-17T14:00:31Z | bjk-gitlab | Inconsistencies between responses returned in Grafana, Prometheus and Thanos |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant