OnCall report for period: 2020-02-25 - 2020-03-03
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Ben Kochie |
PagerDuty Incidents
* Number of incidents: **25**
Created | Summary |
---|---|
2020-02-25T18:59:18Z | [17888] Firing 1 - CPU use percent is extremely high on fe-17-lb-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-02-25T19:47:56Z | [17889] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-02-25T23:29:13Z | [17892] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-02-26T00:30:38Z | [17893] Firing 1 - SSL certificate for https://packages.gitlab.com expires in 23h 29m 58s |
2020-02-26T15:11:18Z | [17908] Firing 1 - CPU use percent is extremely high on fe-24-lb-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-02-26T15:16:21Z | [17909] Firing 1 - Alertmanager is failing sending notifications |
2020-02-26T15:31:23Z | [17911] Firing 1 - Alertmanager is failing sending notifications |
2020-02-26T15:36:22Z | [17912] Firing 2 - AlertmanagerNotificationsFailing |
2020-02-27T11:38:15Z | [17944] AWS Key Compromise - Coin Mining |
2020-02-28T03:11:14Z | [17979] Firing 1 - Last WALE backup was seen 18m 44s ago. |
2020-02-28T14:09:14Z | [17990] Firing 1 - CPU use percent is extremely high on fe-24-lb-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-02-29T08:14:11Z | [18006] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-02-29T09:56:00Z | [18011] Firing 1 - Connection of Redis replicas to the master is flapping |
2020-02-29T15:02:14Z | [18024] S1/P1 RCE found impacting GitLab.com |
2020-03-01T07:37:15Z | [18045] Firing 1 - CPU use percent is extremely high on influxdb-01-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-03-01T20:34:56Z | [18066] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-03-02T10:45:12Z | [18081] Firing 1 - sentry.gitlab.net is down |
2020-03-02T16:15:21Z | [18087] Firing 1 - Alertmanager is failing sending notifications |
2020-03-02T16:15:50Z | [18088] Firing 1 - Alertmanager is failing sending notifications |
2020-03-02T23:03:58Z | [18089] Firing 1 - Chef client failures have reached critical levels |
2020-03-02T23:09:30Z | [18091] Firing 1 - Chef client failures have reached critical levels |
2020-03-02T23:29:31Z | [18092] Firing 8 - ChefClientErrorCritical |
2020-03-02T23:33:58Z | [18094] Firing 25 - ChefClientErrorCritical |
2020-03-03T03:08:15Z | [18099] Firing 1 - CPU use percent is extremely high on pubsub-workhorse-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
2020-03-03T05:32:18Z | [18100] Firing 1 - CPU use percent is extremely high on pubsub-workhorse-inf-gprd.c.gitlab-production.internal for the past 2 hours. |
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 0
- Incident Issues : 7
- CorrectiveAction Issues : 0
Change Issues
Incident Issues
- 2020-02-29T16:47:36Z - 2020-02-29: S1/P1 security incident - unassigned | ~S1 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1718
- 2020-02-28T15:17:08Z - 2020-02-28: haproxy is saturating cpu on fe-[17,24] - alejandro | ~S3 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1716
- 2020-02-27T15:34:39Z - 2020-02-27 The web service, workhorse component, main stage, has an error burn-rate exceeding SLO - alejandro | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1712
- 2020-02-26T22:52:31Z - Unable to push to repositories: rpc error: code = Unknown desc = invalid correlation ID - unassigned | | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1710
- 2020-02-26T15:45:35Z - 2020-02-26 Alertmanager is failing sending notifications - alejandro | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1707
- 2020-02-25T20:17:07Z - 2020-02-25: Postgres Replication lag is over 3 hours on archive recovery replica - unassigned | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1703
- 2020-02-25T19:05:34Z - 2020-02-25: CPU use percent extremely high on fe-17-lb-gprd.c.gitlab-production.internal - alejandro | ~S3 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1702
CorrectiveAction Issues
- 2020-02-28T20:16:56Z - Document Cloud Provider Abuse Reporting Procedures - unassigned
- 2020-02-26T17:11:36Z - replay in staging performance tests that led to high cpu on redis - unassigned
- 2020-02-25T18:09:06Z - Consider using RedisInsight for better monitoring and profiling of Redis - unassigned
Open Issue Stats
- Oncall issues : 5
- Change issues : 2
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 71
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-10-16T14:37:43Z | nnelson | Migrate large projects off file-33-stor-gprd to file-43-stor-gprd |
2019-10-15T15:13:30Z | nnelson | Migrate large projects off file-34-stor-gprd to file-44-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-02-29T16:47:36Z | unassigned | 2020-02-29: S1/P1 security incident |
2020-02-25T20:17:07Z | unassigned | 2020-02-25: Postgres Replication lag is over 3 hours on archive recovery replica |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-02-13T23:56:43Z | unassigned | Import request (for red61): via-server |
2020-02-12T16:00:37Z | dawsmith | dev.gitlab.org - Admins Export |
2020-01-16T06:07:03Z | aamarsanaa | Incremental rollout for the Pages new API based config source |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant