OnCall report for period: 2020-04-07 - 2020-04-14
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Ben Kochie |
PagerDuty Incidents
* Number of incidents: **52**
Created | Summary |
---|---|
2020-04-07T08:47:29Z | [19680] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-07T11:20:30Z | [19682] Firing 1 - Chef client failures have reached critical levels |
2020-04-07T12:08:58Z | [19685] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-07T12:14:28Z | [19686] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-07T15:12:58Z | [19691] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-07T19:20:43Z | [19703] Firing 1 - Chef client failures have reached critical levels |
2020-04-07T19:30:25Z | [19705] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-07T20:54:56Z | [19707] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-07T21:09:58Z | [19709] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-07T22:17:10Z | [19713] Firing 1 - Gitaly: multiple versions of Gitaly are currently running in production |
2020-04-07T22:54:26Z | [19715] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-08T00:09:58Z | [19718] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-08T06:54:42Z | [19729] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-08T12:08:58Z | [19741] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-08T13:47:12Z | [19744] Firing 1 - PostgreSQL dead tuples is too large |
2020-04-08T13:59:36Z | [19745] Firing 1 - Increased Error Rate Across Fleet |
2020-04-08T14:01:58Z | [19747] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-08T14:05:50Z | [19748] Firing 1 - Increased Error Rate Across Fleet |
2020-04-08T14:05:53Z | [19749] Firing 3 - ProcessCommitWorkersTooHigh |
2020-04-08T14:18:50Z | [19751] Firing 2 - AlertmanagerNotificationsFailing |
2020-04-08T14:19:05Z | [19752] Firing 2 - AlertmanagerNotificationsFailing |
2020-04-08T14:19:20Z | [19753] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-08T14:29:16Z | [19755] Firing 1 - Large amount of Sidekiq Queued jobs: 273126 |
2020-04-08T14:53:43Z | [19756] Firing 1 - Chef client failures have reached critical levels |
2020-04-08T14:54:55Z | [19757] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-08T14:55:24Z | [19758] Firing 1 - HAProxy process high CPU usage on fe-registry-02-lb-gprd.c.gitlab-production.internal |
2020-04-08T14:58:58Z | [19760] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-08T15:02:58Z | [19761] GitLab.com Incident needs further communication |
2020-04-08T21:01:40Z | [19779] Firing 1 - Gitaly: multiple versions of Gitaly are currently running in production |
2020-04-08T21:44:56Z | [19781] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-08T23:05:59Z | [19786] Practice incident 2020-04-08 |
2020-04-08T23:06:00Z | [19787] Practice incident 2020-04-08 |
2020-04-08T23:08:42Z | [19788] Firing 1 - Last WALE backup was seen 20m 7s ago. |
2020-04-09T00:09:58Z | [19789] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-09T03:25:43Z | [19794] Firing 1 - Chef client failures have reached critical levels |
2020-04-09T12:08:58Z | [19815] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-09T13:05:40Z | [19817] Firing 1 - Gitaly: multiple versions of Gitaly are currently running in production |
2020-04-09T13:49:55Z | [19818] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-09T14:27:43Z | [19820] Firing 1 - sentry.gitlab.net is down |
2020-04-09T14:45:25Z | [19824] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-09T15:10:58Z | [19826] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-09T19:37:50Z | [19832] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-04-09T21:09:58Z | [19835] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-09T22:11:58Z | [19837] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-10T00:09:58Z | [19838] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-10T04:15:58Z | [19840] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-10T15:29:21Z | [19844] Firing 1 - Increased Error Rate Across Fleet |
2020-04-10T23:39:31Z | [19847] Firing 1 - Redis Switch Master |
2020-04-11T18:41:42Z | [19854] Firing 1 - Last WALE backup was seen 20m 6s ago. |
2020-04-13T13:24:58Z | [19876] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-13T16:55:15Z | [19878] Firing 1 - Last WALE backup was seen 20m 0s ago. |
2020-04-13T21:02:05Z | [19880] Firing 8 - GitalyInstanceErrorRateTooHigh |
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 1
- Incident Issues : 6
- CorrectiveAction Issues : 0
Change Issues
- 2020-04-08T13:47:02Z - Create new gitaly storage shard node file-49-stor-gprd to replace file-43-stor-gprd in the configured rotation for storing new projects - nnelson
Incident Issues
- 2020-04-08T23:05:31Z - Practice incident 2020-04-08 - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1923
- 2020-04-08T14:11:31Z - Increased error rates on gitlab.com - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1922
- 2020-04-08T14:11:31Z - Increased error rates on gitlab.com - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1921
- 2020-04-08T14:10:46Z - Service degradation - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1920
- 2020-04-08T14:10:26Z - Google Cloud Platform Issue - IAM Failures - AnthonySandoval | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1919
- 2020-04-07T16:22:07Z - 2020-04-07: chef.gitlab.com chef-client failures - craig | ~S4 | ServiceInfrastructure |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1911
CorrectiveAction Issues
Open Issue Stats
- Oncall issues : 7
- Change issues : 0
- Incident issues : 1
- Access Request : 5
- CorrectiveAction : 85
Open Change Issues
Created | Assignee | Summary |
---|---|---|
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-17T14:00:31Z | AnthonySandoval | Inconsistencies between responses returned in Grafana, Prometheus and Thanos |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-04-03T13:33:38Z | unassigned | GitLab.com project will not export |
2020-04-03T12:59:54Z | unassigned | Help exporting GitLab.com project |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-30T09:26:36Z | hphilipps | RCA: 2020-03-30 Database failover and loss of sync to replicas |
2020-03-23T23:43:57Z | unassigned | Manually remove project |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant