OnCall report for period: 2020-03-31 - 2020-04-07
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Ahmad Sherif |
SRE 8 Hour | Alex Hanselka |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Hendrik Meyer |
PagerDuty Incidents
* Number of incidents: **49**
Created | Summary |
---|---|
2020-03-31T07:26:27Z | [19349] Firing 1 - Last successful WALE basebackup was seen 48.08388613707489 hours ago. |
2020-03-31T10:51:31Z | [19352] Firing 1 - Failed to collect Redis metrics Check the status of redis on redis-cache-01-db-gprd.c.gitlab-production.internal:9121 with gitlab-ctl status . |
2020-03-31T14:12:30Z | [19358] Firing 1 - Connection of Redis replicas to the master is flapping |
2020-03-31T15:39:01Z | [19362] Firing 1 - Failed to collect Redis metrics Check the status of redis on redis-cache-01-db-gprd.c.gitlab-production.internal:9121 with gitlab-ctl status . |
2020-03-31T19:18:58Z | [19368] Firing 1 - Chef client failures have reached critical levels |
2020-03-31T19:46:43Z | [19370] Firing 1 - Chef client failures have reached critical levels |
2020-04-01T11:37:51Z | [19401] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-04-01T12:09:58Z | [19403] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-01T16:17:33Z | [19414] Firing 1 - Large amount of Sidekiq Queued jobs: 305768 |
2020-04-01T21:07:58Z | [19435] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-01T21:41:55Z | [19439] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-01T22:36:06Z | [19442] Firing 2 - IncreasedErrorRateOtherBackends |
2020-04-01T23:08:05Z | [19444] Firing 2 - IncreasedErrorRateOtherBackends |
2020-04-02T00:09:58Z | [19448] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-02T03:48:19Z | [19455] Firing 1 - Large amount of Sidekiq Queued jobs: 51211 |
2020-04-02T04:32:29Z | [19456] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-02T06:12:56Z | [19457] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-02T08:54:19Z | [19460] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-02T09:50:45Z | [19463] Firing 1 - Connection of Redis replicas to the master is flapping |
2020-04-02T10:07:45Z | [19464] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-02T10:08:30Z | [19465] Firing 1 - Redis cluster redis is missing instances |
2020-04-02T12:10:00Z | [19467] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-02T15:11:58Z | [19474] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-02T21:09:58Z | [19487] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-02T21:13:58Z | [19488] Firing 1 - Chef client failures have reached critical levels |
2020-04-03T00:09:58Z | [19493] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-03T01:44:58Z | [19497] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-03T04:15:58Z | [19501] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-03T12:09:58Z | [19510] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-03T14:09:58Z | [19513] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-03T21:08:58Z | [19525] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-04T00:09:59Z | [19532] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-04T05:00:31Z | [19540] Firing 1 - Large amount of Sidekiq Queued jobs: 53900 |
2020-04-04T05:33:58Z | [19542] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-04T12:08:58Z | [19549] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-04T21:11:58Z | [19561] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-05T00:09:58Z | [19569] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-05T04:16:58Z | [19576] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-05T12:08:58Z | [19586] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-05T19:15:37Z | [19602] Firing 1 - Increased Error Rate Across Fleet |
2020-04-05T21:09:58Z | [19610] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-06T00:09:58Z | [19616] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-06T01:53:46Z | [19620] Firing 1 - Large amount of Sidekiq Queued jobs: 62901 |
2020-04-06T12:05:58Z | [19634] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-06T13:36:07Z | [19636] Firing 1 - HAProxy process high CPU usage on fe-registry-01-lb-gprd.c.gitlab-production.internal |
2020-04-06T14:20:58Z | [19640] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-06T14:38:25Z | [19641] Firing 1 - Gitaly: two versions of Gitaly have been running alongside one another in production for more than 30 minutes |
2020-04-06T21:08:58Z | [19656] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-04-07T00:10:00Z | [19659] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 1
- Incident Issues : 9
- CorrectiveAction Issues : 0
Change Issues
- 2020-04-03T21:50:02Z - Create new gitaly storage shard node file-48-stor-gprd to replace file-46-stor-gprd in the configured rotation for storing new projects - nnelson
Incident Issues
- 2020-04-06T10:06:24Z - 2020-04-05: Cloudflare exporter is throwing errors - craigf | ~S4 | ServiceCloudflare |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1904
- 2020-04-06T01:57:30Z - Sidekiq SLO - authorized_projects - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1903
- 2020-04-05T19:34:15Z - Increased Error Rate Across Web Fleet - ahanselka | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1901
- 2020-04-05T00:47:23Z - 2020-04-05 Web cny stage error rate SLO - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1900
- 2020-04-04T05:01:51Z - Sidekiq SLO Alert - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1899
- 2020-04-03T00:22:25Z - Sidekiq SLO alert - unassigned | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1891
- 2020-04-01T21:07:40Z - 2020-04-01 Security S1 / P1 from hackerone on artifacts endpoint - ahanselka | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1885
- 2020-04-01T16:31:54Z - 2020-04-01: Large amount of Sidekiq Queued jobs - ahanselka | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1884
- 2020-04-01T16:31:49Z - 2020-04-01: Large amount of Sidekiq Queued jobs - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1883
CorrectiveAction Issues
- 2020-04-06T11:04:52Z - Cloudflare: Remove HAProxy keepalive workaround - unassigned
- 2020-04-01T19:44:10Z - Support locally disabling chef-client on a host - msmiley
- 2020-04-01T15:01:36Z - Get Patroni hosts back to running with correct gitlab-replicator and chef running - unassigned
Open Issue Stats
- Oncall issues : 8
- Change issues : 2
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 85
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-04-03T21:50:02Z | nnelson | Create new gitaly storage shard node file-48-stor-gprd to replace file-46-stor-gprd in the configured rotation for storing new projects |
2020-01-14T23:08:17Z | nnelson | Migrate large projects off file-35-stor-gprd to file-45-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-22T05:52:53Z | ggillies | Disk filling up on web-33-sv-gprd.c.gitlab-production.internal |
2020-03-17T14:00:31Z | AnthonySandoval | Inconsistencies between responses returned in Grafana, Prometheus and Thanos |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-04-01T00:18:02Z | unassigned | Customer pipeline throwing error: broken pipe starting saturday. |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-30T09:26:36Z | hphilipps | RCA: 2020-03-30 Database failover and loss of sync to replicas |
2020-03-26T19:51:21Z | craig | Fix access scopes on postgres-dr-delayed-01-db-gprd |
2020-03-24T17:05:33Z | unassigned | Project deletion required |
2020-03-23T23:43:57Z | unassigned | Manually remove project |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant