OnCall report for period: 2020-05-26 - 2020-06-02
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Ahmad Sherif |
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **35**
Created | Summary |
---|---|
2020-05-26T07:48:14Z | [21084] Firing 1 - Last WALE backup was seen 20m 4s ago. |
2020-05-27T13:17:23Z | [21111] Firing 1 - The public dashboard page is down |
2020-05-27T17:03:24Z | [21118] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-27T20:34:12Z | [21122] Firing 1 - The sidekiq service (main stage) has an error-ratio exceeding SLO |
2020-05-28T08:02:42Z | [21136] Firing 1 - Last WALE backup was seen 59d 3h 56m 2s ago. |
2020-05-28T11:52:11Z | [21137] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-28T14:36:21Z | [21140] Firing 1 - Increased Error Rate Across Fleet |
2020-05-28T23:16:44Z | [21159] Firing 1 - Chef client failures have reached critical levels |
2020-05-29T05:16:35Z | [21167] Firing 1 - Increased Error Rate Across Fleet |
2020-05-29T05:16:36Z | [21168] Firing 1 - High Error Rate on Front End Web |
2020-05-29T05:18:14Z | [21169] Pingdom check check:https://gitlab.com/ is down |
2020-05-29T05:21:12Z | [21172] Firing 1 - The waf service, gitlab_zone component, main stage, has an error burn-rate exceeding SLO |
2020-05-29T08:59:56Z | [21181] Firing 2 - PrometheusManyRestarts |
2020-05-29T08:59:56Z | [21182] Firing 2 - PrometheusManyRestarts |
2020-05-29T09:07:55Z | [21183] 401s on user actions from the MR page |
2020-05-29T11:19:17Z | [21185] Firing 1 - SSL certificate for https://about-src.gitlab.com expires in 23h 29m 52s |
2020-05-29T19:19:31Z | [21205] Firing 1 - SSL certificate for https://about-src.gitlab.com expires in 15h 29m 52s |
2020-05-30T03:42:13Z | [21235] Firing 1 - dev.gitlab.org is returning errors for 10m |
2020-05-30T04:33:22Z | [21240] Firing 1 - prometheus is restarting frequently |
2020-05-30T04:34:29Z | [21241] Firing 1 - Chef client failures have reached critical levels |
2020-05-30T04:52:10Z | [21242] dev.gitlab.org is down |
2020-05-30T04:53:58Z | [21243] Firing 1 - Chef client failures have reached critical levels |
2020-05-30T07:25:53Z | [21245] Firing 1 - 5% disk space left |
2020-05-30T08:02:53Z | [21248] Firing 1 - 5% disk space left |
2020-05-30T08:41:09Z | [21249] Firing 1 - 5% disk space left |
2020-05-30T14:27:20Z | [21256] Firing 2 - IncreasedBackendConnectionErrors |
2020-05-31T07:22:53Z | [21280] Firing 1 - 5% disk space left |
2020-05-31T07:59:25Z | [21281] Firing 1 - 5% disk space left |
2020-05-31T08:36:38Z | [21284] Firing 1 - 5% disk space left |
2020-06-01T07:29:23Z | [21314] Firing 1 - 5% disk space left |
2020-06-01T08:45:38Z | [21317] Firing 1 - 5% disk space left |
2020-06-01T10:29:51Z | [21326] Firing 1 - Gitaly latency on file-praefect-01-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-06-01T12:39:36Z | [21334] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-06-01T12:39:42Z | [21335] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-06-01T19:17:51Z | [21369] Firing 1 - Large number of overdue pull mirror jobs |
7 Day Issue Stats
- Oncall issues : 1
- Access Request : 0
- Change Issues : 4
- Incident Issues : 7
- CorrectiveAction Issues : 0
Change Issues
- 2020-05-30T00:43:55Z - Repository migration on gitlab.com - unassigned
- 2020-05-29T18:29:28Z - Repository migration on gitlab.com - unassigned
- 2020-05-29T15:33:22Z - Repository migration on gitlab.com - unassigned
- 2020-05-28T15:01:09Z - Patroni replica restart and Primary switchover - hphilipps
Incident Issues
- 2020-06-01T08:22:32Z - Client body buffering running out of space on API fleet since Saturday morning - ahmadsherif | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2211
- 2020-05-30T14:45:25Z - 2020-05-30: A spike of requests to a single Pages site - ahmadsherif | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2208
- 2020-05-30T04:52:09Z - 2020-05-30: dev.gitlab.org is down - ggillies | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2207
- 2020-05-29T09:07:54Z - 401s on user actions from the MR page - skarbek | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2203
- 2020-05-29T05:21:12Z - 2020-05-29: gitlab.com is down - ggillies | ~S1 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2201
- 2020-05-27T22:38:02Z - 2020-05-27: The sidekiq service (main stage) has a apdex score - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2192
- 2020-05-27T08:57:42Z - Investigate Atlassian IPs being blocked by Cloudflare - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2191
CorrectiveAction Issues
- 2020-06-01T23:42:37Z - Enable rate limiting per IP on /users/sign_in via Cloudflare - ggillies
- 2020-06-01T22:48:23Z - Disable NGINX request buffering for artifact uploads - unassigned
- 2020-05-27T20:03:16Z - Create MR Checklist for Cloudflare Terraform changes - unassigned
- 2020-05-27T19:56:42Z - Improvements to Cloudflare Terraform - unassigned
Open Issue Stats
- Oncall issues : 4
- Change issues : 2
- Incident issues : 6
- Access Request : 4
- CorrectiveAction : 97
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-05-28T15:01:09Z | hphilipps | Patroni replica restart and Primary switchover |
2020-03-26T19:16:25Z | nnelson | Rotate credentials for user gitlab-superuser |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-06-01T08:22:32Z | ahmadsherif | Client body buffering running out of space on API fleet since Saturday morning |
2020-05-30T04:52:09Z | ggillies | 2020-05-30: dev.gitlab.org is down |
2020-05-29T09:07:54Z | skarbek | 401s on user actions from the MR page |
2020-05-29T05:21:12Z | ggillies | 2020-05-29: gitlab.com is down |
2020-05-27T08:57:42Z | unassigned | Investigate Atlassian IPs being blocked by Cloudflare |
2020-05-14T11:12:17Z | unassigned | Degraded performance on shared CI runners |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-05-28T17:17:44Z | unassigned | Import request (for harvie-farm): harvie |
2020-05-25T05:05:45Z | albertoramos | Archived repository missing |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant