OnCall report for period: 2020-06-09 - 2020-06-16
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alex Hanselka |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Craig Furman |
PagerDuty Incidents
* Number of incidents: **21**
Created | Summary |
---|---|
2020-06-09T07:50:50Z | [21890] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-09T08:20:50Z | [21891] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-09T09:34:21Z | [21892] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-09T18:54:53Z | [21896] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-09T22:49:37Z | [21901] Open registrations on Review apps + CI/CD has resulted in K8s service account compromise |
2020-06-10T12:20:08Z | [21906] Firing 1 - thanos is restarting frequently |
2020-06-10T14:14:21Z | [21909] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-10T16:42:54Z | [21911] Firing 1 - High number of inode usage |
2020-06-10T21:17:21Z | [21913] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-10T23:28:13Z | [21914] Firing 1 - Chef client failures have reached critical levels |
2020-06-12T07:14:50Z | [21920] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-13T23:14:41Z | [21924] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-06-13T23:25:40Z | [21925] Firing 1 - |
2020-06-13T23:37:20Z | [21927] Firing 1 - Increased Error Rate Across Fleet |
2020-06-15T11:48:06Z | [21929] Firing 1 - HPA unable to scale up |
2020-06-15T13:19:12Z | [21930] Firing 1 - Last WALE backup was seen 20m 11s ago. |
2020-06-15T13:31:30Z | [21931] Firing 1 - Less than 100% of sentinel processes running in the redis-cache cluster |
2020-06-15T14:57:12Z | [21932] Firing 1 - Last WALE backup was seen 20m 2s ago. |
2020-06-15T19:46:20Z | [21933] Firing 1 - Increased Error Rate Across Fleet |
2020-06-15T21:48:57Z | [21937] Firing 1 - Last WALE backup was seen 20m 13s ago. |
2020-06-16T05:21:09Z | [21939] Firing 1 - Large amount of Sidekiq Queued jobs |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 9
- Incident Issues : 10
- CorrectiveAction Issues : 0
Change Issues
- 2020-06-12T09:25:28Z - Repository migration on gitlab.com (nfs-file09) - unassigned
- 2020-06-12T09:25:17Z - Repository migration on gitlab.com (nfs-file08) - unassigned
- 2020-06-12T09:25:06Z - Repository migration on gitlab.com (nfs-file07) - unassigned
- 2020-06-12T09:24:55Z - Repository migration on gitlab.com (nfs-file06) - unassigned
- 2020-06-12T09:24:43Z - Repository migration on gitlab.com (nfs-file05) - unassigned
- 2020-06-11T21:39:12Z - Repository migration on gitlab.com (nfs-file04) - unassigned
- 2020-06-11T20:51:56Z - Repository migration on gitlab.com (nfs-file03) - unassigned
- 2020-06-11T20:31:04Z - Repository migration on gitlab.com (nfs-file02) - unassigned
- 2020-06-10T16:44:46Z - Repository migration on gitlab.com (nfs-file01) - unassigned
Incident Issues
- 2020-06-15T19:59:47Z - 2020-06-15 -DDoS caused spike in web error rates - unassigned | ~S3 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2278
- 2020-06-13T23:37:56Z - 06-13-2020: Error rate across the fleet - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2273
- 2020-06-11T08:09:31Z - 2020-06-11: Stale caches for issue and MR counts - craigf | ~S4 | ServiceRedis |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2263
- 2020-06-10T13:53:13Z - 2020-06-10: Elevated web latency - ahanselka | ~S2 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2259
- 2020-06-10T08:11:10Z - 2020-06-10: Searching issues through API fails with error 500 - craigf | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2257
- 2020-06-09T21:35:09Z - 2020-06-09: Repo Mirror User fallback leaking CI Job token - mdelaossa | ~S1 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2256
- 2020-06-09T12:18:11Z - 2020-06-09: Merge train isn’t succeeding for gitlab-com/runbooks - jarv | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2253
- 2020-06-09T11:27:23Z - 2020-06-09: post-deployment migration failure - nolith | ~S4 | ServiceGitLab Rails |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2251
- 2020-06-09T08:10:12Z - 2020-06-09: Delayed pull mirrors - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2250
- 2020-06-09T08:10:03Z - 2020-06-09: Delayed pull mirrors - craigf | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2249
CorrectiveAction Issues
- None
Open Issue Stats
- Oncall issues : 5
- Change issues : 8
- Incident issues : 12
- Access Request : 4
- CorrectiveAction : 96
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-06-12T09:25:28Z | unassigned | Repository migration on gitlab.com (nfs-file09) |
2020-06-12T09:25:17Z | unassigned | Repository migration on gitlab.com (nfs-file08) |
2020-06-12T09:25:06Z | unassigned | Repository migration on gitlab.com (nfs-file07) |
2020-06-12T09:24:55Z | unassigned | Repository migration on gitlab.com (nfs-file06) |
2020-06-12T09:24:43Z | unassigned | Repository migration on gitlab.com (nfs-file05) |
2020-06-11T20:51:56Z | unassigned | Repository migration on gitlab.com (nfs-file03) |
2020-06-08T20:27:22Z | nnelson | Create new gitaly storage shard node file-53-stor-gprd to replace file-42-stor-gprd in the configured rotation for storing new projects |
2020-03-26T19:16:25Z | alejandro | Rotate credentials for user gitlab-superuser |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-06-15T19:59:47Z | unassigned | 2020-06-15 -DDoS caused spike in web error rates |
2020-06-10T13:53:13Z | ahanselka | 2020-06-10: Elevated web latency |
2020-06-09T21:35:09Z | mdelaossa | 2020-06-09: Repo Mirror User fallback leaking CI Job token |
2020-06-09T11:27:23Z | nolith | 2020-06-09: post-deployment migration failure |
2020-06-08T04:08:25Z | unassigned | 2020-06-08 High rate of canary errors: DDoS |
2020-06-05T13:39:49Z | unassigned | 2020-06-05: increased error rates on the web service |
2020-06-05T07:58:05Z | unassigned | 2020-06-05: surge in authorized_project_update jobs is saturating catchall workers |
2020-06-04T03:17:59Z | cmiskell | 2020-06-04 Large load spike on API fleet causing response degradation |
2020-06-01T08:22:32Z | ahmadsherif | Client body buffering running out of space on API fleet since Saturday morning |
2020-05-30T04:52:09Z | ggillies | 2020-05-30: dev.gitlab.org is down |
2020-05-29T09:07:54Z | nolith | 2020-05-29: HTTP 401s on various components of the GitLab UI |
2020-05-29T05:21:12Z | ggillies | 2020-05-29: gitlab.com is down |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-06-15T19:36:58Z | unassigned | Import Request for (metrikus): amv3_app |
2020-06-10T19:02:48Z | unassigned | Import request (for alex-solutions/core): alex-app |
2020-05-25T05:05:45Z | albertoramos | Archived repository missing |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith