OnCall report for period: 2020-06-02 - 2020-06-09
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Michal Wasilewski |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
PagerDuty Incidents
* Number of incidents: **21**
Created | Summary |
---|---|
2020-06-03T15:17:50Z | [21505] Firing 1 - Increased Error Rate Across Fleet |
2020-06-03T17:24:30Z | [21510] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-06-03T23:09:21Z | [21529] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-04T02:51:57Z | [21544] Firing 1 - Last WALE backup was seen 20m 0s ago. |
2020-06-04T03:16:35Z | [21548] Firing 2 - IncreasedErrorRateOtherBackends |
2020-06-04T13:03:28Z | [21586] DNS for new domain not working |
2020-06-05T06:47:23Z | [21648] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-05T07:48:08Z | [21656] Firing 1 - Large number of overdue pull mirror jobs |
2020-06-05T13:33:30Z | [21670] Firing 1 - Increased Error Rate Across Fleet |
2020-06-05T16:50:42Z | [21685] Firing 1 - Last WALE backup was seen 67d 12h 44m 2s ago. |
2020-06-06T12:00:52Z | [21745] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-06-06T12:09:52Z | [21746] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-06-08T03:53:35Z | [21848] Firing 1 - Increased Error Rate Across Fleet |
2020-06-08T03:54:05Z | [21849] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-06-08T03:55:45Z | [21851] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
2020-06-08T03:56:12Z | [21852] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-06-08T03:59:26Z | [21856] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-06-08T04:00:22Z | [21857] Firing 1 - The waf service, gitlab_zone component, main stage, has an error burn-rate exceeding SLO |
2020-06-08T04:00:40Z | [21858] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-06-08T04:00:55Z | [21860] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-06-09T00:24:40Z | [21889] Pingdom check check:https://license.gitlab.com/users/sign_in is down |
7 Day Issue Stats
- Oncall issues : 3
- Access Request : 0
- Change Issues : 4
- Incident Issues : 15
- CorrectiveAction Issues : 0
Change Issues
- 2020-06-08T22:05:46Z - Migrate large projects off file-42-stor-gprd to file-02-stor-gprd - nnelson
- 2020-06-08T20:27:22Z - Create new gitaly storage shard node file-53-stor-gprd to replace file-42-stor-gprd in the configured rotation for storing new projects - nnelson
- 2020-06-05T11:29:37Z - Repository migration on gitlab.com (nfs-file10) - glopezfernandez
- 2020-06-03T13:17:23Z - Repository migration on gitlab.com (nfs-file27) - unassigned
Incident Issues
- 2020-06-08T05:09:46Z - 2020-06-08 authorized_projects spike - unassigned | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2243
- 2020-06-08T04:08:25Z - 2020-06-08 High rate of canary errors: DDoS - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2242
- 2020-06-06T23:38:05Z - authorized_projects queuing - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2241
- 2020-06-05T13:39:49Z - increased error rates on the web service - unassigned | ~S2 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2239
- 2020-06-05T07:58:05Z - 2020-06-05: surge in authorized_project_update jobs is saturating catchall workers - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2236
- 2020-06-05T07:11:27Z - 2020-06-05 Authorized_project job spike delayed pull mirrors - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2235
- 2020-06-04T15:32:53Z - [TEST]The Sidekiq service is not meeting its latency SLOs - mwasilewski-gitlab | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2231
- 2020-06-04T13:03:27Z - 2020-06-04: DNS for new domain not working - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2230
- 2020-06-04T03:18:12Z - Large load spike on API fleet causing response degradation - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2226
- 2020-06-04T03:18:09Z - Large load spike on API fleet causing response degradation - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2225
- 2020-06-04T03:18:00Z - Large load spike on API fleet causing response degradation - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2224
- 2020-06-04T03:17:59Z - 2020-06-04 Large load spike on API fleet causing response degradation - cmiskell | ~S2 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2223
- 2020-06-04T03:02:06Z - Mtail stuck on patroni-11 - unassigned | ~S4 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2222
- 2020-06-03T23:17:28Z - 2020-06-03: Sidekiq delays - catchall fleet - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2221
- 2020-06-03T12:56:27Z - Brief web latency increase caused by a short-lived increase in cpu utilization on database nodes - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2218
CorrectiveAction Issues
- 2020-06-08T18:07:37Z - Observing/Alerting on influx of HTTP401's - unassigned
- 2020-06-04T08:47:19Z - Add ES7_URL_WITH_CREDS to runbooks and to 1pass - mwasilewski-gitlab
- 2020-06-04T08:09:24Z - Fielddata errors on pubsub-rails-inf-gprd-003051 index making rails production logs partially not searchable - unassigned
Open Issue Stats
- Oncall issues : 4
- Change issues : 4
- Incident issues : 10
- Access Request : 4
- CorrectiveAction : 99
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-06-08T22:05:46Z | nnelson | Migrate large projects off file-42-stor-gprd to file-02-stor-gprd |
2020-06-08T20:27:22Z | nnelson | Create new gitaly storage shard node file-53-stor-gprd to replace file-42-stor-gprd in the configured rotation for storing new projects |
2020-05-28T15:01:09Z | hphilipps | Patroni replica restart and Primary switchover |
2020-03-26T19:16:25Z | alejandro | Rotate credentials for user gitlab-superuser |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-06-08T04:08:25Z | unassigned | 2020-06-08 High rate of canary errors: DDoS |
2020-06-05T13:39:49Z | unassigned | increased error rates on the web service |
2020-06-05T07:58:05Z | unassigned | 2020-06-05: surge in authorized_project_update jobs is saturating catchall workers |
2020-06-04T13:03:27Z | unassigned | 2020-06-04: DNS for new domain not working |
2020-06-04T03:17:59Z | cmiskell | 2020-06-04 Large load spike on API fleet causing response degradation |
2020-06-01T08:22:32Z | ahmadsherif | Client body buffering running out of space on API fleet since Saturday morning |
2020-05-30T04:52:09Z | ggillies | 2020-05-30: dev.gitlab.org is down |
2020-05-29T09:07:54Z | nolith | 2020-05-29: HTTP 401s on various components of the GitLab UI |
2020-05-29T05:21:12Z | ggillies | 2020-05-29: gitlab.com is down |
2020-05-14T11:12:17Z | unassigned | Degraded performance on shared CI runners |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-06-08T10:34:47Z | unassigned | Import request (for strattic-com): development/strattic-code |
2020-05-25T05:05:45Z | albertoramos | Archived repository missing |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant