OnCall report for period: 2020-02-11 - 2020-02-18
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
SRE 8 Hour | Nels Nelson |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **26**
Created | Summary |
---|---|
2020-02-11T06:02:26Z | [17304] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-11T07:16:12Z | [17306] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2020-02-11T08:49:56Z | [17308] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-02-11T08:51:20Z | [17310] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-02-11T08:51:58Z | [17312] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-02-11T08:53:40Z | [17313] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-02-11T09:13:31Z | [17317] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-02-11T12:26:49Z | [17322] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-02-11T13:40:51Z | [17327] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-11T14:24:46Z | [17328] Firing 1 - Chef client failures have reached critical levels |
2020-02-11T14:47:00Z | [17330] Firing 2 - ChefClientErrorCritical |
2020-02-11T15:04:29Z | [17331] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-02-11T16:05:42Z | [17335] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-02-11T18:55:13Z | [17344] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-02-12T17:45:38Z | [17381] Firing 1 - Chef client failures have reached critical levels |
2020-02-12T21:25:06Z | [17384] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-02-12T21:45:04Z | [17385] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-02-13T14:14:58Z | [17396] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2020-02-13T15:26:38Z | [17397] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-02-15T22:52:56Z | [17455] Support is receiving multiple reports of stalled CI pipelines |
2020-02-16T01:38:43Z | [17465] Firing 1 - Last WALE backup was seen 2h 30m 0s ago. |
2020-02-16T08:49:34Z | [17481] Handover deploying a patch to gprd (sidekiq) |
2020-02-17T11:43:03Z | [17521] Firing 1 - Large amount of Sidekiq Queued jobs: 59735 |
2020-02-17T13:38:27Z | [17527] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2020-02-17T13:50:28Z | [17529] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2020-02-18T01:25:51Z | [17565] Firing 1 - Large number of overdue pull mirror jobs |
7 Day Issue Stats
- Oncall issues : 3
- Access Request : 0
- Change Issues : 0
- Incident Issues : 9
- CorrectiveAction Issues : 1
Change Issues
Incident Issues
- 2020-02-18T01:29:51Z - 2020-02-18 Large number of overdue pull mirror jobs - cmiskell | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1675
- 2020-02-18T01:22:34Z - 2020-02-18 The sidekiq service (main stage) has an apdex score (latency) below SLO - cmiskell | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1674
- 2020-02-16T01:52:31Z - 2020-02-16: Last WALE backup was seen 2h 30m 0s ago. - cmiskell | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1672
- 2020-02-15T22:44:59Z - 2020-02-15: Support is receiving multiple reports of stalled CI pipelines - nnelson | ~S2 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1671
- 2020-02-13T01:50:44Z - 2020-02-12: The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1661
- 2020-02-12T21:47:59Z - Triggered #17384: Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down - alejandro | ~S3 | ServicePraefect |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1659
- 2020-02-12T17:56:51Z - 2020-02-12: Chef client failures have reached critical levels - nnelson | ~S4 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1657
- 2020-02-11T16:43:58Z - 2020-02-11: Postgres Replication lag is over 3 hours on archive recovery replica - nnelson | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1653
- 2020-02-11T07:45:23Z - 2020-02-11: High insert rate for services table causing higher web latencies - hphilipps | ~S1 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1651
CorrectiveAction Issues
- 2020-02-13T15:54:22Z - Upgrade machine type of praefect nodes in production - alejandro
- 2020-02-12T14:41:20Z - Audit our DNS A records for outdated entries on gitlab.net - unassigned
Open Issue Stats
- Oncall issues : 12
- Change issues : 1
- Incident issues : 1
- Access Request : 5
- CorrectiveAction : 70
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-10-16T14:37:43Z | nnelson | Migrate large projects off file-33-stor-gprd to file-43-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-02-10T09:33:02Z | cmcfarland | 2020-02-10: Spawn timeouts and Gitaly errors on file-cny-01 |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-02-14T20:44:47Z | unassigned | Feature Flag: single_mr_diff_view default to true |
2020-02-13T23:56:43Z | cmcfarland | Import request (for red61): via-server |
2020-02-12T16:00:37Z | unassigned | dev.gitlab.org - Admins Export |
2020-02-03T09:38:06Z | unassigned | Research possibility of creating/using sre-oncall group within GitLab |
2020-01-30T21:26:03Z | unassigned | Update environment variables for customers.stg.gitlab.com |
2020-01-16T06:07:03Z | aamarsanaa | Incremental rollout for the Pages new API based config source |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | unassigned | cleanup registered nodes in chef |
2019-10-14T09:44:00Z | ahmadsherif | Rollout SIDEKIQ_MONITOR_WORKER=1 across the sidekiq fleet |
2019-10-08T18:52:44Z | unassigned | increase api nodes cpu utilization by adding more unicorn workers |
2019-10-08T10:43:58Z | unassigned | rails console scripts getting OOM killed on console-01-sv-gprd followed by high disk IO and VM being unresponsive |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
This issue was automatically generated using oncall-robot-assistant