OnCall report for period: 2020-02-04 - 2020-02-11
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alex Hanselka |
SRE 8 Hour | Cindy Pallares |
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Craig Furman |
SRE 8 Hour | Ben Kochie |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **41**
Created | Summary |
---|---|
2020-02-04T10:32:59Z | [16976] Firing 1 - The sidekiq service is less available than normal |
2020-02-04T14:18:51Z | [16980] Firing 1 - Increased Error Rate Across Fleet |
2020-02-04T20:46:43Z | [16998] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-02-04T20:51:40Z | [16999] Firing 2 - IncreasedBackendConnectionErrors |
2020-02-04T20:54:58Z | [17002] Firing 1 - 5xx Error Rate on Docker Registry Load Balancers |
2020-02-04T20:55:05Z | [17003] Firing 1 - GitLab Registry is down for 1 minute |
2020-02-04T20:56:36Z | [17005] Firing 1 - The registry service (main stage) has a apdex score (latency) below SLO |
2020-02-04T21:26:36Z | [17008] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-02-05T11:22:35Z | [17020] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-02-05T11:22:35Z | [17021] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-02-05T18:38:31Z | [17038] Firing 1 - The registry service (cny stage) has a apdex score (latency) below SLO |
2020-02-05T18:38:31Z | [17037] Firing 1 - Bad canary? The cny stage of the registry service has a apdex score (latency) below SLO, but the main stage does not. |
2020-02-06T02:11:21Z | [17059] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-06T08:39:35Z | [17063] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-06T11:14:55Z | [17067] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-06T12:37:39Z | [17071] Firing 1 - The registry service (cny stage) has a apdex score (latency) below SLO |
2020-02-06T12:37:39Z | [17072] Firing 1 - Bad canary? The cny stage of the registry service has a apdex score (latency) below SLO, but the main stage does not. |
2020-02-06T14:10:28Z | [17075] Firing 1 - The registry service (cny stage) has a apdex score (latency) below SLO |
2020-02-06T14:10:28Z | [17076] Firing 1 - Bad canary? The cny stage of the registry service has a apdex score (latency) below SLO, but the main stage does not. |
2020-02-06T20:42:28Z | [17094] Firing 1 - The registry service (cny stage) has a apdex score (latency) below SLO |
2020-02-06T20:42:29Z | [17095] Firing 1 - Bad canary? The cny stage of the registry service has a apdex score (latency) below SLO, but the main stage does not. |
2020-02-07T11:55:30Z | [17124] Firing 1 - The registry service (cny stage) has a apdex score (latency) below SLO |
2020-02-07T11:55:30Z | [17125] Firing 1 - Bad canary? The cny stage of the registry service has a apdex score (latency) below SLO, but the main stage does not. |
2020-02-07T17:02:50Z | [17137] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T17:28:21Z | [17141] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T17:45:35Z | [17145] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T18:21:23Z | [17146] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T18:32:06Z | [17147] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T18:43:06Z | [17148] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T19:09:52Z | [17150] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-07T19:13:09Z | [17151] Pingdom check check:https://forum.gitlab.com/srv/status is down |
2020-02-07T21:02:50Z | [17161] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-02-10T13:16:14Z | [17280] Firing 1 - Postgres is generating XLOG too fast, expect this to cause replication lag |
2020-02-10T13:27:29Z | [17281] Firing 1 - Chef client failures have reached critical levels |
2020-02-10T14:04:26Z | [17282] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-02-10T14:06:10Z | [17283] Firing 1 - 5% disk space left |
2020-02-10T14:12:31Z | [17284] Firing 1 - Chef client failures have reached critical levels |
2020-02-10T15:54:24Z | [17288] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-02-10T16:14:26Z | [17289] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-02-10T18:43:14Z | [17293] Pingdom check check:https://forum.gitlab.com/srv/status is down |
2020-02-10T18:51:17Z | [17294] Pingdom check check:https://version.gitlab.com/ is down |
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 1
- Incident Issues : 10
- CorrectiveAction Issues : 2
Change Issues
- 2020-02-10T12:27:59Z - Remove UserMentions::CreateResourceUserMention BackgroundMigrationWorker - reprazent
Incident Issues
- 2020-02-10T16:26:31Z - 2020-02-10: User exporting many projects causing SLO alerts - unassigned | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1647
- 2020-02-10T14:08:25Z - 2020-02-10: Almost full disk on api-01-sv-gprd.c.gitlab-production.internal - craigf | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1646
- 2020-02-10T13:48:29Z - 2020-02-10: Sidekiq asap workers saturated due to spam - craigf | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1645
- 2020-02-10T12:42:22Z - 2020-02-10: td-agent won't start up again after restart - jarv | ~S4 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1643
- 2020-02-10T11:18:33Z - 2020-02-10: Unread incoming emails are piling up - craigf | ~S4 | ServiceMailroom |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1641
- 2020-02-10T09:33:02Z - 2020-02-10: Spawn timeouts and Gitaly errors on file-cny-01 - craigf | ~S4 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1638
- 2020-02-07T12:05:51Z - 2020-02-07: CPU saturation on git-13 - craigf | ~S4 | ServiceGit |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1637
- 2020-02-06T15:11:34Z - 2020-02-06: api-05 CPU saturation - ahanselka | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1630
- 2020-02-06T10:36:52Z - 2020-02-06: Slightly elevated error rate: web - craigf | ~S4 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1628
- 2020-02-04T20:59:46Z - 2020-02-04 - GitLab.com Registry Down - ahanselka | ~S1 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1624
CorrectiveAction Issues
- 2020-02-10T13:42:25Z - Fix gitlab-fluentd cookbook so that it installs required plugin dependencies - jarv
- 2020-02-06T22:41:34Z - Regenerate docker-machine certs during gitlab-runner upgrades - msmiley
- 2020-02-05T15:42:32Z - Update documentation for on call needing to page IMOC and CMOC - cindy
Open Issue Stats
- Oncall issues : 12
- Change issues : 1
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 70
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-10-16T14:37:43Z | nnelson | Migrate large projects off file-33-stor-gprd to file-43-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-02-10T09:33:02Z | craigf | 2020-02-10: Spawn timeouts and Gitaly errors on file-cny-01 |
2020-01-30T22:25:36Z | msmiley | Disk writes stalled on Gitaly node file-38 |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-02-03T09:38:06Z | unassigned | Research possibility of creating/using sre-oncall group within GitLab |
2020-01-30T21:26:03Z | unassigned | Update environment variables for customers.stg.gitlab.com |
2020-01-27T16:16:39Z | alejandro | Delete Pipeline on GitLab project |
2020-01-24T01:26:38Z | aamarsanaa | Automate oncall handover issue creation |
2020-01-16T06:07:03Z | aamarsanaa | Incremental rollout for the Pages new API based config source |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2020-01-03T14:53:08Z | hphilipps | Incident Review: 2019-12-27 Spammers causing large mailers Sidekiq queue |
2019-10-23T13:05:14Z | unassigned | cleanup registered nodes in chef |
2019-10-14T09:44:00Z | ahmadsherif | Rollout SIDEKIQ_MONITOR_WORKER=1 across the sidekiq fleet |
2019-10-08T18:52:44Z | unassigned | increase api nodes cpu utilization by adding more unicorn workers |
2019-10-08T10:43:58Z | unassigned | rails console scripts getting OOM killed on console-01-sv-gprd followed by high disk IO and VM being unresponsive |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
This issue was automatically generated using oncall-robot-assistant