OnCall report for period: 2019-11-26 - 2019-12-03
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alejandro Rodriguez |
SRE 8 Hour | Alex Hanselka |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Nels Nelson |
SRE 8 Hour | Ben Kochie |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **78**
Created | Summary |
---|---|
2019-11-26T09:07:14Z | [16071] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-26T14:57:56Z | [16076] Firing 1 - The mailroom service is less available than normal |
2019-11-26T15:08:41Z | [16077] Firing 1 - The mailroom service is less available than normal |
2019-11-26T15:13:56Z | [16078] Firing 1 - The mailroom service is less available than normal |
2019-11-26T16:52:04Z | [16079] Firing 1 - Gitaly error rate is too high: 7.20 |
2019-11-26T16:52:05Z | [16080] Firing 2 - GitalyErrorRateTooHigh |
2019-11-26T22:52:14Z | [16082] Pingdom check check:https://version.gitlab.com/ is down |
2019-11-27T00:21:59Z | [16083] Firing 1 - Postgres seems to be processing very few transactions |
2019-11-27T10:38:11Z | [16086] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2019-11-27T11:01:51Z | [16087] Firing 1 - Large number of overdue pull mirror jobs: 8228 |
2019-11-27T12:25:42Z | [16092] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2019-11-27T12:56:29Z | [16095] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2019-11-27T15:08:22Z | [16107] Firing 1 - Increased Error Rate Across Fleet |
2019-11-27T15:14:07Z | [16108] Firing 1 - High Rails Error Rate on Front End |
2019-11-27T15:15:37Z | [16109] Firing 1 - Large number of overdue pull mirror jobs: 6749 |
2019-11-27T15:25:08Z | [16112] Firing 1 - High Rails Error Rate on Front End |
2019-11-27T15:29:36Z | [16113] Firing 2 - IncreasedErrorRateOtherBackends |
2019-11-27T15:56:53Z | [16116] Firing 2 - IncreasedErrorRateOtherBackends |
2019-11-27T20:29:27Z | [16118] Firing 1 - Unused Replication Slots for patroni-06-db-gprd.c.gitlab-production.internal |
2019-11-27T20:46:11Z | [16119] Firing 1 - Postgres seems to be processing very few transactions |
2019-11-28T09:43:26Z | [16120] Firing 1 - Gitaly latency on file-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2019-11-28T10:48:55Z | [16121] Web latency spikes |
2019-11-28T11:02:33Z | [16122] Pingdom check check:https://gitlab.com/projects/new is down |
2019-11-28T11:03:06Z | [16123] Pingdom check check:https://gitlab.com/ is down |
2019-11-28T11:03:11Z | [16124] Firing 1 - High Error Rate on Front End Web |
2019-11-28T11:03:11Z | [16125] Firing 3 - IncreasedErrorRateOtherBackends |
2019-11-28T11:03:15Z | [16126] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2019-11-28T11:03:47Z | [16127] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2019-11-28T11:04:06Z | [16129] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2019-11-28T11:05:00Z | [16130] Firing 1 - GitLab.com is down for 2 minutes |
2019-11-28T11:05:05Z | [16131] Firing 1 - GitLab.com is down for 2 minutes |
2019-11-28T11:05:39Z | [16132] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2019-11-28T11:05:47Z | [16133] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-11-28T11:05:52Z | [16134] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues/1 is down |
2019-11-28T11:05:54Z | [16135] Firing 95 - IncreasedBackendConnectionErrors |
2019-11-28T11:06:01Z | [16136] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2019-11-28T11:07:51Z | [16137] Firing 1 - Increased Error Rate Across Fleet |
2019-11-28T11:09:23Z | [16144] Firing 7 - IncreasedServerResponseErrors |
2019-11-28T11:11:06Z | [16145] Firing 1 - Gitaly error rate is too high: 2.89 |
2019-11-28T11:16:16Z | [16146] Firing 5 - PostgreSQL_CommitRateTooLow |
2019-11-28T11:18:51Z | [16147] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-11-28T11:19:22Z | [16148] Firing 1 - Increased Error Rate Across Fleet |
2019-11-28T11:20:07Z | [16149] Firing 2 - service_availability_out_of_bounds_lower_5m |
2019-11-28T11:30:52Z | [16151] Firing 1 - High Error Rate on Front End Web |
2019-11-28T11:34:19Z | [16152] Firing 1 - sentry.gitlab.net is down |
2019-11-28T11:36:39Z | [16153] Firing 1 - Large number of overdue pull mirror jobs: 12267.5 |
2019-11-28T11:39:23Z | [16154] Firing 1 - High Rails Error Rate on Front End |
2019-11-28T11:45:39Z | [16155] Firing 1 - staging.GitLab.com is down for 30 minutes |
2019-11-28T11:45:51Z | [16156] Firing 1 - staging.GitLab.com is down for 30 minutes |
2019-11-28T11:51:41Z | [16159] Firing 1 - High Error Rate on Front End Web |
2019-11-28T11:56:14Z | [16161] Firing 6 - PostgreSQL_CommitRateTooLow |
2019-11-28T12:00:54Z | [16162] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-11-28T12:01:40Z | [16163] Firing 1 - High Error Rate on Front End Web |
2019-11-28T12:02:43Z | [16164] Firing 1 - Increased Error Rate Across Fleet |
2019-11-28T12:03:18Z | [16165] Pingdom check check:https://gitlab-examples.gitlab.io/ is down |
2019-11-28T12:05:00Z | [16166] Firing 2 - GitLabPagesProdFeIpPossibleChange |
2019-11-28T12:07:37Z | [16167] Firing 1 - High Error Rate on Front End Web |
2019-11-28T12:07:52Z | [16168] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-11-28T12:11:15Z | [16170] Firing 2 - PostgreSQL_CommitRateTooLow |
2019-11-28T12:13:33Z | [16171] Firing 1 - Chef client failures have reached critical levels |
2019-11-28T12:28:07Z | [16173] Firing 3 - IncreasedErrorRateOtherBackends |
2019-11-28T12:28:07Z | [16174] Firing 1 - High Error Rate on Front End Web |
2019-11-28T12:31:53Z | [16175] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2019-11-28T12:32:08Z | [16176] Firing 1 - Increased Error Rate Across Fleet |
2019-11-28T12:32:40Z | [16179] Firing 1 - High Error Rate on Front End Web |
2019-11-28T12:35:17Z | [16183] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2019-11-28T12:37:06Z | [16184] Firing 1 - Gitaly error rate is too high: 2.38 |
2019-11-28T12:37:37Z | [16185] Firing 1 - High Error Rate on Front End Web |
2019-11-28T16:51:58Z | [16187] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2019-11-28T17:57:12Z | [16190] Firing 1 - Postgres transactions showing high rate of statement timeouts |
2019-11-29T00:46:11Z | [16192] Firing 1 - 5% disk space left |
2019-12-01T00:37:51Z | [16199] Firing 1 - Large number of overdue pull mirror jobs: 5105.5 |
2019-12-01T14:29:57Z | [16201] Firing 1 - Gitaly is down on file-19-stor-gprd.c.gitlab-production.internal |
2019-12-01T14:35:22Z | [16202] Firing 1 - Large number of overdue pull mirror jobs: 5727.5 |
2019-12-01T19:59:20Z | [16204] Firing 1 - staging.GitLab.com is down for 30 minutes |
2019-12-02T17:03:51Z | [16221] Firing 1 - Gitaly latency on file-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2019-12-02T23:22:59Z | [16223] test |
2019-12-03T03:17:15Z | [16225] Pingdom check check:https://version.gitlab.com/ is down |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 6
- Incident Issues : 5
- CorrectiveAction Issues : 0
Change Issues
- 2019-12-03T05:56:02Z - Upgrade Sentry instance to 9.1.2 - unassigned
- 2019-12-02T11:33:09Z - Change Request :: Add Vacuum metrics to Prometheus - Finotto
- 2019-12-02T11:03:58Z - Add additional web workers to reduce front-end saturation - jarv
- 2019-11-28T23:04:44Z - Clean up files open underneath NFS mounts on API servers - cmiskell
- 2019-11-27T01:28:12Z - Remove legacy prometheus IPs from firewall rules on CI - cmiskell
- 2019-11-26T20:15:05Z - DNS A Record Change - jenkins.tanuki.cloud - unassigned
Incident Issues
- 2019-12-02T12:37:22Z - 2019-12-03 Lots of GRPC Cancelled on file38 - T4cC0re | ~S2 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1427
- 2019-11-28T12:44:58Z - 2019-11-28 GitLab.com down - T4cC0re | ~S1 | ~"Service:API" ~"Service:CI Runners" ~"Service:Git" ~"Service:GitLab Rails" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1421
- 2019-11-27T11:19:38Z - 2019-11-27 Increased latency on API fleet - ahanselka | ~S1 | ~"Service:API" |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1419
- 2019-11-26T18:02:04Z - 2019-11-25: Errors reported from users on file-34 - ahanselka | ~S3 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1415
- 2019-11-26T17:20:53Z - 2019-11-26: High Gitaly Error Rate on file-34 - ahanselka | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/issues/1414
CorrectiveAction Issues
- 2019-12-02T18:09:21Z - Staging alerting and SLA discussion - unassigned
- 2019-11-28T14:32:33Z - Flush iptables rules that were mistakenly added - craigf
Open Issue Stats
- Oncall issues : 15
- Change issues : 23
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 66
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2019-12-03T05:56:02Z | unassigned | Upgrade Sentry instance to 9.1.2 |
2019-12-02T11:33:09Z | Finotto | Change Request :: Add Vacuum metrics to Prometheus |
2019-11-28T23:04:44Z | cmiskell | Clean up files open underneath NFS mounts on API servers |
2019-11-26T20:15:05Z | unassigned | DNS A Record Change - jenkins.tanuki.cloud |
2019-11-22T00:44:34Z | cindy | WIP: Ops migration to us central |
2019-11-20T19:06:06Z | msmiley | Blacklist URL at HAProxy Frontend |
2019-11-19T22:45:36Z | Finotto | Change Request :: Autovacuum tuning |
2019-11-18T22:09:10Z | DylanGriffith | Re-enable Elasticsearch Functionality on GitLab.com |
2019-11-13T13:36:42Z | bjk-gitlab | Deploy Thanos sidecar --min-time flag |
2019-11-06T00:09:14Z | msmiley | Delete stale residual Redis RDB files from redis-cache hosts |
2019-11-05T21:31:17Z | unassigned | Update PgBouncer to v1.12.0 |
2019-10-16T14:37:43Z | nnelson | Migrate large projects off file-33-stor-gprd to file-40-stor-gprd |
2019-10-15T15:13:30Z | nnelson | Migrate large projects off file-34-stor-gprd to file-40-stor-gprd |
2019-10-10T07:47:57Z | jarv | Remove the shared filesystem mounts on staging/canary/production |
2019-10-02T18:40:35Z | nnelson | Migrate large projects off file-33-stor-gprd to file-40-stor-gprd |
2019-09-27T10:54:08Z | hphilipps | Roll out Cloud NAT to CI shared runners |
2019-09-18T10:00:20Z | unassigned | Add two more nodes to the realtime fleet |
2019-09-12T08:00:43Z | andrewn | Add more hosts to the pipeline fleet |
2019-08-29T09:35:07Z | mwasilewski-gitlab | switch logging to the new ES7 clusters |
2019-08-14T19:42:03Z | gerardo.herzig | Removal of unused configuration files in patroni nodes |
2019-07-02T20:54:46Z | devin | Migrate to Hashed Storage from legacy project storage |
2019-06-19T06:59:56Z | unassigned | Implement pipeline quotas on GitLab.com |
2019-03-19T17:32:50Z | Finotto | Convert PK/FK from int4 to int8: events.id, push_event_payloads.event_id, and ci_build_trace_sections.id. Stage 1 of 2. |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2019-11-19T17:35:12Z | bjk-gitlab | OOM alerts firing on Thanos |
2019-10-29T03:36:15Z | unassigned | 2019-10-29 3:30 UTC Database statement timeouts while inserting data |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2019-11-26T08:40:30Z | unassigned | single_node_cpu/sidekiq Saturation: investigate uneven CPU load on sidekiq-asap-02 |
2019-11-25T14:16:00Z | craigf | Saturation spikes for web unicorn workers |
2019-11-19T11:00:52Z | cmcfarland | customers.stg.gitlab.com deployment error |
2019-11-13T12:36:33Z | hphilipps | RCA: 2019-11-12: Latency Apdex score degradation because of pgbouncer saturation |
2019-11-12T13:10:10Z | hphilipps | Fix sidekiq error ratio metrics |
2019-11-01T14:08:54Z | unassigned | Resource saturation: sidekiq_workers/sidekiq, pullmirrors fleet |
2019-11-01T13:56:45Z | aamarsanaa | Resource saturation: redis_memory on redis-cache |
2019-10-23T13:05:14Z | unassigned | cleanup registered nodes in chef |
2019-10-14T09:44:00Z | ahmadsherif | Rollout SIDEKIQ_MONITOR_WORKER=1 across the sidekiq fleet |
2019-10-08T18:52:44Z | cmcfarland | increase api nodes cpu utilization by adding more unicorn workers |
2019-10-08T10:43:58Z | unassigned | rails console scripts getting OOM killed on console-01-sv-gprd followed by high disk IO and VM being unresponsive |
2019-09-03T07:50:23Z | unassigned | Production Kibana returning 502s occasionally |
2019-08-30T14:34:47Z | unassigned | Customer Staging - Sentry is not reporting |
2019-06-18T09:15:35Z | unassigned | Create alert for registry latency or memory |
2019-06-13T03:47:53Z | aamarsanaa | Update Query source to Global in Grafana dashboards that are not pulling any metrics |
This issue was automatically generated using oncall-robot-assistant