OnCall report for period: 2020-05-12 - 2020-05-19
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Hendrik Meyer |
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Cameron McFarland |
PagerDuty Incidents
* Number of incidents: **100**
Created | Summary |
---|---|
2020-05-12T08:30:46Z | [20686] Firing 1 - SSL certificate for https://license.gitlab.com expires in 15h 29m 58s |
2020-05-12T09:16:21Z | [20687] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-12T11:06:14Z | [20690] Firing 1 - Last WALE backup was seen 20m 2s ago. |
2020-05-12T13:26:03Z | [20691] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-12T15:23:50Z | [20696] Firing 1 - Gitaly latency on file-cny-01-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-12T15:38:51Z | [20699] Firing 1 - Gitaly latency on file-cny-01-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-12T15:58:21Z | [20701] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-12T16:29:52Z | [20702] Firing 1 - Gitaly latency on file-cny-01-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-12T18:15:26Z | [20707] Firing 2 - PostgreSQL_ExporterErrors |
2020-05-12T18:16:56Z | [20708] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-12T18:19:57Z | [20709] Firing 2 - PostgreSQL_WALEReplicationStopped |
2020-05-12T19:03:14Z | [20712] Firing 4 - PostgreSQL_ServiceDown |
2020-05-12T19:04:59Z | [20714] Firing 3 - PatroniIsDown |
2020-05-12T23:06:26Z | [20716] Firing 1 - Postgres seems to be processing very few transactions |
2020-05-12T23:55:05Z | [20718] Firing 1 - Alertmanager is failing to send notifications |
2020-05-12T23:55:35Z | [20719] Firing 2 - AlertmanagerNotificationsFailing |
2020-05-13T00:15:55Z | [20722] Firing 1 - Alertmanager is failing to send notifications |
2020-05-13T00:23:50Z | [20723] Firing 1 - Alertmanager is failing to send notifications |
2020-05-13T00:26:05Z | [20724] Firing 2 - AlertmanagerNotificationsFailing |
2020-05-13T01:23:12Z | [20725] Firing 4 - PostgreSQL_ServiceDown |
2020-05-13T03:05:13Z | [20726] Firing 3 - PatroniIsDown |
2020-05-13T08:29:11Z | [20731] Firing 1 - Postgres seems to be consuming XLOG very slowly |
2020-05-13T08:31:56Z | [20732] Firing 1 - Postgres Replication lag (in bytes) is high |
2020-05-13T08:34:27Z | [20733] Firing 1 - Postgres Replication lag is over 2 minutes |
2020-05-13T09:04:26Z | [20736] ~100 Inconsistencies in PROJECT_AUTHORIZATIONS Table should be resolved ASAP |
2020-05-13T09:22:21Z | [20737] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-13T09:45:21Z | [20738] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-13T11:05:28Z | [20739] Firing 3 - PatroniIsDown |
2020-05-13T12:28:26Z | [20740] Firing 4 - PostgreSQL_ServiceDown |
2020-05-13T12:40:50Z | [20741] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-13T12:46:05Z | [20742] Firing 4 - IncreasedErrorRateOtherBackends |
2020-05-13T12:46:49Z | [20743] Firing 1 - Increased Error Rate Across Fleet |
2020-05-13T12:47:12Z | [20744] Firing 2 - GitLabPagesProdFeIpPossibleChange |
2020-05-13T12:47:50Z | [20745] Firing 1 - Increased Server Response Errors |
2020-05-13T12:50:58Z | [20748] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-13T12:51:41Z | [20750] Firing 1 - High load in database patroni-02-db-gprd.c.gitlab-production.internal: 303.23 |
2020-05-13T13:12:42Z | [20751] Firing 1 - Postgres seems to be consuming XLOG very slowly |
2020-05-13T13:15:26Z | [20752] Firing 1 - Postgres Replication lag is over 2 minutes |
2020-05-13T13:15:26Z | [20753] Firing 1 - Postgres Replication lag (in bytes) is high |
2020-05-13T13:15:40Z | [20754] Firing 1 - Postgres seems to be processing very few transactions |
2020-05-13T13:28:26Z | [20755] Firing 1 - Postgres seems to be processing very few transactions |
2020-05-13T13:46:22Z | [20757] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-13T13:46:41Z | [20758] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-13T14:48:25Z | [20762] Firing 1 - HAProxy process high CPU usage on fe-registry-01-lb-gprd.c.gitlab-production.internal |
2020-05-13T15:02:28Z | [20763] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-13T15:50:06Z | [20766] Firing 3 - IncreasedErrorRateOtherBackends |
2020-05-13T16:28:08Z | [20770] 2020-05-13 Prod logs are not working |
2020-05-13T17:50:41Z | [20773] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-13T21:23:26Z | [20779] Firing 1 - postgres-dr-delayed-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-05-13T22:21:41Z | [20782] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-14T00:30:32Z | [20783] Firing 1 - SSL certificate for https://int.gprd.gitlab.net/users/sign_in expires in 23h 29m 58s |
2020-05-14T01:50:56Z | [20784] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-14T06:21:56Z | [20785] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-14T08:18:05Z | [20789] Firing 2 - IncreasedErrorRateOtherBackends |
2020-05-14T08:30:46Z | [20790] Firing 1 - SSL certificate for https://int.gprd.gitlab.net/users/sign_in expires in 15h 29m 58s |
2020-05-14T09:43:36Z | [20792] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-14T11:12:18Z | [20796] Degraded performance on shared CI runners |
2020-05-14T11:20:26Z | [20797] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-14T12:45:11Z | [20799] Firing 1 - Unused Replication Slots for patroni-11-db-gprd.c.gitlab-production.internal |
2020-05-14T14:22:12Z | [20805] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-14T14:39:59Z | [20807] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-14T15:15:46Z | [20811] On Rails Console, Refresh PROJECT_AUTHORIZATIONS for 5 users |
2020-05-14T16:43:58Z | [20818] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-14T17:27:36Z | [20819] Pingdom check check:https://gitlab.com/projects/new is down |
2020-05-14T17:28:20Z | [20820] Firing 1 - High Error Rate on Front End Web |
2020-05-14T17:28:21Z | [20821] Firing 1 - Increased Error Rate Across Fleet |
2020-05-14T17:29:59Z | [20823] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T17:29:59Z | [20824] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T17:32:58Z | [20827] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-14T17:33:06Z | [20828] Pingdom check check:https://gitlab.com/ is down |
2020-05-14T17:42:15Z | [20830] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T17:42:16Z | [20831] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T18:03:47Z | [20833] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T18:17:58Z | [20834] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-14T18:22:29Z | [20835] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T18:22:30Z | [20836] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-14T20:44:59Z | [20839] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-14T22:22:26Z | [20841] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-15T01:01:58Z | [20842] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-15T01:37:59Z | [20843] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-15T02:22:45Z | [20844] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-15T02:22:46Z | [20845] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-15T05:27:58Z | [20847] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-15T05:32:57Z | [20849] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-15T06:26:41Z | [20850] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-15T07:02:58Z | [20851] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-15T10:22:59Z | [20855] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-15T10:22:59Z | [20856] Firing 1 - GitLab.com is down for 2 minutes |
2020-05-15T13:32:58Z | [20858] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-15T18:07:58Z | [20861] Firing 1 - The sidekiq service (main stage) has an apdex score (latency) below SLO |
2020-05-15T21:37:58Z | [20864] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-16T05:42:58Z | [20870] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-16T13:47:15Z | [20874] Firing 1 - Last WALE backup was seen 20m 10s ago. |
2020-05-17T14:03:00Z | [20888] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-17T22:07:59Z | [20895] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-18T06:12:58Z | [20897] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-18T14:12:06Z | [20904] Firing 1 - Large number of overdue pull mirror jobs |
2020-05-18T14:41:10Z | [20906] Firing 1 - Multiple versions of Gitaly have been running alongside one another |
2020-05-18T18:08:24Z | [20908] 2020-05-18 Increasing number of Net::OpenTimeout errors sending mail |
2020-05-18T18:45:13Z | [20910] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
7 Day Issue Stats
- Oncall issues : 0
- Access Request : 0
- Change Issues : 0
- Incident Issues : 22
- CorrectiveAction Issues : 0
Change Issues
Incident Issues
- 2020-05-18T18:08:23Z - 2020-05-18 Problems reported with mailroom - unassigned | ~S3 | ServiceMailroom |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2154
- 2020-05-18T18:04:18Z - Version application deployments broken on dast step - devin | | ServiceVersion |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2153
- 2020-05-18T11:41:08Z - 2020-05-18: Canary gitaly is not being scraped by Prometheus - bjk-gitlab | ~S3 | ServiceMonitoring |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2150
- 2020-05-16T06:09:23Z - 503 errors from Cloudflare - craig | ~S4 | ServiceCloudflare |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2148
- 2020-05-15T18:23:21Z - 2020-05-15 Flood of project exports - unassigned | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2146
- 2020-05-15T09:43:15Z - 2020-05-15: gitaly service (cny) stage is not meeting apdex SLOs - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2143
- 2020-05-14T18:05:36Z - 2020-05-14 GitLab.com bad gateway errors from Cloudflare - cmcfarland | ~S1 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2139
- 2020-05-14T18:05:14Z - 2020-05-14 GitLab.com is down - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2138
- 2020-05-14T10:55:25Z - reduced indexing capacity of the production logging cluster - igorwwwwwwwwwwwwwwwwwwww | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2131
- 2020-05-14T08:34:24Z - Short drop in traffic due to starting Postgres node - unassigned | ~S1 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2130
- 2020-05-14T08:02:16Z - Global search incremental indexing queue growing - jarv | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2129
- 2020-05-13T17:13:23Z - Version application deployment state issue - devin | | ServiceVersion |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2126
- 2020-05-13T16:28:07Z - 2020-05-13 Prod logs are not working - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2125
- 2020-05-13T15:56:32Z - 2020-05-13 Drops in web APDEX in cny and main - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2124
- 2020-05-13T14:32:34Z - drop in the indexing rates for multiple indices - igorwwwwwwwwwwwwwwwwwwww | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2122
- 2020-05-13T13:22:11Z - High latencies on Gitaly canary - unassigned | ~S4 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2121
- 2020-05-13T12:59:20Z - Puma saturation causing 503s - Extremely high load on patroni-02 - unassigned | ~S1 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2119
- 2020-05-13T08:47:01Z - Short burst of Postgres replication lag - unassigned | ~S4 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2118
- 2020-05-13T03:16:51Z - Replication failing to postgres-dr-archive-01-db-gprd and postgres-dr-delayed-01-db-gprd - craig | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2116
- 2020-05-12T13:51:53Z - 2020-05-12 Gitaly canary slowdown - T4cC0re | ~S3 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2114
- 2020-05-12T12:03:20Z - Unclear sudden growth of Postgres - T4cC0re | ~S3 | ServicePatroni |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2113
- 2020-05-12T08:51:55Z - OOM kills in the ES cluster - mwasilewski-gitlab | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2112
CorrectiveAction Issues
- 2020-05-15T13:00:56Z - Alerting and monitoring of the global search indexing queue - unassigned
- 2020-05-13T15:21:13Z - Test changing instance type for file-cny-01 to C2 - unassigned
- 2020-05-13T14:35:27Z - Add monitoring to alert when Gitlab project pre-caching is falling behind - unassigned
- 2020-05-12T18:23:54Z - Allow and set specific concurrency limits for the canary file node - bjk-gitlab
Open Issue Stats
- Oncall issues : 3
- Change issues : 0
- Incident issues : 2
- Access Request : 4
- CorrectiveAction : 88
Open Change Issues
Created | Assignee | Summary |
---|---|---|
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-05-18T18:08:23Z | unassigned | 2020-05-18 Problems reported with mailroom |
2020-05-16T06:09:23Z | craig | 503 errors from Cloudflare |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-05-11T06:18:27Z | unassigned | Manually remove project |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant