OnCall report for period: 2020-03-24 - 2020-03-31
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Henri Philipps |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Nels Nelson |
PagerDuty Incidents
* Number of incidents: **80**
Created | Summary |
---|---|
2020-03-24T06:01:06Z | [19086] Firing 1 - Gitaly error rate is too high: 21.11 |
2020-03-24T18:31:06Z | [19099] Firing 1 - HPA unable to scale up |
2020-03-24T19:39:28Z | [19100] Firing 1 - Last WALE backup was seen 20m 10s ago. |
2020-03-24T22:28:56Z | [19107] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-25T02:42:14Z | [19118] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-25T03:13:56Z | [19120] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-25T03:24:26Z | [19121] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-25T12:46:50Z | [19128] Firing 1 - Increased Error Rate Across Fleet |
2020-03-25T15:23:08Z | [19130] Firing 1 - prometheus is restarting frequently |
2020-03-25T15:45:22Z | [19131] Firing 1 - prometheus is unreachable |
2020-03-25T18:00:08Z | [19134] Firing 2 - PrometheusUnreachable |
2020-03-25T18:09:37Z | [19135] Firing 1 - Prometheus not connected to any Alertmanagers |
2020-03-25T18:09:38Z | [19136] Firing 1 - Prometheus not connected to any Alertmanagers |
2020-03-25T18:39:58Z | [19138] Firing 1 - prometheus is restarting frequently |
2020-03-25T18:48:37Z | [19139] Firing 1 - prometheus is restarting frequently |
2020-03-25T19:02:29Z | [19140] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-26T16:12:06Z | [19169] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-26T16:12:06Z | [19170] Firing 1 - Alertmanager is failing sending notifications |
2020-03-26T18:46:57Z | [19173] Need help rotating creds for gitlab-superuser (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9623) |
2020-03-27T00:33:01Z | [19177] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-03-27T00:43:16Z | [19179] Firing 1 - Redis master missing for redis-sidekiq |
2020-03-27T00:43:30Z | [19180] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-03-28T11:10:14Z | [19210] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down |
2020-03-28T11:10:41Z | [19211] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down |
2020-03-28T11:10:47Z | [19212] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down |
2020-03-28T11:10:59Z | [19213] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down |
2020-03-28T11:12:32Z | [19214] Pingdom check check:https://gitlab.com/projects/new is down |
2020-03-28T11:13:02Z | [19215] Pingdom check check:https://gitlab.com/ is down |
2020-03-28T11:13:15Z | [19216] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-03-28T11:13:44Z | [19217] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2020-03-28T11:14:05Z | [19218] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down |
2020-03-28T11:24:14Z | [19220] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-28T11:24:14Z | [19221] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-28T12:20:15Z | [19224] Firing 1 - Chef client failures have reached critical levels |
2020-03-28T12:25:14Z | [19225] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-28T12:25:14Z | [19226] Firing 1 - GitLab.com is down for 2 minutes |
2020-03-28T13:22:06Z | [19228] Firing 1 - Gitaly error rate is too high: 15.18 |
2020-03-28T13:54:54Z | [19229] Firing 1 - Last WALE backup was seen 20m 0s ago. |
2020-03-28T19:32:46Z | [19240] We need to do something about those 429s. |
2020-03-28T22:04:07Z | [19246] Firing 1 - High 4xx Error Rate on Front End Web |
2020-03-28T22:14:07Z | [19247] Firing 1 - High 4xx Error Rate on Front End Web |
2020-03-28T22:21:07Z | [19248] Firing 1 - High 4xx Error Rate on Front End Web |
2020-03-29T05:47:08Z | [19261] Firing 1 - 5% disk space left |
2020-03-29T15:23:08Z | [19266] Firing 1 - Gitaly error rate is too high: 15.42 |
2020-03-30T00:34:28Z | [19274] Managerial assistance/authority for CloudFlare issues |
2020-03-30T03:39:54Z | [19279] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-03-30T03:49:47Z | [19280] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-03-30T03:59:54Z | [19281] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster |
2020-03-30T04:09:44Z | [19282] Firing 10 - PostgreSQL_XLOGConsumptionTooLow |
2020-03-30T04:12:56Z | [19283] Firing 10 - PostgreSQL_ReplicationLagBytesTooLarge |
2020-03-30T04:15:11Z | [19284] Firing 10 - PostgreSQL_ReplicationLagTooLarge |
2020-03-30T04:16:26Z | [19286] Firing 10 - PostgreSQL_CommitRateTooLow |
2020-03-30T04:16:50Z | [19287] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-30T04:28:20Z | [19291] Firing 1 - Large number of overdue pull mirror jobs |
2020-03-30T04:31:51Z | [19292] Firing 1 - API latency on GitLab.com has been over 1500ms during the last 5m |
2020-03-30T04:32:06Z | [19293] Firing 2 - IncreasedErrorRateOtherBackends |
2020-03-30T04:35:25Z | [19295] DB outage in progress |
2020-03-30T04:35:51Z | [19296] Firing 1 - Git latency on GitLab.com has been over 450ms during the last 5m |
2020-03-30T04:37:12Z | [19297] Firing 10 - PostgreSQL_UnusedReplicationSlot |
2020-03-30T04:44:46Z | [19298] P1 production incident |
2020-03-30T04:53:51Z | [19299] Firing 1 - Git latency on GitLab.com has been over 450ms during the last 5m |
2020-03-30T04:54:15Z | [19300] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down |
2020-03-30T05:34:11Z | [19302] Firing 1 - patroni-09-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-30T05:41:26Z | [19303] Firing 9 - PostgreSQL_CommitRateTooLow |
2020-03-30T05:41:51Z | [19304] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-30T05:49:12Z | [19309] Firing 1 - |
2020-03-30T05:59:13Z | [19310] Firing 2 - PostgreSQL_ServiceDown |
2020-03-30T06:06:56Z | [19312] Firing 5 - PostgreSQL_CommitRateTooLow |
2020-03-30T06:27:41Z | [19314] Firing 2 - PostgreSQL_CommitRateTooLow |
2020-03-30T09:23:11Z | [19318] Firing 3 - PostgreSQL_ReplicaStaleXmin |
2020-03-30T10:11:14Z | [19319] Firing 1 - Chef client failures have reached critical levels |
2020-03-30T11:03:31Z | [19320] Rotate gitlab-replicator Credentials Found in patroni.yml Potentially Leaked in Issue |
2020-03-30T12:06:58Z | [19323] Firing 1 - Chef client failures have reached critical levels |
2020-03-30T13:49:46Z | [19325] Firing 1 - Redis cluster redis-cache is missing instances |
2020-03-30T21:01:14Z | [19340] Firing 4 - ChefClientErrorCritical |
2020-03-30T21:02:01Z | [19341] Firing 2 - ChefClientErrorCritical |
2020-03-31T02:00:12Z | [19344] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-31T03:17:56Z | [19346] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-31T05:01:28Z | [19347] Firing 4 - ChefClientErrorCritical |
2020-03-31T05:02:15Z | [19348] Firing 2 - ChefClientErrorCritical |
7 Day Issue Stats
- Oncall issues : 7
- Access Request : 1
- Change Issues : 4
- Incident Issues : 20
- CorrectiveAction Issues : 0
Change Issues
- 2020-03-29T18:10:32Z - Correct `X-Forwarded-For` header either at haproxy config level or in cloudflare - nnelson
- 2020-03-28T22:44:02Z - Revert rate-limiting settings for haproxy - nnelson
- 2020-03-28T19:39:37Z - Disable rate limiting on HAProxy - nnelson
- 2020-03-24T15:14:32Z - Delete remaining projects without hashed storage feature which are in the pending delete state - nnelson
Incident Issues
- 2020-03-30T17:01:13Z - 2020-03-30: The `sidekiq` service (`main` stage) has a apdex score (latency) below SLO - nnelson | ~S3 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1872
- 2020-03-30T14:33:23Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1870
- 2020-03-30T14:33:11Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1869
- 2020-03-30T14:33:06Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1868
- 2020-03-30T04:19:22Z - 2020-03-30 Database failover and loss of sync to replicas - cmiskell | ~S1 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1865
- 2020-03-29T15:29:05Z - 2020-03-29: Gitaly error rate on `nfs-file46` is too high: 15.42 - nnelson | ~S4 | ServiceGitaly |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1859
- 2020-03-29T06:22:41Z - 2020-03-29 /var/log almost full on git-09-sv-gprd - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1858
- 2020-03-28T22:24:51Z - 2020-03-28: High 4xx Error Rate on Front End Web - nnelson | ~S4 | ServiceHAProxy |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1856
- 2020-03-28T13:34:32Z - High CPU usage on file-46 and gitaly error rate above SLO - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1854
- 2020-03-27T13:00:35Z - ExternalDiffUploader throwing 500s - unassigned | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1852
- 2020-03-26T16:19:32Z - 2020-03-24: Alertmanager is seeing errors for integration webhook & `ci-runners` service (`main` stage) has a apdex score (latency) below SLO - nnelson | ~S4 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1846
- 2020-03-26T06:48:34Z - 2020-03-26 The `sidekiq` service (`main` stage) has a apdex score (latency) below SLO - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1842
- 2020-03-25T15:46:13Z - 2020-03-25: Prometheus is restarting frequently & Prometheus is unreachable - nnelson | ~S4 | ServicePrometheus |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1838
- 2020-03-25T11:57:36Z - Massive API requests to a single endpoint causing SLO alerts - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1832
- 2020-03-24T22:33:28Z - 2020-03-24: Postgres Replication lag is over 3 hours on archive recovery replica - nnelson | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1827
- 2020-03-24T20:08:57Z - 2020-03-24: Last WALE backup was seen 20m 10s ago - nnelson | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1826
- 2020-03-24T18:40:20Z - 2020-03-24: HPA unable to scale up - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1825
- 2020-03-24T18:23:03Z - 2020-03-24: Potential password spraying activity - nnelson | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1824
- 2020-03-24T13:58:33Z - Elevated API request rate leading to SLO violations - hphilipps | ~S3 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1821
- 2020-03-24T06:22:03Z - 2020-03-24: Gitaly error rate is too high: 21.11 - cmiskell | | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1819
CorrectiveAction Issues
- 2020-03-25T19:37:28Z - Split cookbook publishing MRs - alejandro
Open Issue Stats
- Oncall issues : 9
- Change issues : 2
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 84
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-03-24T15:14:32Z | nnelson | Delete remaining projects without hashed storage feature which are in the pending delete state |
2020-01-14T23:08:17Z | nnelson | Migrate large projects off file-35-stor-gprd to file-45-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-22T05:52:53Z | ggillies | Disk filling up on web-33-sv-gprd.c.gitlab-production.internal |
2020-03-17T14:00:31Z | bjk-gitlab | Inconsistencies between responses returned in Grafana, Prometheus and Thanos |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-30T09:26:36Z | hphilipps | RCA: 2020-03-30 Database failover and loss of sync to replicas |
2020-03-27T21:48:14Z | cynthia | Import request (for SCGTS): q2c (itimsxgen) |
2020-03-27T21:43:04Z | cynthia | Import request (for SCGTS): q2c (webapps) |
2020-03-26T19:51:21Z | unassigned | Fix access scopes on postgres-dr-delayed-01-db-gprd |
2020-03-24T17:05:33Z | unassigned | Project deletion required |
2020-03-23T23:43:57Z | unassigned | Manually remove project |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant
Edited by AnthonySandoval