OnCall report for period: 2020-03-10 - 2020-03-17
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Michal Wasilewski |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **66**
Created | Summary |
---|---|
2020-03-10T08:29:55Z | [18433] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-10T08:29:56Z | [18434] Firing 2 - AlertmanagerNotificationsFailing |
2020-03-10T15:26:12Z | [18455] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-10T16:40:50Z | [18458] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-10T23:26:26Z | [18468] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-11T08:26:50Z | [18483] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T08:41:50Z | [18486] Firing 1 - Alertmanager is failing sending notifications |
2020-03-11T08:41:51Z | [18487] Firing 1 - Alertmanager is failing sending notifications |
2020-03-11T10:07:51Z | [18492] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T10:14:54Z | [18494] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T10:23:50Z | [18495] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T10:45:51Z | [18498] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T12:09:55Z | [18508] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T12:21:51Z | [18510] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T13:16:50Z | [18515] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m |
2020-03-11T17:44:05Z | [18523] Firing 1 - Gitaly error rate is too high: 18.71 |
2020-03-11T19:10:56Z | [18527] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-11T21:16:12Z | [18535] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-11T22:21:41Z | [18544] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-11T23:23:42Z | [18549] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T00:23:41Z | [18555] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T01:26:12Z | [18560] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T02:19:46Z | [18566] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T03:22:13Z | [18573] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T04:13:42Z | [18578] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T04:26:12Z | [18582] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T05:13:42Z | [18586] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T05:28:12Z | [18590] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T06:15:42Z | [18595] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T06:32:09Z | [18599] Firing 1 - 5% disk space left |
2020-03-12T07:17:41Z | [18600] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T08:20:14Z | [18604] Firing 1 - 5xx Error Rate on staging.gitlab.com CloudFlare zone |
2020-03-12T08:39:23Z | [18608] Firing 1 - 5% disk space left |
2020-03-12T08:49:55Z | [18609] Firing 1 - 5% disk space left |
2020-03-12T08:54:38Z | [18610] Firing 1 - 5% disk space left |
2020-03-12T09:15:09Z | [18614] Firing 1 - Gitaly error rate is too high: 25.89 |
2020-03-12T09:18:24Z | [18616] Firing 1 - 5% disk space left |
2020-03-12T09:53:23Z | [18617] Firing 1 - 5% disk space left |
2020-03-12T10:13:23Z | [18619] Firing 1 - 5% disk space left |
2020-03-12T10:14:05Z | [18620] Firing 1 - Gitaly error rate is too high: 9.47 |
2020-03-12T11:40:27Z | [18626] Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes |
2020-03-12T11:40:28Z | [18627] Firing 1 - customers.gitlab.com is down for 2 minutes |
2020-03-12T23:32:26Z | [18649] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-03-13T11:11:50Z | [18657] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-13T14:40:07Z | [18664] Firing 1 - HPA unable to scale up |
2020-03-13T16:12:58Z | [18668] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-13T16:24:59Z | [18669] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-13T17:45:09Z | [18671] Firing 1 - HPA unable to scale up |
2020-03-14T00:30:31Z | [18685] Firing 1 - SSL certificate for https://prometheus.gitlab.com expires in 23h 29m 58s |
2020-03-14T02:22:57Z | [18690] Firing 1 - customers.gitlab.com is down for 2 minutes |
2020-03-14T02:22:58Z | [18691] Firing 1 - customers.gitlab.com is not responding correctly for 2 minutes |
2020-03-14T03:26:58Z | [18694] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-15T03:19:58Z | [18729] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-15T04:07:50Z | [18730] Firing 3 - IncreasedErrorRateOtherBackends |
2020-03-15T04:09:56Z | [18731] Firing 1 - patroni-06-db-gprd.c.gitlab-production.internal postgres service appears down |
2020-03-15T11:02:50Z | [18744] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-15T16:10:58Z | [18757] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-15T17:46:50Z | [18760] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-15T18:25:50Z | [18761] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-16T02:25:50Z | [18776] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-16T03:29:58Z | [18779] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-03-16T06:53:51Z | [18780] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-03-16T14:50:07Z | [18798] Firing 1 - HPA unable to scale up |
2020-03-16T17:28:06Z | [18803] Firing 1 - Gitaly error rate is too high: 25.09 |
2020-03-16T19:58:06Z | [18810] Firing 1 - Gitaly error rate is too high: 18.22 |
2020-03-16T21:23:11Z | [18814] Firing 1 - Gitaly error rate is too high: 19.19 |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 0
- Incident Issues : 17
- CorrectiveAction Issues : 0
Change Issues
Incident Issues
Created | Assignee | Severity | Label | Summary | Issue |
---|---|---|---|---|---|
2020-03-16T17:52:46Z | msmiley | ~S3 | | 2020-03-16 CPU saturation on gitaly node file-45 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1774 |
2020-03-16T09:58:17Z | unassigned | ~S2 | | 2020-03-16 Error occurred when fetching sidebar data | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1772 |
2020-03-16T08:41:32Z | unassigned | ~S3 | | Unable to run database migrations on production | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1771 |
2020-03-15T17:32:04Z | msmiley | ~S4 | ServiceSidekiq | 2020-03-15 sidekiq queue project_export saturation | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1769 |
2020-03-15T04:21:59Z | devin | ~S3 | ServicePostgres | 2020-03-15 Database Role Change | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1768 |
2020-03-12T11:44:28Z | unassigned | ~S2 | | 2020-03-12 customers.gitlab.com is down | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1763 |
2020-03-12T09:15:16Z | unassigned | ~S2 | | 2020-03-12 Gitaly errors high | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1762 |
2020-03-12T08:44:15Z | mwasilewski-gitlab | ~S4 | | 2020-03-12 Gitaly logs not ingested since 2020-03-09 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1761 |
2020-03-11T18:18:37Z | unassigned | ~S4 | ServiceGitaly | 2020-03-11: Gitaly error rate is high on file-45 | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1757 |
2020-03-11T09:58:15Z | unassigned | ~S2 | | 2020-03-11 prolonged redis-cache cpu saturation | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1755 |
2020-03-11T08:45:33Z | unassigned | ~S3 | | 2020-03-11 alertmanager fails to send notifications | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1754 |
2020-03-11T08:36:40Z | unassigned | ~S2 | | 2020-03-11 saturation on redis-cache, latency spike on web | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1753 |
2020-03-10T16:41:26Z | msmiley | ~S4 | ~"Service::ELK" | Rotate pubsubuser authentication token in both production and nonproduction logging cluster | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1752 |
2020-03-10T16:19:47Z | msmiley | ~S4 | ServicePostgres | Replication lag on Postgres DR archive replica is over 3 hours and growing | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1751 |
2020-03-10T16:04:02Z | msmiley | ~S4 | | Packagecloud DB Daily Backups Missing | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1750 |
2020-03-10T09:24:49Z | mwasilewski-gitlab | ~S3 | | 2020-03-10 ILM errors in production logging cluster | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1748 |
2020-03-10T08:47:13Z | mwasilewski-gitlab | ~S4 | | 2020-03-10 Alertmanager is failing to send notifications | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1747 |
CorrectiveAction Issues
- 2020-03-13T11:05:12Z - eliminate (and prevent?) redis config drift - unassigned
- 2020-03-12T17:44:52Z - Update customers.gitlab.com cheat sheet to prevent outages - cmcfarland
- 2020-03-12T16:34:24Z - Investigate ignore_changes for disk resizes in terraform - craig
- 2020-03-12T10:49:24Z - run performance tests using a linux repo in staging (and/or in production) continuously - unassigned
- 2020-03-12T10:04:34Z - check for version bump in metadata.rb in CI jobs for commits on master - unassigned
- 2020-03-12T09:43:47Z - do not run kitchen tests twice (on .com and on ops) - unassigned
- 2020-03-11T16:55:23Z - values for metrics missing in Grafana - unassigned
- 2020-03-10T19:06:17Z - cookbook gitlab-elk needs security patching - unassigned
- 2020-03-10T16:51:31Z - Corrective action - get patroni disk sizes back in sync with Terraform state - craig
Open Issue Stats
- Oncall issues : 3
- Change issues : 1
- Incident issues : 1
- Access Request : 5
- CorrectiveAction : 81
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-03-05T21:23:37Z | devin | Remove deprecated Digital Ocean instance of design.gitlab.com |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-03-16T17:52:46Z | msmiley | 2020-03-16 CPU saturation on gitaly node file-45 |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-03-16T23:20:22Z | ggillies | Project requires manually deletion |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant
Edited by Dave Smith