OnCall report for period: 2020-04-21 - 2020-04-28
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Michal Wasilewski |
SRE 8 Hour | Craig Miskell |
SRE 8 Hour | Matt Smiley |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **30**
Created | Summary |
---|---|
2020-04-21T15:02:56Z | [20014] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2020-04-21T18:21:50Z | [20017] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-04-21T18:43:11Z | [20018] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-22T00:18:11Z | [20024] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-04-22T17:10:34Z | [20043] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-22T17:10:35Z | [20044] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-22T17:44:12Z | [20048] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-23T01:44:27Z | [20056] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-23T10:40:51Z | [20063] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-23T11:06:26Z | [20065] Firing 11 - PostgreSQL_ReplicaStaleXmin |
2020-04-23T13:35:07Z | [20066] Firing 1 - Large number of overdue pull mirror jobs |
2020-04-23T14:16:13Z | [20069] Firing 1 - Chef client failures have reached critical levels |
2020-04-23T14:19:50Z | [20070] Firing 1 - Chef client failures have reached critical levels |
2020-04-23T14:41:14Z | [20071] Firing 1 - Chef client failures have reached critical levels |
2020-04-23T15:01:38Z | [20073] Firing 1 - HAProxy process high CPU usage on fe-registry-02-lb-gprd.c.gitlab-production.internal |
2020-04-23T18:12:56Z | [20076] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-24T00:09:35Z | [20081] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-04-24T00:49:24Z | [20082] Pingdom check check:https://snowplow.trx.gitlab.net/health is down |
2020-04-24T02:13:13Z | [20084] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-24T10:47:42Z | [20087] Firing 1 - GitLab Pages production front-end IP could've been changed |
2020-04-24T10:48:28Z | [20088] Firing 1 - GitLab Pages production front-end IP could've been changed |
2020-04-24T10:48:53Z | [20089] Firing 1 - thanos is restarting frequently |
2020-04-24T10:53:11Z | [20091] increased error rates in the pages service |
2020-04-24T14:30:12Z | [20093] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-24T14:30:13Z | [20094] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-24T17:09:04Z | [20099] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-24T17:09:34Z | [20100] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-04-25T15:25:50Z | [20143] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-04-26T14:52:35Z | [20198] Firing 1 - Increased Error Rate Across Fleet |
2020-04-27T06:41:58Z | [20216] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
7 Day Issue Stats
- Oncall issues : 1
- Access Request : 1
- Change Issues : 0
- Incident Issues : 20
- CorrectiveAction Issues : 0
Change Issues
Incident Issues
- 2020-04-27T08:04:30Z - Firing 1 - The `patroni` service (`main` stage) has a apdex score (latency) below SLO - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2010
- 2020-04-26T14:58:44Z - increased error rates in the pages service - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2008
- 2020-04-26T06:34:16Z - Firing 1 - The `ci-runners` service (`main` stage) has a apdex score (latency) below SLO - unassigned | ~S3 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2006
- 2020-04-26T06:05:48Z - Firing 1 - The `workhorse` component of the `web` service, (`cny` stage), has a apdex-score burn rate outside of SLO - ggillies | ~S3 | ServiceWeb |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2005
- 2020-04-24T14:46:28Z - repository_import sidekiq jobs failing - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2003
- 2020-04-24T14:46:28Z - repository_import sidekiq jobs failing - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2002
- 2020-04-24T10:53:07Z - increased error rates in the pages service - mwasilewski-gitlab | ~S2 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1999
- 2020-04-24T00:49:28Z - Pingdom check check:https://snowplow.trx.gitlab.net/health is down - unassigned | ~S3 | ServiceSnowplow |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1995
- 2020-04-23T16:11:05Z - 2020-04-23 Recent registry logs are missing from Elasticsearch - unassigned | ~S4 | ServiceContainer Registry |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1990
- 2020-04-23T13:39:09Z - Short bursts of catchall jobs - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1987
- 2020-04-23T11:35:05Z - PostgreSQL replicas are falling behind the primary - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1986
- 2020-04-22T13:20:37Z - Load on file-cny-01 is high - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1982
- 2020-04-22T13:04:13Z - Practice incident - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1981
- 2020-04-22T11:48:45Z - fluentd-elasticsearch DaemonSets flapping (due to failing liveness checks) in multiple envs (including gprd), on multiple nodes - jarv | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1979
- 2020-04-22T09:07:01Z - Lack of observability - unassigned | ~S2 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1978
- 2020-04-22T04:34:11Z - 2020-04-22: Multiple CI jobs using Postgres are failing - ggillies | ~S2 | ServiceCI Runners |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1976
- 2020-04-21T22:39:41Z - 2020-04-21 increased error rates on web-pages-03 and 06 - msmiley | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1973
- 2020-04-21T16:03:59Z - ProjectDestroyWorker sidekiq job holds db transaction open for 1 hour - msmiley | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1971
- 2020-04-21T11:37:27Z - increased error rates on web-pages-06 - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1969
- 2020-04-21T09:17:24Z - a small number of rails logs is being rejected by the ES logging cluster - mwasilewski-gitlab | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1968
CorrectiveAction Issues
- 2020-04-22T23:37:59Z - Reduce disk space contention on CI runner VMs created by gsrm - msmiley
Open Issue Stats
- Oncall issues : 5
- Change issues : 1
- Incident issues : 2
- Access Request : 5
- CorrectiveAction : 87
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-04-14T20:25:05Z | cindy | Migrate large projects off file-25-stor-gprd to file-01-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-04-26T06:34:16Z | unassigned | Firing 1 - The ci-runners service (main stage) has a apdex score (latency) below SLO |
2020-04-26T06:05:48Z | ggillies | Firing 1 - The workhorse component of the web service, (cny stage), has a apdex-score burn rate outside of SLO |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-04-23T12:53:53Z | unassigned | Set GITLAB_QA_FORMLESS_LOGIN_TOKEN variable on /etc/gitlab/gitlab.rb on live environments |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-30T09:26:36Z | hphilipps | RCA: 2020-03-30 Database failover and loss of sync to replicas |
2020-03-23T23:43:57Z | ggillies | Manually remove project |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant