# OnCall report for period: 2020-05-05 - 2020-05-12
## Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Alex Hanselka |
SRE 8 Hour | Craig Barrett |
SRE 8 Hour | Amar Amarsanaa |
SRE 8 Hour | Henri Philipps |
## PagerDuty Incidents
* Number of incidents: **94**
Created | Summary |
---|---|
2020-05-05T10:58:58Z | [20421] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-05T11:43:21Z | [20430] Firing 1 - Alertmanager is failing sending notifications |
2020-05-05T11:43:21Z | [20431] Firing 1 - Alertmanager is failing sending notifications |
2020-05-05T11:58:20Z | [20437] Firing 1 - Alertmanager is failing sending notifications |
2020-05-05T12:08:22Z | [20440] Firing 1 - Alertmanager is failing sending notifications |
2020-05-05T12:18:22Z | [20443] Firing 2 - AlertmanagerNotificationsFailing |
2020-05-05T12:18:23Z | [20444] Firing 2 - AlertmanagerNotificationsFailing |
2020-05-05T13:05:58Z | [20449] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-05T16:07:58Z | [20453] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-05T17:43:52Z | [20457] Firing 1 - prometheus is unreachable |
2020-05-05T18:32:37Z | [20458] Firing 1 - prometheus is restarting frequently |
2020-05-06T04:00:56Z | [20472] Firing 1 - PostgreSQL dead tuples is too large |
2020-05-06T05:45:50Z | [20474] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-06T08:01:58Z | [20480] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-06T08:03:58Z | [20481] Firing 1 - The sidekiq service (main stage) has an error-ratio exceeding SLO |
2020-05-06T12:43:11Z | [20489] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-06T14:26:50Z | [20491] Firing 1 - HAProxy process high CPU usage on fe-registry-02-lb-gprd.c.gitlab-production.internal |
2020-05-06T15:37:34Z | [20495] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-06T15:37:35Z | [20496] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-06T17:10:58Z | [20500] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-06T18:02:05Z | [20501] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-06T18:15:57Z | [20502] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-06T20:21:58Z | [20507] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-06T20:43:28Z | [20508] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-07T00:47:58Z | [20512] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T01:20:58Z | [20513] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T04:28:11Z | [20514] Pingdom check check:https://gitlab.com/ is down |
2020-05-07T04:29:17Z | [20515] Pingdom check check:https://about.gitlab.com/ is down |
2020-05-07T04:31:14Z | [20516] Pingdom check check:http://about.gitlab.com/ is down |
2020-05-07T04:33:16Z | [20517] Firing 1 - www.gitlab.com is down for 2 minutes |
2020-05-07T04:33:21Z | [20518] Pingdom check check:http://gitlab.org/ is down |
2020-05-07T04:41:58Z | [20519] Firing 1 - Chef client failures have reached critical levels |
2020-05-07T04:43:41Z | [20520] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-07T04:59:58Z | [20521] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T05:05:57Z | [20522] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T07:37:58Z | [20523] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T08:27:58Z | [20524] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T08:52:56Z | [20525] Firing 1 - WAL-E replication has stopped |
2020-05-07T13:43:21Z | [20529] Firing 1 - HAProxy process high CPU usage on fe-registry-02-lb-gprd.c.gitlab-production.internal |
2020-05-07T13:49:58Z | [20530] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T14:36:04Z | [20532] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-07T14:36:05Z | [20533] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-07T16:05:58Z | [20534] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T16:23:58Z | [20535] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T17:22:58Z | [20536] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T19:31:58Z | [20538] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T19:50:00Z | [20540] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T20:15:58Z | [20542] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-07T21:19:23Z | [20543] Firing 1 - Increased Error Rate Across Fleet |
2020-05-08T02:53:40Z | [20547] Firing 1 - Gitaly is down on file-52-stor-gprd.c.gitlab-production.internal |
2020-05-08T03:46:57Z | [20548] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T06:43:58Z | [20549] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T08:20:58Z | [20550] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T10:33:12Z | [20552] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-08T11:46:58Z | [20554] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T12:07:58Z | [20555] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T12:34:53Z | [20556] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-08T14:34:35Z | [20564] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-08T14:34:35Z | [20565] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-08T14:50:07Z | [20567] Firing 1 - staging.GitLab.com is down for 30 minutes |
2020-05-08T15:08:58Z | [20568] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T16:05:58Z | [20570] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-08T20:33:21Z | [20579] Firing 1 - Alertmanager is failing sending notifications |
2020-05-08T20:33:50Z | [20580] Firing 1 - Alertmanager is failing sending notifications |
2020-05-09T05:51:58Z | [20591] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-05-09T06:16:50Z | [20592] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T09:48:07Z | [20598] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T09:58:06Z | [20599] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T10:08:06Z | [20600] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T10:17:50Z | [20602] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T10:27:50Z | [20603] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T10:37:50Z | [20604] Firing 1 - Gitaly latency on file-51-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-05-09T16:06:00Z | [20608] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-09T23:55:59Z | [20611] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-10T10:55:26Z | [20625] Firing 5 - PostgreSQL_ExporterErrors |
2020-05-10T18:01:06Z | [20636] Firing 3 - IncreasedErrorRateOtherBackends |
2020-05-10T18:01:42Z | [20637] Firing 2 - GitLabPagesProdFeIpPossibleChange |
2020-05-10T18:04:57Z | [20638] Firing 2 - PostgreSQL_WALEReplicationStopped |
2020-05-10T18:05:55Z | [20639] Firing 1 - High load in database patroni-02-db-gprd.c.gitlab-production.internal: 299.44 |
2020-05-10T18:13:26Z | [20640] Firing 4 - PostgreSQL_ServiceDown |
2020-05-10T18:13:52Z | [20641] Extremely high load on patroni-02 |
2020-05-10T18:15:13Z | [20642] Firing 4 - PatroniIsDown |
2020-05-10T18:50:27Z | [20643] Firing 2 - PostgreSQL_ExporterErrors |
2020-05-10T18:51:56Z | [20644] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours) |
2020-05-11T05:59:58Z | [20652] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO |
2020-05-11T07:53:42Z | [20654] Firing 1 - sentry.gitlab.net is down |
2020-05-11T13:59:09Z | [20660] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down |
2020-05-11T13:59:29Z | [20661] Firing 1 - HPA unable to scale up |
2020-05-11T15:20:42Z | [20664] Firing 1 - Last WALE backup was seen 42d 11h 14m 2s ago. |
2020-05-11T16:02:17Z | [20667] Firing 1 - Last WALE backup was seen 42d 11h 55m 32s ago. |
2020-05-11T16:07:57Z | [20669] Firing 1 - Last WALE backup was seen 42d 12h 1m 47s ago. |
2020-05-11T17:46:43Z | [20672] Firing 1 - Chef client failures have reached critical levels |
2020-05-12T00:24:57Z | [20680] Firing 1 - The waf service, zone_gitlab_com component, main stage, has an error burn-rate exceeding SLO |
2020-05-12T00:30:31Z | [20682] Firing 1 - SSL certificate for https://license.gitlab.com expires in 23h 29m 58s |
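With 94 raw incidents above, most of the volume comes from a handful of repeatedly firing alerts (sidekiq apdex, Alertmanager notification failures, Gitaly latency). A quick way to see that is to group the table's summary column by alert name. This is a hedged sketch, not part of the report generator: the `alert_name` helper and its prefix-stripping regex are assumptions about the summary format shown above.

```python
import re
from collections import Counter

# A few summary strings copied from the incident table above; in practice
# these would come from the full export of the table's second column.
rows = [
    "[20421] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO",
    "[20430] Firing 1 - Alertmanager is failing sending notifications",
    "[20431] Firing 1 - Alertmanager is failing sending notifications",
    "[20449] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO",
    "[20512] Firing 1 - The sidekiq service (main stage) has a apdex score (latency) below SLO",
]

def alert_name(summary: str) -> str:
    # Strip the "[id] " prefix and optional "Firing N - " marker so that
    # repeated firings of the same alert group together.
    return re.sub(r"^\[\d+\] (Firing \d+ - )?", "", summary)

counts = Counter(alert_name(s) for s in rows)
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

On the five sample rows this prints the sidekiq apdex alert and the Alertmanager failure twice each; run over all 94 rows it gives a noise ranking that is useful when deciding which alerts to tune.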
## 7 Day Issue Stats
- Oncall issues: 10
- Access Request issues: 1
- Change issues: 2
- Incident issues: 15
- CorrectiveAction issues: 0
## Change Issues
- 2020-05-07T14:21:32Z - Create new gitaly storage shard node `file-52-stor-gprd` to replace `file-45-stor-gprd` in the configured rotation for storing new projects - nnelson
- 2020-05-06T14:25:22Z - Create new gitaly storage shard node `file-51-stor-gprd` to replace `file-41-stor-gprd` in the configured rotation for storing new projects - nnelson
## Incident Issues
- 2020-05-12T00:56:16Z - Stale SSL certificate for gitlab.com - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2110
- 2020-05-11T19:12:47Z - Kibana Down - unassigned | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2109
- 2020-05-11T17:32:31Z - td-agent is crashing due to a misconfiguration - unassigned | ~S3 | ~"Service::Monitoring" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2108
- 2020-05-11T14:44:34Z - 2020-05-11: High gitaly request rate on canary - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2107
- 2020-05-11T11:27:09Z - elastic log prod cluster master node not available - mwasilewski-gitlab | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2104
- 2020-05-11T08:19:29Z - Sentry is down - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2102
- 2020-05-10T18:13:51Z - Extremely high load on patroni-02 - unassigned | ~S3 | ~"Service::Patroni" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2101
- 2020-05-09T10:29:45Z - 2020-05-09: gitaly latency on file-51 - unassigned | ~S3 | ~"Service::Gitaly" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2099
- 2020-05-08T17:59:21Z - Kibana is unaccessible - unassigned | ~S2 ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2098
- 2020-05-07T12:54:13Z - 2020-05-07: Sidekiq INFO logs missing in ELK - hphilipps | ~S4 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2090
- 2020-05-07T04:30:53Z - about.gitlab.com is down - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2087
- 2020-05-07T02:05:06Z - PagerDuty alerts: The sidekiq service (main stage) has a apdex score (latency) below SLO - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2084
- 2020-05-05T18:52:36Z - 2020-05-05: GPRD Kibana returning 502 - AnthonySandoval | ~S2 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2078
- 2020-05-05T12:06:59Z - 2020-05-05: ES prod logging cluster unavailable - hphilipps | ~S3 | ~"Service::ELK" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2073
- 2020-05-05T11:54:36Z - 2020-05-05: regression in production deployment - hphilipps | ~S2 | ~"Service::Web" |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/2072
## CorrectiveAction Issues
- 2020-05-07T10:10:04Z - Follow up actions on the ILM errors in production - igorwwwwwwwwwwwwwwwwwwww
- 2020-05-07T05:50:35Z - Update runbook: about-gitlab-com.md - aamarsanaa
- 2020-05-06T12:45:04Z - Increase GCP load-balancer timeout in front of logging cluster IAP - ahanselka
## Open Issue Stats
- Oncall issues: 4
- Change issues: 2
- Incident issues: 1
- Access Request issues: 4
- CorrectiveAction issues: 90
## Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-05-07T14:21:32Z | nnelson | Create new gitaly storage shard node file-52-stor-gprd to replace file-45-stor-gprd in the configured rotation for storing new projects |
2020-03-26T19:16:25Z | nnelson | Rotate credentials for user gitlab-superuser |
## Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-05-04T10:31:56Z | ahmadsherif | 2020-05-04: possible data loss via external diffs migration |
## Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-05-11T06:18:27Z | unassigned | Manually remove project |
2020-04-23T12:53:53Z | cindy | Set GITLAB_QA_FORMLESS_LOGIN_TOKEN variable on /etc/gitlab/gitlab.rb on live environments |
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant