OnCall report for period: 2020-04-14 - 2020-04-21
Oncall during this period
Schedule | Username |
---|---|
SRE 8 Hour | Ahmad Sherif |
SRE 8 Hour | Cindy Pallares |
SRE 8 Hour | Devin Sylva |
SRE 8 Hour | Graeme Gillies |
PagerDuty Incidents
* Number of incidents: **24**
Created | Summary |
---|---|
2020-04-14T10:22:27Z | [19894] Firing 1 - Last WALE backup was seen 20m 1s ago. |
2020-04-14T10:32:42Z | [19895] Firing 1 - Last WALE backup was seen 30m 1s ago. |
2020-04-14T10:33:13Z | [19896] Firing 1 - Last WALE backup was seen 30m 1s ago. |
2020-04-14T15:28:01Z | [19901] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-14T16:06:35Z | [19902] Firing 1 - Increased Error Rate Across Fleet |
2020-04-14T16:17:15Z | [19903] Firing 1 - Connection of Redis replicas to the master is flapping |
2020-04-14T16:32:01Z | [19904] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-14T16:37:45Z | [19905] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-14T16:42:45Z | [19906] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-14T16:47:45Z | [19907] Firing 1 - Less than 100% of sentinel processes running in the redis cluster |
2020-04-14T19:01:20Z | [19911] Firing 1 - Increased Error Rate Across Fleet |
2020-04-14T20:56:06Z | [19912] Firing 1 - Gitaly latency on file-praefect-02-stor-gprd.c.gitlab-production.internal has been over 1m during the last 5m |
2020-04-16T03:30:58Z | [19940] Firing 1 - Chef client failures have reached critical levels |
2020-04-16T10:50:00Z | [19946] HAProxy block required for P1/S1 - gitlab-org/gitlab#214636 (closed) |
2020-04-16T11:31:14Z | [19948] Firing 1 - Chef client failures have reached critical levels |
2020-04-18T06:27:58Z | [19962] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-19T10:48:58Z | [19976] Firing 1 - The patroni service (main stage) has a apdex score (latency) below SLO |
2020-04-20T16:46:15Z | [19992] Firing 1 - Last WALE backup was seen 20m 8s ago. |
2020-04-20T18:25:29Z | [19994] Firing 1 - Chef client failures have reached critical levels |
2020-04-20T18:46:28Z | [19995] Firing 1 - Chef client failures have reached critical levels |
2020-04-20T19:37:56Z | [19996] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica |
2020-04-20T23:11:05Z | [19998] Firing 1 - Gitaly error rate is too high: 11.15 |
2020-04-21T03:29:28Z | [20003] Firing 1 - The waf service, zone_gitlab_net component, main stage, has an error burn-rate exceeding SLO |
2020-04-21T03:35:30Z | [20004] Firing 1 - The waf service, zone_gitlab_net component, main stage, has an error burn-rate exceeding SLO |
7 Day Issue Stats
- Oncall issues : 2
- Access Request : 0
- Change Issues : 2
- Incident Issues : 7
- CorrectiveAction Issues : 0
Change Issues
- 2020-04-14T20:25:05Z - Migrate large projects off file-25-stor-gprd to file-01-stor-gprd - cindy
- 2020-04-14T15:17:59Z - Create new gitaly storage shard node file-50-stor-gprd to replace file-44-stor-gprd in the configured rotation for storing new projects - cmcfarland
Incident Issues
- 2020-04-20T23:26:27Z - Firing 1 - Gitaly error rate is too high: 11.15 - unassigned | ~S3 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1966
- 2020-04-20T20:27:17Z - 2020-04-20: Postgres Replication lag is over 3 hours on archive recovery replica - unassigned | ~S4 | ServicePostgres |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1964
- 2020-04-17T16:29:48Z - 2020-04-17: Increased error rate in GitLab pages - unassigned | ~S4 | ServicePages |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1952
- 2020-04-16T10:10:09Z - 2020-04-15: Cloudflare metrics and logs missing, Cloudflare API unavailable - unassigned | ~S4 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1947
- 2020-04-15T13:38:56Z - High rate of 400 errors uploading CI artifacts - unassigned | ~S1 | |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1944
- 2020-04-14T21:28:26Z - 2020-04-14: Elevated errors for API - unassigned | ~S4 | ServiceAPI |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1936
- 2020-04-14T07:41:25Z - 2020-04-14: Sidekiq urgent-other fleet is not keeping up with workload - andrewn | ~S4 | ServiceSidekiq |
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1930
CorrectiveAction Issues
- 2020-04-20T13:02:38Z - Alert on high percentage of 400 errors for a particular endpoint - ahmadsherif
- 2020-04-20T11:00:15Z - Consider adding a QA step after running DB migration on staging - unassigned
- 2020-04-16T15:36:30Z - Improve investigative techniques using correlation ID, tracing, and APM - AnthonySandoval
- 2020-04-16T13:53:03Z - Probe Google Cloud API for expected responses from key cloud service dependencies - unassigned
Open Issue Stats
- Oncall issues : 5
- Change issues : 1
- Incident issues : 2
- Access Request : 4
- CorrectiveAction : 87
Open Change Issues
Created | Assignee | Summary |
---|---|---|
2020-04-14T20:25:05Z | cindy | Migrate large projects off file-25-stor-gprd to file-01-stor-gprd |
Open Incident Issues
Created | Assignee | Summary |
---|---|---|
2020-04-20T23:26:27Z | unassigned | Firing 1 - Gitaly error rate is too high: 11.15 |
2020-03-17T14:00:31Z | AnthonySandoval | Inconsistencies between responses returned in Grafana, Prometheus and Thanos |
Open Oncall Issues
Created | Assignee | Summary |
---|---|---|
2020-03-30T13:38:11Z | brentnewton | jobs.gitlab.com cert expired unnoticed on 2020-03-28 |
2020-03-30T09:26:36Z | hphilipps | RCA: 2020-03-30 Database failover and loss of sync to replicas |
2020-03-23T23:43:57Z | ggillies | Manually remove project |
2020-01-15T20:57:26Z | devin | Tracking state of mod security on version.gitlab.com for WAF Troubleshooting |
2019-10-23T13:05:14Z | cmcfarland | cleanup registered nodes in chef |
This issue was automatically generated using oncall-robot-assistant