OnCall report for period: 2020-03-24 - 2020-03-31

Oncall during this period

Schedule	Username
SRE 8 Hour	Amar Amarsanaa
SRE 8 Hour	Henri Philipps
SRE 8 Hour	Craig Miskell
SRE 8 Hour	Nels Nelson

PagerDuty Incidents

* Number of incidents: **80**

Created	Summary
2020-03-24T06:01:06Z	[19086] Firing 1 - Gitaly error rate is too high: 21.11
2020-03-24T18:31:06Z	[19099] Firing 1 - HPA unable to scale up
2020-03-24T19:39:28Z	[19100] Firing 1 - Last WALE backup was seen 20m 10s ago.
2020-03-24T22:28:56Z	[19107] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
2020-03-25T02:42:14Z	[19118] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down
2020-03-25T03:13:56Z	[19120] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down
2020-03-25T03:24:26Z	[19121] Firing 1 - postgres-dr-archive-01-db-gprd.c.gitlab-production.internal postgres service appears down
2020-03-25T12:46:50Z	[19128] Firing 1 - Increased Error Rate Across Fleet
2020-03-25T15:23:08Z	[19130] Firing 1 - prometheus is restarting frequently
2020-03-25T15:45:22Z	[19131] Firing 1 - prometheus is unreachable
2020-03-25T18:00:08Z	[19134] Firing 2 - PrometheusUnreachable
2020-03-25T18:09:37Z	[19135] Firing 1 - Prometheus not connected to any Alertmanagers
2020-03-25T18:09:38Z	[19136] Firing 1 - Prometheus not connected to any Alertmanagers
2020-03-25T18:39:58Z	[19138] Firing 1 - prometheus is restarting frequently
2020-03-25T18:48:37Z	[19139] Firing 1 - prometheus is restarting frequently
2020-03-25T19:02:29Z	[19140] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
2020-03-26T16:12:06Z	[19169] Firing 2 - AlertmanagerNotificationsFailing
2020-03-26T16:12:06Z	[19170] Firing 1 - Alertmanager is failing sending notifications
2020-03-26T18:46:57Z	[19173] Need help rotating creds for gitlab-superuser (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9623)
2020-03-27T00:33:01Z	[19177] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2020-03-27T00:43:16Z	[19179] Firing 1 - Redis master missing for redis-sidekiq
2020-03-27T00:43:30Z	[19180] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2020-03-28T11:10:14Z	[19210] Pingdom check check:https://gitlab.com/gitlab-com/gitlab-com-infrastructure/tree/master is down
2020-03-28T11:10:41Z	[19211] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/tree/master is down
2020-03-28T11:10:47Z	[19212] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down
2020-03-28T11:10:59Z	[19213] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/merge_requests/ is down
2020-03-28T11:12:32Z	[19214] Pingdom check check:https://gitlab.com/projects/new is down
2020-03-28T11:13:02Z	[19215] Pingdom check check:https://gitlab.com/ is down
2020-03-28T11:13:15Z	[19216] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/issues is down
2020-03-28T11:13:44Z	[19217] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
2020-03-28T11:14:05Z	[19218] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down
2020-03-28T11:24:14Z	[19220] Firing 1 - GitLab.com is down for 2 minutes
2020-03-28T11:24:14Z	[19221] Firing 1 - GitLab.com is down for 2 minutes
2020-03-28T12:20:15Z	[19224] Firing 1 - Chef client failures have reached critical levels
2020-03-28T12:25:14Z	[19225] Firing 1 - GitLab.com is down for 2 minutes
2020-03-28T12:25:14Z	[19226] Firing 1 - GitLab.com is down for 2 minutes
2020-03-28T13:22:06Z	[19228] Firing 1 - Gitaly error rate is too high: 15.18
2020-03-28T13:54:54Z	[19229] Firing 1 - Last WALE backup was seen 20m 0s ago.
2020-03-28T19:32:46Z	[19240] We need to do something about those 429s.
2020-03-28T22:04:07Z	[19246] Firing 1 - High 4xx Error Rate on Front End Web
2020-03-28T22:14:07Z	[19247] Firing 1 - High 4xx Error Rate on Front End Web
2020-03-28T22:21:07Z	[19248] Firing 1 - High 4xx Error Rate on Front End Web
2020-03-29T05:47:08Z	[19261] Firing 1 - 5% disk space left
2020-03-29T15:23:08Z	[19266] Firing 1 - Gitaly error rate is too high: 15.42
2020-03-30T00:34:28Z	[19274] Managerial assistance/authority for CloudFlare issues
2020-03-30T03:39:54Z	[19279] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2020-03-30T03:49:47Z	[19280] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2020-03-30T03:59:54Z	[19281] Firing 1 - Less than 100% of sentinel processes running in the redis-sidekiq cluster
2020-03-30T04:09:44Z	[19282] Firing 10 - PostgreSQL_XLOGConsumptionTooLow
2020-03-30T04:12:56Z	[19283] Firing 10 - PostgreSQL_ReplicationLagBytesTooLarge
2020-03-30T04:15:11Z	[19284] Firing 10 - PostgreSQL_ReplicationLagTooLarge
2020-03-30T04:16:26Z	[19286] Firing 10 - PostgreSQL_CommitRateTooLow
2020-03-30T04:16:50Z	[19287] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m
2020-03-30T04:28:20Z	[19291] Firing 1 - Large number of overdue pull mirror jobs
2020-03-30T04:31:51Z	[19292] Firing 1 - API latency on GitLab.com has been over 1500ms during the last 5m
2020-03-30T04:32:06Z	[19293] Firing 2 - IncreasedErrorRateOtherBackends
2020-03-30T04:35:25Z	[19295] DB outage in progress
2020-03-30T04:35:51Z	[19296] Firing 1 - Git latency on GitLab.com has been over 450ms during the last 5m
2020-03-30T04:37:12Z	[19297] Firing 10 - PostgreSQL_UnusedReplicationSlot
2020-03-30T04:44:46Z	[19298] P1 production incident
2020-03-30T04:53:51Z	[19299] Firing 1 - Git latency on GitLab.com has been over 450ms during the last 5m
2020-03-30T04:54:15Z	[19300] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
2020-03-30T05:34:11Z	[19302] Firing 1 - patroni-09-db-gprd.c.gitlab-production.internal postgres service appears down
2020-03-30T05:41:26Z	[19303] Firing 9 - PostgreSQL_CommitRateTooLow
2020-03-30T05:41:51Z	[19304] Firing 1 - Web latency on GitLab.com has been over 2s during the last 5m
2020-03-30T05:49:12Z	[19309] Firing 1 -
2020-03-30T05:59:13Z	[19310] Firing 2 - PostgreSQL_ServiceDown
2020-03-30T06:06:56Z	[19312] Firing 5 - PostgreSQL_CommitRateTooLow
2020-03-30T06:27:41Z	[19314] Firing 2 - PostgreSQL_CommitRateTooLow
2020-03-30T09:23:11Z	[19318] Firing 3 - PostgreSQL_ReplicaStaleXmin
2020-03-30T10:11:14Z	[19319] Firing 1 - Chef client failures have reached critical levels
2020-03-30T11:03:31Z	[19320] Rotate gitlab-replicator Credentials Found in patroni.yml Potentially Leaked in Issue
2020-03-30T12:06:58Z	[19323] Firing 1 - Chef client failures have reached critical levels
2020-03-30T13:49:46Z	[19325] Firing 1 - Redis cluster redis-cache is missing instances
2020-03-30T21:01:14Z	[19340] Firing 4 - ChefClientErrorCritical
2020-03-30T21:02:01Z	[19341] Firing 2 - ChefClientErrorCritical
2020-03-31T02:00:12Z	[19344] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
2020-03-31T03:17:56Z	[19346] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
2020-03-31T05:01:28Z	[19347] Firing 4 - ChefClientErrorCritical
2020-03-31T05:02:15Z	[19348] Firing 2 - ChefClientErrorCritical
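
To see how the pages above are distributed over the week, the rows can be tallied per UTC day. The snippet below is only a sketch and is not part of oncall-robot-assistant: the function name `pages_per_day` and the three-row `sample` list are illustrative assumptions, and it assumes each row keeps the `Created<TAB>Summary` layout used in the table.

```python
from collections import Counter


def pages_per_day(rows):
    """Count PagerDuty rows per UTC calendar day.

    Assumes every row looks like 'Created<TAB>Summary', where Created is an
    ISO 8601 timestamp such as 2020-03-24T06:01:06Z.
    """
    counts = Counter()
    for row in rows:
        created, _, _summary = row.partition("\t")
        counts[created[:10]] += 1  # keep only the YYYY-MM-DD prefix
    return counts


# Three rows copied from the table above, used purely as sample input.
sample = [
    "2020-03-24T06:01:06Z\t[19086] Firing 1 - Gitaly error rate is too high: 21.11",
    "2020-03-24T18:31:06Z\t[19099] Firing 1 - HPA unable to scale up",
    "2020-03-30T04:35:25Z\t[19295] DB outage in progress",
]

for day, n in sorted(pages_per_day(sample).items()):
    print(day, n)
```

Feeding the full table through the same function would give a per-day breakdown of the 80 incidents.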

7 Day Issue Stats

Oncall issues : 7
Access Request : 1
Change Issues : 4
Incident Issues : 20
CorrectiveAction Issues : 0

Change Issues

2020-03-29T18:10:32Z - Correct X-Forwarded-For header either at haproxy config level or in cloudflare - nnelson
2020-03-28T22:44:02Z - Revert rate-limiting settings for haproxy - nnelson
2020-03-28T19:39:37Z - Disable rate limiting on HAProxy - nnelson
2020-03-24T15:14:32Z - Delete remaining projects without hashed storage feature which are in the pending delete state - nnelson

Incident Issues

2020-03-30T17:01:13Z - 2020-03-30: The sidekiq service (main stage) has a apdex score (latency) below SLO - nnelson | ~S3 | ServiceSidekiq | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1872
2020-03-30T14:33:23Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1870
2020-03-30T14:33:11Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1869
2020-03-30T14:33:06Z - Accidental reboot of redis-cache-03, followed by failover of redis-cache-01 - unassigned | ~S4 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1868
2020-03-30T04:19:22Z - 2020-03-30 Database failover and loss of sync to replicas - cmiskell | ~S1 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1865
2020-03-29T15:29:05Z - 2020-03-29: Gitaly error rate on `nfs-file46` is too high: 15.42 - nnelson | ~S4 | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1859
2020-03-29T06:22:41Z - 2020-03-29 /var/log almost full on git-09-sv-gprd - cmiskell | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1858
2020-03-28T22:24:51Z - 2020-03-28: High 4xx Error Rate on Front End Web - nnelson | ~S4 | ServiceHAProxy | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1856
2020-03-28T13:34:32Z - High CPU usage on file-46 and gitaly error rate above SLO - unassigned | ~S3 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1854
2020-03-27T13:00:35Z - ExternalDiffUploader throwing 500s - unassigned | ~S3 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1852
2020-03-26T16:19:32Z - 2020-03-24: Alertmanager is seeing errors for integration webhook & ci-runners service (main stage) has a apdex score (latency) below SLO - nnelson | ~S4 | ServiceCI Runners | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1846
2020-03-26T06:48:34Z - 2020-03-26 The sidekiq service (main stage) has a apdex score (latency) below SLO - cmiskell | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1842
2020-03-25T15:46:13Z - 2020-03-25: Prometheus is restarting frequently & Prometheus is unreachable - nnelson | ~S4 | ServicePrometheus | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1838
2020-03-25T11:57:36Z - Massive API requests to a single endpoint causing SLO alerts - unassigned | ~S3 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1832
2020-03-24T22:33:28Z - 2020-03-24: Postgres Replication lag is over 3 hours on archive recovery replica - nnelson | ~S4 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1827
2020-03-24T20:08:57Z - 2020-03-24: Last WALE backup was seen 20m 10s ago - nnelson | ~S4 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1826
2020-03-24T18:40:20Z - 2020-03-24: HPA unable to scale up - nnelson | ~S4 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1825
2020-03-24T18:23:03Z - 2020-03-24: Potential password spraying activity - nnelson | ~S4 | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1824
2020-03-24T13:58:33Z - Elevated API request rate leading to SLO violations - hphilipps | ~S3 | ServiceAPI | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1821
2020-03-24T06:22:03Z - 2020-03-24: Gitaly error rate is too high: 21.11 - cmiskell | | | https://gitlab.com/gitlab-com/gl-infra/production/-/issues/1819

CorrectiveAction Issues

2020-03-25T19:37:28Z - Split cookbook publishing MRs - alejandro

Open Issue Stats

Open Change Issues

Created	Assignee	Summary
2020-03-24T15:14:32Z	nnelson	Delete remaining projects without hashed storage feature which are in the pending delete state
2020-01-14T23:08:17Z	nnelson	Migrate large projects off file-35-stor-gprd to file-45-stor-gprd

Open Incident Issues

Created	Assignee	Summary
2020-03-22T05:52:53Z	ggillies	Disk filling up on web-33-sv-gprd.c.gitlab-production.internal
2020-03-17T14:00:31Z	bjk-gitlab	Inconsistencies between responses returned in Grafana, Prometheus and Thanos

Open Oncall Issues

Created	Assignee	Summary
2020-03-30T13:38:11Z	brentnewton	jobs.gitlab.com cert expired unnoticed on 2020-03-28
2020-03-30T09:26:36Z	hphilipps	RCA: 2020-03-30 Database failover and loss of sync to replicas
2020-03-27T21:48:14Z	cynthia	Import request (for SCGTS): q2c (itimsxgen)
2020-03-27T21:43:04Z	cynthia	Import request (for SCGTS): q2c (webapps)
2020-03-26T19:51:21Z	unassigned	Fix access scopes on postgres-dr-delayed-01-db-gprd
2020-03-24T17:05:33Z	unassigned	Project deletion required
2020-03-23T23:43:57Z	unassigned	Manually remove project
2020-01-15T20:57:26Z	devin	Tracking state of mod security on version.gitlab.com for WAF Troubleshooting
2019-10-23T13:05:14Z	cmcfarland	cleanup registered nodes in chef

This issue was automatically generated using oncall-robot-assistant

Edited Mar 31, 2020 by AnthonySandoval