OnCall report for period: 2020-02-25 - 2020-03-03

Oncall during this period

Schedule	Username
SRE 8 Hour	Alejandro Rodriguez
SRE 8 Hour	Devin Sylva
SRE 8 Hour	Craig Barrett
SRE 8 Hour	Ben Kochie

PagerDuty Incidents

* Number of incidents: **25**

Created	Summary
2020-02-25T18:59:18Z	[17888] Firing 1 - CPU use percent is extremely high on fe-17-lb-gprd.c.gitlab-production.internal for the past 2 hours.
2020-02-25T19:47:56Z	[17889] Firing 1 - Postgres Replication lag is over 3 hours on archive recovery replica
2020-02-25T23:29:13Z	[17892] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours)
2020-02-26T00:30:38Z	[17893] Firing 1 - SSL certificate for https://packages.gitlab.com expires in 23h 29m 58s
2020-02-26T15:11:18Z	[17908] Firing 1 - CPU use percent is extremely high on fe-24-lb-gprd.c.gitlab-production.internal for the past 2 hours.
2020-02-26T15:16:21Z	[17909] Firing 1 - Alertmanager is failing sending notifications
2020-02-26T15:31:23Z	[17911] Firing 1 - Alertmanager is failing sending notifications
2020-02-26T15:36:22Z	[17912] Firing 2 - AlertmanagerNotificationsFailing
2020-02-27T11:38:15Z	[17944] AWS Key Compromise - Coin Mining
2020-02-28T03:11:14Z	[17979] Firing 1 - Last WALE backup was seen 18m 44s ago.
2020-02-28T14:09:14Z	[17990] Firing 1 - CPU use percent is extremely high on fe-24-lb-gprd.c.gitlab-production.internal for the past 2 hours.
2020-02-29T08:14:11Z	[18006] Firing 1 - Postgres Replication lag is over 9 hours on delayed replica (normal is 8 hours)
2020-02-29T09:56:00Z	[18011] Firing 1 - Connection of Redis replicas to the master is flapping
2020-02-29T15:02:14Z	[18024] S1/P1 RCE found impacting GitLab.com
2020-03-01T07:37:15Z	[18045] Firing 1 - CPU use percent is extremely high on influxdb-01-inf-gprd.c.gitlab-production.internal for the past 2 hours.
2020-03-01T20:34:56Z	[18066] Pingdom check check:https://gitlab.com/gitlab-org/gitlab-foss/ is down
2020-03-02T10:45:12Z	[18081] Firing 1 - sentry.gitlab.net is down
2020-03-02T16:15:21Z	[18087] Firing 1 - Alertmanager is failing sending notifications
2020-03-02T16:15:50Z	[18088] Firing 1 - Alertmanager is failing sending notifications
2020-03-02T23:03:58Z	[18089] Firing 1 - Chef client failures have reached critical levels
2020-03-02T23:09:30Z	[18091] Firing 1 - Chef client failures have reached critical levels
2020-03-02T23:29:31Z	[18092] Firing 8 - ChefClientErrorCritical
2020-03-02T23:33:58Z	[18094] Firing 25 - ChefClientErrorCritical
2020-03-03T03:08:15Z	[18099] Firing 1 - CPU use percent is extremely high on pubsub-workhorse-inf-gprd.c.gitlab-production.internal for the past 2 hours.
2020-03-03T05:32:18Z	[18100] Firing 1 - CPU use percent is extremely high on pubsub-workhorse-inf-gprd.c.gitlab-production.internal for the past 2 hours.

7 Day Issue Stats

Oncall issues : 0
Access Request : 0
Change Issues : 0
Incident Issues : 7
CorrectiveAction Issues : 0

Change Issues

Incident Issues

2020-02-29T16:47:36Z - 2020-02-29: S1/P1 security incident - unassigned | ~S1 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/issues/1718
2020-02-28T15:17:08Z - 2020-02-28: haproxy is saturating cpu on fe-[17,24] - alejandro | ~S3 | ServiceHAProxy | https://gitlab.com/gitlab-com/gl-infra/production/issues/1716
2020-02-27T15:34:39Z - 2020-02-27 The web service, workhorse component, main stage, has an error burn-rate exceeding SLO - alejandro | ~S3 | ServiceWeb | https://gitlab.com/gitlab-com/gl-infra/production/issues/1712
2020-02-26T22:52:31Z - Unable to push to repositories: rpc error: code = Unknown desc = invalid correlation ID - unassigned | | ServiceGitaly | https://gitlab.com/gitlab-com/gl-infra/production/issues/1710
2020-02-26T15:45:35Z - 2020-02-26 Alertmanager is failing sending notifications - alejandro | ~S4 | ~"Service::ELK" | https://gitlab.com/gitlab-com/gl-infra/production/issues/1707
2020-02-25T20:17:07Z - 2020-02-25: Postgres Replication lag is over 3 hours on archive recovery replica - unassigned | ~S4 | ServicePostgres | https://gitlab.com/gitlab-com/gl-infra/production/issues/1703
2020-02-25T19:05:34Z - 2020-02-25: CPU use percent extremely high on fe-17-lb-gprd.c.gitlab-production.internal - alejandro | ~S3 | ServiceHAProxy | https://gitlab.com/gitlab-com/gl-infra/production/issues/1702

CorrectiveAction Issues

2020-02-28T20:16:56Z - Document Cloud Provider Abuse Reporting Procedures - unassigned
2020-02-26T17:11:36Z - replay in staging performance tests that led to high cpu on redis - unassigned
2020-02-25T18:09:06Z - Consider using RedisInsight for better monitoring and profiling of Redis - unassigned

Open Issue Stats

Open Change Issues

Created	Assignee	Summary
2019-10-16T14:37:43Z	nnelson	Migrate large projects off file-33-stor-gprd to file-43-stor-gprd
2019-10-15T15:13:30Z	nnelson	Migrate large projects off file-34-stor-gprd to file-44-stor-gprd

Open Incident Issues

Created	Assignee	Summary
2020-02-29T16:47:36Z	unassigned	2020-02-29: S1/P1 security incident
2020-02-25T20:17:07Z	unassigned	2020-02-25: Postgres Replication lag is over 3 hours on archive recovery replica

Open Oncall Issues

Created	Assignee	Summary
2020-02-13T23:56:43Z	unassigned	Import request (for red61): via-server
2020-02-12T16:00:37Z	dawsmith	dev.gitlab.org - Admins Export
2020-01-16T06:07:03Z	aamarsanaa	Incremental rollout for the Pages new API based config source
2020-01-15T20:57:26Z	devin	Tracking state of mod security on version.gitlab.com for WAF Troubleshooting
2019-10-23T13:05:14Z	cmcfarland	cleanup registered nodes in chef

This issue was automatically generated using oncall-robot-assistant