On-Call Handover 2022-02-19 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover

Summary:

There were three incidents during my shift.
- 2022-02-19 - no backup in 30+ hours on patroni-ci (production#6387 - closed): the alert was for a service that is not in use, so I silenced it.
- 2022-02-19: sshServices and rails_requests SLO ... (production#6388 - closed): this appears to have been another occurrence of the error tracking service problem. I greatly increased the rate limit, which should keep it from happening again.
- 2022-02-19: Apdex drop for web-pages and google... (production#6389 - closed) appears to have been a one-time spike in traffic. The service recovered on its own.

What (if any) time-critical work is being handed over?

What contextual info may be useful for the next few on-call shifts?

Ongoing alerts/incidents:

- production#6389 (closed) - 2022-02-19: Apdex drop for web-pages and google-cloud-storage
- production#6388 (closed) - 2022-02-19: sshServices and rails_requests SLO violations
- production#6387 (closed) - 2022-02-19 - no backup in 30+ hours on patroni-ci
- production#6381 (closed) - 2022-02-18: Creation timestamp of container images shown as null in the UI/API
- production#6368 (closed) - 2022-02-16: "walg-basebackup" has failed
- production#6348 (closed) - 2022-02-14: The goserver SLI of the gitaly service on node file-54-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#6338 (closed) - 2022-02-12: Increased latency from us-east1-d for GCS buckets

Resolved alerts/incidents:

- production#6389 (closed) - 2022-02-19: Apdex drop for web-pages and google-cloud-storage
- production#6388 (closed) - 2022-02-19: sshServices and rails_requests SLO violations
- production#6387 (closed) - 2022-02-19 - no backup in 30+ hours on patroni-ci
- https://gitlab.pagerduty.com/incidents/Q084Q5JGT7MIUA - [#62639] Firing 1 - Last successful WAL-G basebackup was seen 30.086102222204207 hours ago for env gprd.
- https://gitlab.pagerduty.com/incidents/Q29SLW3NFIZS8F - [#62645] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
- https://gitlab.pagerduty.com/incidents/Q2SGOYNCLKCZEL - [#62669] Firing 2 - BlackboxProbeFailures

Mitigated incidents:


Unactionable alerts:


Change issues:
