
On-Call Handover 2022-02-19 23:00 UTC

On-Call Handover

Brought to you by the Slack slash command: /sre-oncall handover

EOC egress: @alex
EOC ingress: @cindy

Summary:

There were three incidents during my shift.

2022-02-19 - no backup in 30+ hours on patroni-ci (production#6387 - closed): the alert turned out to be for a service that is not in use, so I silenced it.
2022-02-19: sshServices and rails_requests SLO ... (production#6388 - closed): this appears to have been another occurrence of the error tracking service issue. I greatly increased the rate limit, which should keep it from happening again.
2022-02-19: Apdex drop for web-pages and google... (production#6389 - closed): this appears to have been a one-time spike in traffic. The service recovered on its own.

What (if any) time-critical work is being handed over?

What contextual info may be useful for the next few on-call shifts?

Ongoing alerts/incidents:

production#6389 (closed) - 2022-02-19: Apdex drop for web-pages and google-cloud-storage
production#6388 (closed) - 2022-02-19: sshServices and rails_requests SLO violations
production#6387 (closed) - 2022-02-19 - no backup in 30+ hours on patroni-ci
production#6381 (closed) - 2022-02-18: Creation timestamp of container images shown as null in the UI/API
production#6368 (closed) - 2022-02-16: "walg-basebackup" has failed
production#6348 (closed) - 2022-02-14: The goserver SLI of the gitaly service on node file-54-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
production#6338 (closed) - 2022-02-12: Increased latency from us-east1-d for GCS buckets

Resolved alerts/incidents:

production#6389 (closed) - 2022-02-19: Apdex drop for web-pages and google-cloud-storage
production#6388 (closed) - 2022-02-19: sshServices and rails_requests SLO violations
production#6387 (closed) - 2022-02-19 - no backup in 30+ hours on patroni-ci
https://gitlab.pagerduty.com/incidents/Q084Q5JGT7MIUA - [#62639] Firing 1 - Last successful WAL-G basebackup was seen 30.086102222204207 hours ago for env gprd.
https://gitlab.pagerduty.com/incidents/Q29SLW3NFIZS8F - [#62645] Pingdom check check:https://gitlab.com/api/v4/projects/13083 is down
https://gitlab.pagerduty.com/incidents/Q2SGOYNCLKCZEL - [#62669] Firing 2 - BlackboxProbeFailures

Mitigated incidents:

Unactionable alerts:

Change issues:

In Progress

Closed

Edited 3 years ago by Alex Hanselka


Activity

ops-gitlab-net added SRE:On-Call label 3 years ago

Alex Hanselka changed the description 3 years ago

Alex Hanselka assigned to @cindy 3 years ago

Cindy Pallares 🦉 closed 3 years ago

Cindy Pallares 🦉 marked this issue as related to #2501 (closed) 3 years ago

Cindy Pallares 🦉 marked this issue as related to #2503 (closed) 3 years ago

