On-Call Handover 2021-04-29 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
Summary:
- Relentless miner bots.
- No database failover today, so I got that going for me, which is nice.
- A new ZFS-based Patroni node has been added to our production database fleet, ultimately to serve as a low-latency replica for the data warehouse. It replicates only from patroni-08 and should not have any effect on the production main stage database cluster.
- The dead tuples alert keeps firing, so watch out for all these dead tuples, okay? Seriously though, OnGres says there is nothing to worry about, and they seem correct: graphs over the past week show everything within normal parameters.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#4426 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- production#4423 (closed) - 2021-04-29: PostgreSQL dead tuples is too large
- production#4421 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- production#4415 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4413
- production#4411 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- production#4408 (closed) - 2021-04-29: Thanos Rule has high number of evaluation warnings
- production#4407 (closed) - 2021-04-29: Thanos Rule has high number of evaluation warnings
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4403
- production#4389 (closed) - 2021-04-28: PostgreSQL dead tuples is too large
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/4382
- production#4381 (closed) - 2021-04-28: prometheus-metamon has an invalid config
- production#4378 (closed) - 2021-04-28: Prometheus has slow rule evaluations
- production#4329 (closed) - 2021-04-24: postgres-dr-archive-01-db-gprd rebooted & is delayed by a few days
- production#4301 (closed) - 2021-04-22: Thanos Rule has high number of evaluation warnings
- production#4277 (closed) - 2021-04-19: Archive replica delayed by over 5 days
- production#4276 (closed) - 2021-04-19 - stop gitlab-exporter in all the patroni nodes
- production#4253 (moved) - 2021-04-15: CF 520 during large generic package upload
- https://gitlab.pagerduty.com/incidents/PEZ6DSA - [#41618] Firing 1 - patroni-zfs-01-db-gprd.c.gitlab-production.internal postgres service appears down
- https://gitlab.pagerduty.com/incidents/PBNRFVE - [#41619] Firing 1 - Postgres exporter is showing errors for the last hour
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PWK42MR - [#41614] Please see incident declaration in Slack channel: https://slack.com/app_redirect?channel=CB7P5CJS1&team=T02592416
- https://gitlab.pagerduty.com/incidents/PRAHAR9 - [#41617] Firing 1 - PostgreSQL dead tuples is too large
Unactionable alerts:
Resolved production incidents:
- production#4427 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- production#4422 (closed) - 2021-04-29: Prometheus has slow rule evaluations
- production#4418 (closed) - 2021-04-29: x509: certificate signed by unknown authority