On-Call Handover 2021-09-14 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover

Summary:

What (if any) time-critical work is being handed over?
Nothing apart from the k8s chart rollout, which you are already aware of :)
What contextual info may be useful for the next few on-call shifts?

Ongoing alerts/incidents:


Resolved alerts/incidents:

- production#5539 (closed) - 2021-09-14: Elevated error rates for GitLab.com
- https://gitlab.pagerduty.com/incidents/PDLEPF1 - [#60422] Pingdom check check:gitlab-org/gitlab-foss#1 (closed) is down
- https://gitlab.pagerduty.com/incidents/PK7Z5O7 - [#60423] Firing 3 - BlackboxProbeFailures

Mitigated incidents:

- production#5539 (closed) - 2021-09-14: Elevated error rates for GitLab.com
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5535
- production#5530 (closed) - 2021-09-14 ops.gitlab.net certificate inconsistencies
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5523
- production#5520 (closed) - 2021-09-13: The grpc_requests SLI of the kas service (main stage) has an error rate violating SLO
- production#5474 (closed) - 2021-09-03: The imagescaler SLI of the web service in region us-east1-d has an apdex violating SLO
- production#5462 (closed) - 2021-09-02: Apdex dip on Gitaly node file-praefect-02
- production#5438 (closed) - 2021-08-30: The gitlab_zone SLI of the waf service (main stage) has an error rate violating SLO
- production#5245 (closed) - 2021-07-28: Frequent dead man's snitch failures on database backup non-existence
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes

Unactionable alerts:


Change issues:

In Progress
- production#5481 (closed) - Upgrade OS on Consul Nodes [GSTG]
Closed
- production#5503 (closed) - 2021-09-08: Update postgres backup buckets retention policies
- production#5393 (closed) - [gprd] Enable pg_stat_statements and pg_stat_activity fluentd plugin on a single patroni node (the backup node)