On-Call Handover 2021-12-25 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @mwasilewski-gitlab
- EOC ingress: @T4cC0re
📖 Summary:
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
🔴 Ongoing alerts/incidents:
- production#6120 (closed) - 2021-12-25 PostgreSQL errors: could not open relation with OID 1375605314
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6117
- production#6111 (closed) - 2021-12-23 The proxy SLI of the praefect service (main stage) has an apdex violating SLO
- production#6109 (closed) - 2021-12-22 Error pushing Windows container registry images
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6086
- production#5657 (closed) - 2021-10-06 Some interrupted Sidekiq jobs going missing
✅ Resolved alerts/incidents:
🔵 Mitigated incidents:
Collapsed for your convenience
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6118
- production#6106 (closed) - 2021-12-22 Slack api issues
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6090
- production#6079 (closed) - 2021-12-15: Firing 1 - The shard_urgent_cpu_bound SLI of the sidekiq service (main stage) has an apdex violating SLO
- production#6060 (closed) - 2021-12-13: Blackbox probes for https://staging.gitlab.com/users/sign_in are failing.
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/6048
- production#6025 (closed) - 2021-12-07: Alertmanager is failing sending notifications
- production#6021 (closed) - 2021-12-06: traffic cessation alerts for several SLIs
- production#6016 (closed) - 2021-12-04: The sentry_events SLI of the sentry service (main stage) has an apdex violating SLO
- production#6009 (closed) - 2021-12-03: Long-running transactions detected on Patroni
- production#5980 (closed) - 2021-11-28: Chef client failures have reached critical levels
- production#5974 (closed) - 2021-11-25: Chef client failures due to cert expiry on repo.iovisor.org
- production#5952 (closed) - 2021-11-22: Increased error rates across gitlab.com WEB and API services
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5913
- production#5829 (closed) - 2021-10-28: Elasticsearch metric stopped being scraped
- production#5812 (closed) - 2021-10-27: Containers for the git service, main are unable to start
- production#5804 (closed) - 2021-10-26: The goserver SLI of the gitaly service on node file-32-stor-gprd.c.gitlab-production.internal has an error rate violating SLO
- production#5754 (closed) - 2021-10-19: The fluentd_log_output SLI of the logging service (main stage) has an error rate violating SLO
- production#5737 (closed) - 2021-10-17: Long db transactions from sidekiq job Ci::CreateDownstreamPipelineWorker
- production#5736 (closed) - 2021-10-17: thanos is restarting frequently