On-Call Handover 2021-10-31 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @igorwwwwwwwwwwwwwwwwwwww
- EOC ingress: @T4cC0re
📖 Summary:
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
🔴 Ongoing alerts/incidents:
- production#5834 (closed) - 2021-10-31: SSLCertExpiresVerySoon - runners prometheus certs expiring in 7 days
- production#5833 (closed) - 2021-10-29: release.gitlab.net was briefly unavailable
- production#5817 (closed) - 2021-10-27: Thanos persistent volume slowly filling up
- production#5757 (closed) - 2021-10-18: GitLab.com notifications delayed
- production#5754 (closed) - 2021-10-19: The fluentd_log_output SLI of the logging service (main stage) has an error rate violating SLO
- production#5745 (closed) - 2021-10-18: Sidekiq delays for urgent-other shard
- production#5730 (closed) - 2021-10-15: The server_route_manifest_writes SLI of the registry service in region us-east1-c has an apdex violating SLO
- production#5657 (closed) - 2021-10-06 Some interrupted Sidekiq jobs going missing
- https://gitlab.pagerduty.com/incidents/Q0G6FX1NWU6C59 - [#61039] Firing 2 - SSLCertExpiresVerySoon
✅ Resolved alerts/incidents:
- production#5834 (closed) - 2021-10-31: SSLCertExpiresVerySoon - runners prometheus certs expiring in 7 days
🔵 Mitigated incidents:
Collapsed for your convenience
- production#5832 (closed) - 2021-10-29: Unexplained frontend SLO alert
- production#5829 (closed) - 2021-10-28: Elasticsearch metric stopped being scraped
- production#5819 (closed) - 2021-10-27: GSTG Deploy failure due to Sidekiq Ruby Failure
- production#5812 (closed) - 2021-10-27: Containers for the git service, main are unable to start
- production#5809 (closed) - 2021-10-27: Long-running transactions detected on Patroni
- production#5808 (closed) - 2021-10-26: Repository mirror update delays
- production#5804 (closed) - 2021-10-26: The goserver SLI of the gitaly service on node file-32-stor-gprd.c.gitlab-production.internal has an error rate violating SLO
- production#5797 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit
- production#5793 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- production#5773 (closed) - 2021-10-20: KubeServiceClusterScaleupsErrorSLOViolation
- production#5738 (closed) - 2021-10-18 Gitaly is down on file-58-stor-gprd.c.gitlab-production.internal
- production#5737 (closed) - 2021-10-17: Long db transactions from sidekiq job Ci::CreateDownstreamPipelineWorker
- production#5736 (closed) - 2021-10-17: thanos is restarting frequently
- production#5734 (closed) - 2021-10-16 Transactions detected that have been running on patroni-v12-05-db-gprd.c.gitlab-production.internal for more than 10m
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5723
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5720
- production#5712 (closed) - 2021-10-13: QA smoke test failing on gstg
- production#5709 (closed) - 2021-10-12: Apdex drop for Gitaly node file-43
- production#5696 (closed) - 2021-10-11: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5692