On-Call Handover 2021-10-27 23:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
📖 Summary:
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
🔴 Ongoing alerts/incidents:
- production#5819 (closed) - 2021-10-27: GSTG Deploy failure due to Sidekiq Ruby Failure
- production#5817 (closed) - 2021-10-27: Thanos persistent volume slowly filling up
- production#5813 (closed) - 2021-10-27: 500 short spike across web services
- production#5808 (closed) - 2021-10-26: Repository mirror update delays
- production#5757 (closed) - 2021-10-18: GitLab.com notifications delayed
- production#5754 (closed) - 2021-10-19: The fluentd_log_output SLI of the logging service (main stage) has an error rate violating SLO
- production#5745 (closed) - 2021-10-18: Sidekiq delays for urgent-other shard
- production#5730 (closed) - 2021-10-15: The server_route_manifest_writes SLI of the registry service in region us-east1-c has an apdex violating SLO
- production#5657 (closed) - 2021-10-06 Some interrupted Sidekiq jobs going missing
- https://gitlab.pagerduty.com/incidents/Q2N3Z42C1YKYRU - [#61003] Firing 1 - The Kube Persistent Volume Claim Space Utilisation resource of the kube service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
✅ Resolved alerts/incidents:
- production#5819 (closed) - 2021-10-27: GSTG Deploy failure due to Sidekiq Ruby Failure
- production#5817 (closed) - 2021-10-27: Thanos persistent volume slowly filling up
🔵 Mitigated incidents:
Collapsed for your convenience
- production#5812 (closed) - 2021-10-27: Containers for the git service,main are unable to start
- production#5809 (closed) - 2021-10-27: Long-running transactions detected on Patroni
- production#5804 (closed) - 2021-10-26: The goserver SLI of the gitaly service on node file-32-stor-gprd.c.gitlab-production.internal has an error rate violating SLO
- production#5797 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit
- production#5793 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- production#5773 (closed) - 2021-10-20: KubeServiceClusterScaleupsErrorSLOViolation
- production#5744 (closed) - 2021-10-18 GraphQL performance issue in Vulnerability Reports
- production#5738 (closed) - 2021-10-18 Gitaly is down on file-58-stor-gprd.c.gitlab-production.internal
- production#5737 (closed) - 2021-10-17: Long db transactions from sidekiq job Ci::CreateDownstreamPipelineWorker
- production#5736 (closed) - 2021-10-17: thanos is restarting frequently
- production#5734 (closed) - 2021-10-16 Transactions detected that have been running on patroni-v12-05-db-gprd.c.gitlab-production.internal for more than 10m
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5723
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5720
- production#5712 (closed) - 2021-10-13: QA smoke test failing on gstg
- production#5709 (closed) - 2021-10-12: Apdex drop for Gitaly node file-43
- production#5696 (closed) - 2021-10-11: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5692
- production#5671 (closed) - 2021-10-07: about.gitlab.com missing CSS because of NPM registry issue
- production#5583 (closed) - 2021-09-23: The goserver SLI of the gitaly service on node file-43-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5542 (closed) - 2021-09-15: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit
⚪ Unactionable alerts:
🔓 Change issues:
In Progress
Closed
- production#5278 (closed) - 2021-08-03: Resize PVCs for org-ci thanos-store instances
- production#5239 (closed) - 2021-07-28: Resize PVCs for ops thanos instance