On-Call Handover 2021-10-31 15:00 UTC

On-Call Handover

Brought to you by the Slack slash command: /sre-oncall handover

EOC egress: @igorwwwwwwwwwwwwwwwwwwww
EOC ingress: @T4cC0re

📖 Summary:

What (if any) time-critical work is being handed over?

What contextual info may be useful for the next few on-call shifts?

🔴 Ongoing alerts/incidents:

production#5834 (closed) - 2021-10-31: SSLCertExpiresVerySoon - runners prometheus certs expiring in 7 days
production#5833 (closed) - 2021-10-29: release.gitlab.net was briefly unavailable
production#5817 (closed) - 2021-10-27: Thanos persistent volume slowly filling up
production#5757 (closed) - 2021-10-18: GitLab.com notifications delayed
production#5754 (closed) - 2021-10-19: The fluentd_log_output SLI of the logging service (main stage) has an error rate violating SLO
production#5745 (closed) - 2021-10-18: Sidekiq delays for urgent-other shard
production#5730 (closed) - 2021-10-15: The server_route_manifest_writes SLI of the registry service in region us-east1-c has an apdex violating SLO
production#5657 (closed) - 2021-10-06: Some interrupted Sidekiq jobs going missing
https://gitlab.pagerduty.com/incidents/Q0G6FX1NWU6C59 - [#61039] Firing 2 - SSLCertExpiresVerySoon

✅ Resolved alerts/incidents:

production#5834 (closed) - 2021-10-31: SSLCertExpiresVerySoon - runners prometheus certs expiring in 7 days

🔵 Mitigated incidents:

production#5832 (closed) - 2021-10-29: Unexplained frontend SLO alert
production#5829 (closed) - 2021-10-28: Elasticsearch metric stopped being scraped
production#5819 (closed) - 2021-10-27: GSTG Deploy failure due to Sidekiq Ruby Failure
production#5812 (closed) - 2021-10-27: Containers for the git service, main are unable to start
production#5809 (closed) - 2021-10-27: Long-running transactions detected on Patroni
production#5808 (closed) - 2021-10-26: Repository mirror update delays
production#5804 (closed) - 2021-10-26: The goserver SLI of the gitaly service on node file-32-stor-gprd.c.gitlab-production.internal has an error rate violating SLO
production#5797 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit
production#5793 (closed) - 2021-10-25: The HPA Desired Replicas resource of the sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
production#5773 (closed) - 2021-10-20: KubeServiceClusterScaleupsErrorSLOViolation
production#5738 (closed) - 2021-10-18: Gitaly is down on file-58-stor-gprd.c.gitlab-production.internal
production#5737 (closed) - 2021-10-17: Long db transactions from sidekiq job Ci::CreateDownstreamPipelineWorker
production#5736 (closed) - 2021-10-17: 2021-10-17 thanos is restarting frequently
production#5734 (closed) - 2021-10-16: Transactions detected that have been running on patroni-v12-05-db-gprd.c.gitlab-production.internal for more than 10m
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5723
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5720
production#5712 (closed) - 2021-10-13: QA smoke test failing on gstg
production#5709 (closed) - 2021-10-12: Apdex drop for Gitaly node file-43
production#5696 (closed) - 2021-10-11: The cluster_scaleups SLI of the kube service (main stage) has an error rate violating SLO
https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5692

⚪ Unactionable alerts:

🔓 Change issues:

In Progress

Closed