On-Call Handover 2021-07-14 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @ahmadsherif
- EOC ingress: @nnelson
Summary:
Quiet shift all in all. The highlights are:
- An MR in a project is causing an apdex drop in the goserver_op_service component on file-42 (the overall apdex for the node is fine). Asked the Gitaly team for help, and a silence is in place.
- Alertmanager was flapping on webhook integration errors. This is being investigated in https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13604, and a silence is in place as well.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
- production#5134 (closed) - 2021-07-09 - CI jobs using alpine 3.14 based images are failing
- production#5104 (closed) - 2021-07-06: Thanos unhappy
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PFHVKP9 - [#51114] Firing 1 - The shard_memory_bound SLI of the sidekiq service (main stage) has not received any traffic in the past 30 minutes
- https://gitlab.pagerduty.com/incidents/P7600ZK - [#51119] Firing 1 - Alertmanager is failing sending notifications
- https://gitlab.pagerduty.com/incidents/PG2Z97R - [#51123] Firing 1 - Alertmanager is failing sending notifications
- https://gitlab.pagerduty.com/incidents/PW4XHFE - [#51127] Firing 1 - Alertmanager is failing sending notifications
Unactionable alerts:
Resolved production incidents:
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
Mitigated production incidents:
- production#5158 (closed) - 2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (main stage) has an apdex violating SLO
- production#5157 (closed) - 2021-07-13: Alert Manager webhook integration failing
- production#5155 (closed) - 2021-07-13: Blackbox probe failures for docs.gitlab.com and next.gitlab.com
- production#5154 (closed) - 2021-07-13: Multiple thanos query front-end errors
- production#5152 (closed) - 2021-07-13: Increase in errors across multiple GitLab.com services
- production#5149 (closed) - 2021-07-13: registry service has an error rate violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5139
- production#5130 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-07-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5128 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-hdd-05-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5119 (closed) - 2021-07-08: MigrateMergeRequestDiffCommitUsers producing long-running transactions
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
- production#5066 (closed) - 2021-07-02: Apdex spike on Gitaly node file-praefect-02
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes