On-Call Handover 2021-07-15 07:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @devin
- EOC ingress: @ahmadsherif
Summary:
We've got the Sidekiq situation under control for the moment, and we have a plan moving forward. The quickest way to get up to speed is to read this comment: production#5158 (comment 626570436). It's the most digestible piece of the huge amount of information out there, so I'd recommend reading it right away and everything else at your leisure.
After that, it's worth reading through the #rapid-action-sidekiq-incident and #incident-5158 channels in Slack, and their associated issues/epics. I'm not going to try to summarize them here, since they lay it all out much more thoroughly.
The short answer is that there are no urgent actions for the EMEA on-call to take.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5162 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-praefect-02-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
- production#5134 (closed) - 2021-07-09: CI jobs using alpine 3.14 based images are failing
- production#5104 (closed) - 2021-07-06: Thanos unhappy
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
Resolved actionable alerts:
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- production#5161 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-43-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5158 (closed) - 2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (main stage) has an apdex violating SLO
- production#5157 (closed) - 2021-07-13: Alert Manager webhook integration failing
- production#5155 (closed) - 2021-07-13: Blackbox probe failures for docs.gitlab.com and next.gitlab.com
- production#5154 (closed) - 2021-07-13: Multiple thanos query front-end errors
- production#5152 (closed) - 2021-07-13: Increase in errors across multiple GitLab.com services
- production#5149 (closed) - 2021-07-13: registry service has an error rate violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5139
- production#5130 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-07-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5128 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-hdd-05-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5119 (closed) - 2021-07-08: MigrateMergeRequestDiffCommitUsers producing long-running transactions
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
- production#5066 (closed) - 2021-07-02: Apdex spike on Gitaly node file-praefect-02
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes
Change issues:
In Progress:
Closed:
- production#5146 (closed) - 2021-07-12: Use probe_jobs_limit for gitlab-exporter on Sidekiq
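For context on that change: gitlab-exporter's Sidekiq probe enumerates queued jobs to report per-worker counts, and a jobs limit caps how many jobs it inspects per queue so the probe stays cheap on Redis when queues back up, which is relevant to the redis-sidekiq apdex incident above. Below is a minimal, hypothetical Python sketch of that general idea only; it is not the gitlab-exporter implementation, and the Redis URL, queue name, and cap value are assumptions for illustration.

```python
# Illustrative sketch only -- not gitlab-exporter code. It shows the general idea
# behind capping a Sidekiq probe: inspect at most N queued jobs per queue instead
# of reading the whole list, so the probe stays cheap even on a backed-up queue.
import json
from collections import Counter

import redis  # pip install redis

PROBE_JOBS_LIMIT = 1_000                 # assumed cap, for illustration
REDIS_URL = "redis://localhost:6379/0"   # assumed address, for illustration


def probe_queue(r: redis.Redis, queue: str, limit: int = PROBE_JOBS_LIMIT) -> Counter:
    """Count jobs per worker class for one Sidekiq queue, reading at most `limit` jobs."""
    counts: Counter = Counter()
    # Sidekiq stores queued jobs as JSON blobs in a Redis list named "queue:<name>".
    for raw in r.lrange(f"queue:{queue}", 0, limit - 1):
        job = json.loads(raw)
        counts[job.get("class", "unknown")] += 1
    return counts


if __name__ == "__main__":
    client = redis.Redis.from_url(REDIS_URL)
    print(probe_queue(client, "default"))
```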