On-Call Handover 2021-07-15 07:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @devin
- EOC ingress: @ahmadsherif
Summary:
We've got the Sidekiq situation under control for the moment, and we have a plan moving forward. The quickest way to get up to speed is to read this comment: production#5158 (comment 626570436). It's the most digestible piece of the huge amount of information out there, so I'd recommend reading it right away and everything else at your leisure.
After that, it's worth reading through the #rapid-action-sidekiq-incident and #incident-5158 channels in Slack, and their associated issues/epics. I'm not going to try to summarize them here, since they lay it all out much more thoroughly.
The short answer is that there are no urgent actions for the EMEA on-call to take.
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5162 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-praefect-02-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
- production#5134 (closed) - 2021-07-09: CI jobs using alpine 3.14 based images are failing
- production#5104 (closed) - 2021-07-06: Thanos unhappy
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
Resolved actionable alerts:
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- production#5161 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-43-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5158 (closed) - 2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (main stage) has an apdex violating SLO
- production#5157 (closed) - 2021-07-13: Alert Manager webhook integration failing
- production#5155 (closed) - 2021-07-13: Blackbox probe failures for docs.gitlab.com and next.gitlab.com
- production#5154 (closed) - 2021-07-13: Multiple thanos query front-end errors
- production#5152 (closed) - 2021-07-13: Increase in errors across multiple GitLab.com services
- production#5149 (closed) - 2021-07-13: registry service has an error rate violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5139
- production#5130 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-07-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5128 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-hdd-05-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5119 (closed) - 2021-07-08: MigrateMergeRequestDiffCommitUsers producing long-running transactions
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
- production#5066 (closed) - 2021-07-02: Apdex spike on Gitaly node file-praefect-02
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes
Change issues:
In Progress:
Closed:
- production#5146 (closed) - 2021-07-12: Use probe_jobs_limit for gitlab-exporter on Sidekiq
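For context on that change: gitlab-exporter's Sidekiq probe enumerates queued jobs to report per-worker counts, and a jobs limit caps how many jobs it inspects per queue so the probe stays cheap on Redis when queues back up, which is relevant to the redis-sidekiq apdex incident above. Below is a minimal, hypothetical Python sketch of that general idea only; it is not the gitlab-exporter implementation, and the Redis URL, queue name, and cap value are assumptions for illustration.

```python
# Illustrative sketch only -- not gitlab-exporter code. It shows the general idea
# behind capping a Sidekiq probe: inspect at most N queued jobs per queue instead
# of reading the whole list, so the probe stays cheap even on a backed-up queue.
import json
from collections import Counter

import redis  # pip install redis

PROBE_JOBS_LIMIT = 1_000                 # assumed cap, for illustration
REDIS_URL = "redis://localhost:6379/0"   # assumed address, for illustration


def probe_queue(r: redis.Redis, queue: str, limit: int = PROBE_JOBS_LIMIT) -> Counter:
    """Count jobs per worker class for one Sidekiq queue, reading at most `limit` jobs."""
    counts: Counter = Counter()
    # Sidekiq stores queued jobs as JSON blobs in a Redis list named "queue:<name>".
    for raw in r.lrange(f"queue:{queue}", 0, limit - 1):
        job = json.loads(raw)
        counts[job.get("class", "unknown")] += 1
    return counts


if __name__ == "__main__":
    client = redis.Redis.from_url(REDIS_URL)
    print(probe_queue(client, "default"))
```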