On-Call Handover 2021-07-13 07:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @devin
- EOC ingress: @ahmadsherif
Summary:
- The big one for today: production#5148 (closed) - Sidekiq fell far behind and saturated Redis, and it took about 3 hours to drain and process the backlog. The root cause is still undetermined, but at this point a network interruption looks most likely. It could happen again, so it's worth reading through the issue and being prepared (see the sketch after this list).
- The SLI for ci-runners is still a little touchy because the earlier Sidekiq incident ate up so much of our SLO budget. It should calm down; the spikes all look like normal behavior, they're just tripping alerts more easily while the error budget is depleted.
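If the Sidekiq backlog recurs, one quick way to gauge queue depth is to ask Redis directly: Sidekiq keeps each queue as a Redis list named `queue:<name>` and the set of queue names under `queues`. Below is a minimal sketch in Python, assuming the `redis` client library and network access to the redis-sidekiq primary; the hostname and port are placeholders, not the production endpoint.

```python
# Minimal sketch: report Sidekiq queue backlogs straight from Redis.
# Assumes the `redis` Python package and a reachable redis-sidekiq
# primary; host/port below are placeholders.
import redis

r = redis.Redis(host="redis-sidekiq.example.internal", port=6379)

# Sidekiq stores queue names in the "queues" set and each queue's
# pending jobs in a list named "queue:<name>".
queues = sorted(q.decode() for q in r.smembers("queues"))

total = 0
for name in queues:
    depth = r.llen(f"queue:{name}")
    total += depth
    if depth:
        print(f"{name}: {depth} queued jobs")

print(f"total backlog: {total}")
```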
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5134 (closed) - 2021-07-09: CI jobs using alpine 3.14 based images are failing
- production#5104 (closed) - 2021-07-06: Thanos unhappy
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
- https://gitlab.pagerduty.com/incidents/PJ885VI - [#50474] Firing 1 - The Kube Persistent Volume Claim inode Utilisation resource of the kube service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/P9CZXWY - [#50517] Firing 1 - GitLab Job has failed
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PUEWRZ4 - [#50440] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/P4OAH8G - [#50457] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/PY7JU4S - [#50469] Firing 1 - Large amount of Sidekiq Queued jobs
- https://gitlab.pagerduty.com/incidents/PU6JXYT - [#50472] Firing 1 - The Redis Primary CPU Utilization per Node resource of the redis-sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/P3WBB1H - [#50480] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/PTSCQE4 - [#50512] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- production#5144 (closed) - 2021-07-12: The goserver_op_service SLI of the gitaly service on node file-11-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5139
- production#5130 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-07-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5128 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-hdd-05-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5119 (closed) - 2021-07-08: MigrateMergeRequestDiffCommitUsers producing long-running transactions
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
- production#5066 (closed) - 2021-07-02: Apdex spike on Gitaly node file-praefect-02
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes
- production#5047 (closed) - 2021-06-30: Postgres slowdown due to overloading of slow requests
- production#5045 (closed) - 2021-06-30: Brief increase in error rates in the API services