On-Call Handover 2021-07-13 07:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @devin
- EOC ingress: @ahmadsherif
Summary:
- The big one for today: production#5148 (closed) - Sidekiq fell far behind and saturated Redis, and it took about 3 hours to drain and process the backlog. The root cause is still undetermined, but at this point a network interruption looks most likely. It could happen again, so it's worth reading through the issue and being prepared (see the sketch after this list).
- The SLI for ci-runners is still a little touchy because the earlier Sidekiq incident ate up so much of our SLO budget. It should calm down; the spikes all look like normal behavior, they're just tripping alerts more easily while the error budget is depleted.
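If the Sidekiq backlog recurs, one quick way to gauge queue depth is to ask Redis directly: Sidekiq keeps each queue as a Redis list named `queue:<name>` and the set of queue names under `queues`. Below is a minimal sketch in Python, assuming the `redis` client library and network access to the redis-sidekiq primary; the hostname and port are placeholders, not the production endpoint.

```python
# Minimal sketch: report Sidekiq queue backlogs straight from Redis.
# Assumes the `redis` Python package and a reachable redis-sidekiq
# primary; host/port below are placeholders.
import redis

r = redis.Redis(host="redis-sidekiq.example.internal", port=6379)

# Sidekiq stores queue names in the "queues" set and each queue's
# pending jobs in a list named "queue:<name>".
queues = sorted(q.decode() for q in r.smembers("queues"))

total = 0
for name in queues:
    depth = r.llen(f"queue:{name}")
    total += depth
    if depth:
        print(f"{name}: {depth} queued jobs")

print(f"total backlog: {total}")
```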
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5134 (closed) - 2021-07-09: CI jobs using alpine 3.14 based images are failing
- production#5104 (closed) - 2021-07-06: Thanos unhappy
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
- https://gitlab.pagerduty.com/incidents/PJ885VI - [#50474] Firing 1 - The Kube Persistent Volume Claim inode Utilisation resource of the kube service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/P9CZXWY - [#50517] Firing 1 - GitLab Job has failed
Resolved actionable alerts:
- https://gitlab.pagerduty.com/incidents/PUEWRZ4 - [#50440] Firing 1 - The grafana SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/P4OAH8G - [#50457] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/PY7JU4S - [#50469] Firing 1 - Large amount of Sidekiq Queued jobs
- https://gitlab.pagerduty.com/incidents/PU6JXYT - [#50472] Firing 1 - The Redis Primary CPU Utilization per Node resource of the redis-sidekiq service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
- https://gitlab.pagerduty.com/incidents/P3WBB1H - [#50480] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
- https://gitlab.pagerduty.com/incidents/PTSCQE4 - [#50512] Firing 1 - The thanos_query_frontend SLI of the monitoring service (main stage) has an error rate violating SLO
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- production#5144 (closed) - 2021-07-12: The goserver_op_service SLI of the gitaly service on node file-11-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5139
- production#5130 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-07-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5128 (closed) - 2021-07-09: The goserver SLI of the gitaly service on node file-hdd-05-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5119 (closed) - 2021-07-08: MigrateMergeRequestDiffCommitUsers producing long-running transactions
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
- production#5066 (closed) - 2021-07-02: Apdex spike on Gitaly node file-praefect-02
- production#5057 (closed) - 2021-07-01: Apdex and error rate spikes on 3 praefect-fronted gitaly nodes
- production#5047 (closed) - 2021-06-30: Postgres slowdown due to overloading of slow requests
- production#5045 (closed) - 2021-06-30: Brief increase in error rates in the API services