On-Call Handover 2021-07-20 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @alejandro
- EOC ingress: @cmcfarland
Summary:
Only alert was due to an expired silence for production#5062 (closed). Otherwise completely silent! Sweet!
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5189 (closed) - 2021-07-19: Alertmanager Notifications Failing
- production#5187 (closed) - 2021-07-19: The shard_memory_bound SLI of the sidekiq service (main stage) has not received any traffic in the past 30 minutes
- production#5169 (closed) - 2021-07-15: The goserver_op_service SLI of the gitaly service on node file-01-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
- production#5134 (closed) - 2021-07-09 - CI jobs using alpine 3.14 based images are failing
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
- https://gitlab.pagerduty.com/incidents/PN8OCB0 - [#52768] Firing 1 - The Kube Persistent Volume Claim Space Utilisation resource of the kube service (main stage), component has a saturation exceeding SLO and is close to its capacity limit.
Resolved actionable alerts:
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- production#5186 (closed) - 2021-07-19: The goserver SLI of the gitaly service on node file-04-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5182 (closed) - 2021-07-19: Intermittent QA test failures in staging
- production#5173 (closed) - 2021-07-15: Blackbox probes for https://pre.gitlab.com are failing
- production#5172 (closed) - 2021-07-15: The goserver_op_service SLI of the gitaly service on node file-27-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5168 (closed) - 2021-07-15: web-pages-{01,02} have empty chef run list
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5165 - 2021-07-15: The nginx_ingress SLI of the api service in region us-east1-d has an error rate violating SLO
- production#5162 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-praefect-02-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5161 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-43-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5158 (closed) - 2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (main stage) has an apdex violating SLO
- production#5157 (closed) - 2021-07-13: Alert Manager webhook integration failing
- production#5155 (closed) - 2021-07-13: Blackbox probe failures for docs.gitlab.com and next.gitlab.com
- production#5154 (closed) - 2021-07-13: Multiple thanos query front-end errors
- production#5152 (closed) - 2021-07-13: Increase in errors across multiple GitLab.com services
- production#5149 (closed) - 2021-07-13: registry service has an error rate violating SLO
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d
- production#5072 (closed) - 2021-07-04: Queue delay for sidekiq memory-bound shard
- production#5070 (closed) - 2021-07-04: goserver_op_service abnormal Apdex on file-48
- production#5069 (closed) - 2021-07-03: Multiple apdex alerts possibly related to postgres
- production#5068 (closed) - 2021-07-02: Intermittent "Internal API unreachable" errors from Gitaly nodes
- production#5067 (closed) - 2021-07-02: Apdex dip for OpService on Gitaly node file-37
Change issues:
In Progress
Closed
- production#5131 (closed) - 2021-09-12: Update VM image for private and gitlab-org runners
- production#5077 (closed) - Update ops VM to Ubuntu 20.04