On-Call Handover 2021-07-24 15:00 UTC
Brought to you by the Slack slash command: /sre-oncall handover
- EOC egress: @alejandro
- EOC ingress: @cmcfarland
Summary:
What (if any) time-critical work is being handed over?
What contextual info may be useful for the next few on-call shifts?
Ongoing alerts/incidents:
- production#5218 (closed) - 2020-07-23: about.gitlab.com broken UI due to old CSS 404ing
- production#5212 (closed) - 2021-07-22: Service desk emails are not being processed
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5193
- production#5187 (closed) - 2021-07-19: The shard_memory_bound SLI of the sidekiq service (main stage) has not received any traffic in the past 30 minutes
- production#5169 (closed) - 2021-07-15: The goserver_op_service SLI of the gitaly service on node file-01-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5160 (closed) - 2021-07-14: Alertmanager is failing sending notifications
- production#5159 (closed) - 2021-07-14: goserver_op_service SLI of the gitaly service on file-42 has an apdex violating SLO
- production#5134 (closed) - 2021-07-09 - CI jobs using alpine 3.14 based images are failing
- production#5079 (closed) - 2021-07-05: Apdex dip on Gitaly node file-37 due to flood of UserMergeToRef calls from single project
- production#5062 (closed) - 2021-07-01: High disk usage by thanos-store persistent-volume-claim
Resolved actionable alerts:
Unactionable alerts:
Resolved production incidents:
Mitigated production incidents:
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5217
- production#5213 (closed) - 2021-07-22: Thanos unavailable
- production#5206 (closed) - 2021-07-21: The goserver_op_service SLI of the gitaly service on node file-59-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5203 (closed) - 2021-07-21: thanos query front end error rate is high
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5191
- production#5186 (closed) - 2021-07-19: The goserver SLI of the gitaly service on node file-04-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5182 (closed) - 2021-07-19: Intermittent QA test failures in staging
- production#5173 (closed) - 2021-07-15: Blackbox probes for https://pre.gitlab.com are failing
- production#5172 (closed) - 2021-07-15: The goserver_op_service SLI of the gitaly service on node file-27-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5168 (closed) - 2021-07-15: web-pages-{01,02} have empty chef run list
- https://gitlab.com/gitlab-com/gl-infra/production/-/issues/5165 - 2021-07-15: The nginx_ingress SLI of the api service in region us-east1-d has an error rate violating SLO
- production#5162 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-praefect-02-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5161 (closed) - 2021-07-14: The goserver SLI of the gitaly service on node file-43-stor-gprd.c.gitlab-production.internal has an apdex violating SLO
- production#5158 (closed) - 2021-07-13: The rails_redis_client SLI of the redis-sidekiq service (main stage) has an apdex violating SLO
- production#5157 (closed) - 2021-07-13: Alert Manager webhook integration failing
- production#5155 (closed) - 2021-07-13: Blackbox probe failures for docs.gitlab.com and next.gitlab.com
- production#5154 (closed) - 2021-07-13: Multiple thanos query front-end errors
- production#5152 (closed) - 2021-07-13: Increase in errors across multiple GitLab.com services
- production#5149 (closed) - 2021-07-13: registry service has an error rate violating SLO
- production#5117 (closed) - 2021-07-07: Increased API error rate on nginx-ingress in us-east1-d